Podcasts about Apache Airflow

head mastering big data falamos caracter astronomers business intelligence abaixo reduzir data engineering airflow customer education lamberti apache airflow

Play Episode Listen Later Mar 17, 2023 45:51

No episódio de hoje, Luan Moreno e Mateus Oliveira entrevistaram Marc Lamberti, atualmente como Head of Customer na Astronomer.Vamos entender o Apache Airflow em um nível mais avançado para aplicar soluções para problemas de dados no dia-a-dia, visando melhores práticas. Abaixo alguns pontos sobre técnicas avançadas de Apache Airflow:Executar pipelines do Airflow com acesso imediato aos recursos mais recentes. Reduzir o consumo de infraestrutura para tarefas de longa duração.Reduzir latência de tarefa com configuração e dimensionamento automático.Coletar metadados automaticamente por meio do Open Lineage integrado.Falamos também nesse bate-papo sobre os seguintes temas:Características do Apache AirflowAstro Python SDKDynamic TaskAstro CloudApache Airflow vs. Perfect vs. MageAprenda como utilizar o Apache Airflow em um nível mais avançado para orquestrar o seus data pipelines.Marc LambertiMarc Youtube ChannelEngenharia de Dados Academy Luan Moreno = https://www.linkedin.com/in/luanmoreno/

Change Data Capture and Managed Airflow in Azure Data Factory

Azure Friday (HD) - Channel 9

Play Episode Listen Later Mar 17, 2023

Mark Kromer and Abhishek Narain join Scott Hanselman to talk about two new capabilities in Azure Data Factory: Change Data Capture (CDC) and Managed Airflow. Change Data Capture in Azure Data Factory automatically detects data changes at the source without requiring complex designing or coding. Managed Airflow in Azure Data Factory is a managed orchestration service for Apache Airflow that simplifies the creation and management of Airflow environments on which you can operate end-to-end data pipelines at scale. Chapters 00:00 - Introduction 00:40 - Change Data Capture 01:17 - CDC demo 07:59 - Managed Airflow 09:06 - Managed Airflow demo 14:33 - Wrap-up Recommended resources Change data capture resource overview How to capture changed data from ADLS Gen2 to Azure SQL DB using a Change Data Capture (CDC) resource What is Azure Data Factory Managed Airflow? How does Azure Data Factory Managed Airflow work? Azure Data Factory Create a Pay-as-You-Go account (Azure) Create a free account (Azure) Connect Scott Hanselman | Twitter: @SHanselman Mark Kromer | Twitter: @KromerBigData Abhishek Narain | Twitter: @NarainAbhishek Azure Data Studio | Twitter: @AzDataFactory Azure SQL | Twitter: @AzureSQL Azure Friday | Twitter: @AzureFriday Azure | Twitter: @Azure

change data wrap cdc chapters factory capture recommended managed azure airflow scott hanselman apache airflow azure sql db

Change Data Capture and Managed Airflow in Azure Data Factory

Azure Friday (Audio) - Channel 9

Play Episode Listen Later Mar 17, 2023

Mark Kromer and Abhishek Narain join Scott Hanselman to talk about two new capabilities in Azure Data Factory: Change Data Capture (CDC) and Managed Airflow. Change Data Capture in Azure Data Factory automatically detects data changes at the source without requiring complex designing or coding. Managed Airflow in Azure Data Factory is a managed orchestration service for Apache Airflow that simplifies the creation and management of Airflow environments on which you can operate end-to-end data pipelines at scale. Chapters 00:00 - Introduction 00:40 - Change Data Capture 01:17 - CDC demo 07:59 - Managed Airflow 09:06 - Managed Airflow demo 14:33 - Wrap-up Recommended resources Change data capture resource overview How to capture changed data from ADLS Gen2 to Azure SQL DB using a Change Data Capture (CDC) resource What is Azure Data Factory Managed Airflow? How does Azure Data Factory Managed Airflow work? Azure Data Factory Create a Pay-as-You-Go account (Azure) Create a free account (Azure) Connect Scott Hanselman | Twitter: @SHanselman Mark Kromer | Twitter: @KromerBigData Abhishek Narain | Twitter: @NarainAbhishek Azure Data Studio | Twitter: @AzDataFactory Azure SQL | Twitter: @AzureSQL Azure Friday | Twitter: @AzureFriday Azure | Twitter: @Azure

change data wrap cdc chapters factory capture recommended managed azure airflow scott hanselman apache airflow azure sql db

ETL no Airflow de Forma Inteligente e Escalável usando Astro Python SDK com Tatiana Martins, Staff Software Engineer na Astronomer

large projects python orchestrating apache airflow

Play Episode Listen Later Feb 15, 2023 81:52

No episódio de hoje, Luan Moreno e Mateus Oliveira entrevistaram Tatiana Al-Chueyr Martins, atualmente como Engenheira de Software na Astronomer. O Astro Python SDK é um SDK desenvolvido em Python Open-Source criado pela Astronomer, empresa que acelera o Apache Airflow, para tornar simples o processo de ETL. Astro Python SDK oferece os seguintes benefícios:Operações de ETL com Operadores para Abstração de ComplexidadeCarga de Dados de Forma Escalável e Eficiente (Native Transfers)Transformações Utilizando SQL & DataFramesEntrega dos Dados nos Principais Data Warehouses ModernosOperações Dinâmicas e EscaláveisFalamos também nesse bate-papo sobre os seguintes temas:Apache AirflowAstronomerAstro CloudAprenda como o Astro Python SDK pode de fato mudar a forma com que seu time cria e desenvolve pipelines de ETL dentro do Apache Airflow.Tatiana Al-Chueyr MartinsAstro Python SDKAstronomer Luan Moreno = https://www.linkedin.com/in/luanmoreno/

Orchestrating Large and Small Projects With Apache Airflow

The Real Python Podcast

Play Episode Listen Later Jan 27, 2023 54:24

Have you worked on a project that needed an orchestration tool? How do you define the workflow of an entire data pipeline or a messaging system with Python? This week on the show, Calvin Hendryx-Parker is back to talk about using Apache Airflow and orchestrating Python projects.

195: The Cloud Pod can't wait for Azure Ultra Fungible Storage (Premium)!

The Cloud Pod

Play Episode Listen Later Jan 20, 2023 48:49

On The Cloud Pod this week, Amazon announces massive corporate and tech lay offs and S3 Encrypts New Objects By Default, BigQuery multi-statement transactions are now generally available, and Microsoft announces acquisition of Fungible to accelerate datacenter innovation. Thank you to our sponsor, Foghorn Consulting, which provides top notch cloud and DevOps engineers to the world's most innovative companies. Initiatives stalled because you're having trouble hiring? Foghorn can be burning down your DevOps and Cloud backlogs as soon as next week. General News: Amazon to lay off 18,000 corporate and tech workers. [1:11] Episode Highlights ⏰ Amazon S3 Encrypts New Objects By Default. [3:09] ⏰ Announcing the GA of BigQuery multi-statement transactions. [13:04] ⏰ Microsoft announces acquisition of Fungible to accelerate datacenter innovation. [17:14] Top Quote

amazon leader microsoft public ga cloud picking slack backup salesforce storage python initiatives devops azure s3 fungible tcp foghorn bigquery apache spark vpc health insurance portability gartner magic quadrant accountability act hipaa apache airflow amazon eks aws secrets manager cloud pod foghorn consulting

A Day in a Life of a Director of Airflow Engineering with Kaxil Naik at Astronomer

director engineering neste voc big data open source python astro data analytics astronomers principais workflows engenharia naik reduzir data engineering data engineers airflow apache airflow

Play Episode Listen Later Dec 29, 2022 68:09

Neste episódio vamos conhecer o Director of Apache Airflow Engineering da empresa Astronomer, Kaxil Naik.Kaxil Naik vai nos dar uma visão mais apurada sobre o Apache Airflow e os produtos da Astronomer, como desenvolvedor e PMC Committer, apaixonado por produtos Open-Source.O Astro produto da Astronomer oferece os seguintes benefícios:Executar pipelines do Airflow com acesso imediato aos recursos mais recentes.Reduzir o consumo de infraestrutura para tarefas de longa duração.Reduzir latência de tarefa com configuração e dimensionamento automático.Coletar metadados automaticamente por meio do Open Lineage integrado.Além disto vamos falar sobre:Apache Airflow em geral e novidades.Principais casos de uso.Python como linguagem franca.Você vai ouvir isso e muitas outras experiências das trincheiras, trocadas com Luan Moreno e Mateus Oliveira, aqui, no nosso Engenharia de Dados Cast.Kaxil NaikAstronomer Luan Moreno = https://www.linkedin.com/in/luanmoreno/

Data Engineering Course in Tamil Online Catchup #4 | Tamilboomi Online classes

Arumugam's Podcast

Play Episode Listen Later Dec 11, 2022 88:45

In this Live discussion, We had many interesting discussions, we spoke about spark hive interview questions. real-time scenario. How ETL works, Difference between oracle vs Bigdata. Apache Airflow, oozie schedulers. Bigdata Cluster, Spark Executors..Etc --------- Notes: Followup --------- S3 Data locality python runs on jvm Lamda ORC vs parquat https://parquet.apache.org/docs/file-format/ Sorting,Compress & Uncompress https://forum.huawei.com/enterprise/en/orc-vs-parquet/thread/904477-893 --------- If you are interested to learn about new technologies & careers, you'll like our newsletter & Content here. Visit & Sign up : www.tamilboomi.com --------- We offer Online Classes for Cloud Devops & Data Engineering. --------- You can reach out to us & join the group for discussions: --------- Insta: https://www.instagram.com/tamilboomitechnologies/ WhatsApp Group for Discussions: https://chat.whatsapp.com/LuwXgVza8B3EaFmXwkKSwq Whatsapp number: +91 9619663272 Twitter : https://twitter.com/TamilboomiT --------- We Talk about Life, Motivation, and Technology in Tamil and English. New Episodes Weekly twice (Tuesday & Friday). --------- We have three shows : அடிச்சாண்டா Appoinment Orderu : Talks about Career and Entrupreunership பொதுவாச் சொன்னேன்: Talks about General things which we want to share Tamilboomi online Course: Content related to technology and online live courses ---------- Want to appear in our shows or want to contribute? Feel free to reach us! Share and Enjoy! --- Send in a voice message: https://anchor.fm/tamilboomi/message

live english technology career motivation big data sorting tamil online classes data engineering whatsapp group compress apache airflow cloud devops

Daten Orchestrierung im Data Engineering - Data Orchestration

Der Data Analytics Podcast

Play Episode Listen Later Nov 15, 2022 3:17

Als Data Engineer ist es erforderlich die einzelnen Prozessschritte der Datenreise zu orchestrieren. In diesem kurzen Podcast gehe ich auf das Konzept und die Vorteile der Verwendung einer solchen Software ein. Data Orchestration und Apache Airflow als bekannte Softwarelösung

software konzept vorteile daten verwendung orchestration data engineering softwarel orchestrierung apache airflow prozessschritte

The Creator of Airflow About His Recipe for Smart Data-Driven Companies

The Data Engineering Show

Play Episode Listen Later Aug 3, 2022 45:56

According to Maxime Beauchemin, CEO & Founder at Preset and Creator of Apache Superset and Apache Airflow, building a thriving company is not so straight-forward. So how did he do it? Choosing the right system and services is key for a successful start, and can help you avoid the chaos of having too many tools spread across multiple teams. Max walks the Bros through his recipe for a smart data-driven company, and the genesis of Airflow, Superset & Presto (with some great tidbits about Airflow's old school marketing approach and how the open source platform took on a life of its own).

ceo founders creator data companies recipes analytics bros data driven presto data engineering airflow supersets preset smart data apache airflow maxime beauchemin

How Preset Built a Data-Driven Organization from the Ground Up

The Data Engineering Show

Play Episode Listen Later Aug 3, 2022 45:56

According to Maxime Beauchemin, CEO & Founder at Preset and Creator of Apache Superset and Apache Airflow, it's not so straight-forward to understand what you're really getting into and the vastness of the skills that are required in order to build a thriving company.Picking the right system and services is key for a successful start, and can help you avoid the chaos of having too many tools spread across multiple teams.Plus, Max walks the bros through the genesis of Airflow, Superset & Presto, and Airflow's old school marketing approach that won the hearts of developers across the world. And just like the terminator, once the machine takes over, you can't stop.

ceo founders creator data built picking analytics data driven ground up presto data engineering airflow supersets preset apache airflow maxime beauchemin

Episode 95: Open-Source DataOps, Building In Public, and Remote Work Culture with Douwe Maan

Play Episode Listen Later Jul 1, 2022 73:11

Show Notes(01:46) Douwe went over formative experiences catching the programming virus at the age of 9, combining high school with freelance web development, and studying Computer Science at Utrecht University in college.(03:55) Douwe shared the story behind founding a startup called Stinngo, which led him to join GitLab in 2015 as employee number 10.(05:29) Douwe provided insights on attributes of exceptional engineering talent, given his time hiring developers and eventually becoming GitLab's first Development Lead.(08:28) Douwe unpacked the evolution of his engineering career at GitLab.(11:11) Douwe discussed the motivation behind the creation of the Meltano project in August 2018 to help GitLab's internal data team address the gaps that prevent them from understanding the effectiveness of business operations.(14:38) Douwe reflected on his decision in 2019 to leave GitLab's engineering organization and join the then 5-people Meltano team full-time.(20:24) Douwe shared the details about Meltano's product development journey from its Version 1 to its pivot.(26:18) Douwe reflected on the mental aspect of being the sole person whom Meltano depended on for a while.(29:20) Douwe explained the positioning of Meltano as an open-source self-hosted platform for running data integration and transformation pipelines.(34:54) Douwe shared details of Meltano's ideal customer profiles.(37:45) Douwe provided a quick tour of the Meltano project, which represents the single source of truth regarding one's ELT pipelines: how data should be integrated and transformed, how the pipelines should be orchestrated, and how the various plugins that make up the pipelines should be configured.(40:39) Douwe unpacked different components of Meltano's product strategy, including Meltano SDK, Meltano Hub, and Meltano Labs.(45:05) Douwe discussed prioritizing Meltano's product roadmap in order to bring DataOps functionality to every step of the entire data lifecycle.(48:53) Douwe shared the story behind spinning Meltano out of GitLab in June 2021 and raising a $4.2M Seed funding round led by GV to bring the benefits of open source data integration and DataOps to a wider audience.(52:19) Douwe provided his thoughts behind open-source contributors in a way that can generate valuable product feedback for Meltano.(55:43) Douwe shared valuable hiring lessons to attract the right people who align with Meltano's values.(59:04) Douwe shared advice to startup CEOs who are experimenting with the remote work culture in our “new-normal” virtual working environments.(01:04:10) Douwe unpacked Meltano's mission and vision as outlined in this blog post.(01:06:40) Closing segment.Douwe's Contact InfoGitLabLinkedInTwitterGitHubWebsiteMeltano's ResourcesWebsite | Twitter | LinkedIn | GitHub | YouTubeMeltano Documentation | Product | DataOpsMeltano SDK | Meltano Hub | Meltano LabsCompany Handbook | Community | Values | CareersMentioned ContentArticlesHey, data teams - We're working on a tool just for you (Aug 2018)To-do zero, inbox zero, calendar zero: I think that means I'm done (Sep 2019)Meltano graduates to Version 1.0 (Oct 2019)Revisiting the Meltano strategy: a return to our roots (May 2020)Why we are building an open-source platform for ELT pipelines (May 2020)Meltano spins out of GitLab, raises seed funding to bring data integration into the DataOps era (June 2021)Meltano: The strategic foundation of the ideal data stack (Oct 2021)Introducing your DataOps platform infrastructure: Our strategy for the future of data (Nov 2021)Our next step for building the infrastructure for your Modern Data Stack (Dec 2021)PeopleMaxime Beauchemin (Founder and CEO of Preset, Creator of Apache Airflow and Apache Superset, Angel Investor in Meltano)Benn Stancil (Chief Analytics Officer at Mode Analytics, Well-Known Substack Writer)The entire team at dbt LabsNotesMy conversation with Douwe was recorded back in November 2021. Since then, many things have happened at Meltano. I'd recommend:Checking out their updated company valuesReading Douwe's article about the DataOps Operating System on The New StackExamining Douwe's blog post about moving Meltano to GitHubLooking over the announcement of Meltano 2.0 and the additional seed fundingAbout the showDatacast features long-form, in-depth conversations with practitioners and researchers in the data community to walk through their professional journeys and unpack the lessons learned along the way. I invite guests coming from a wide range of career paths — from scientists and analysts to founders and investors — to analyze the case for using data in the real world and extract their mental models (“the WHY and the HOW”) behind their pursuits. Hopefully, these conversations can serve as valuable tools for early-stage data professionals as they navigate their own careers in the exciting data universe.Datacast is produced and edited by James Le. Get in touch with feedback or guest suggestions by emailing khanhle.1013@gmail.com.Subscribe by searching for Datacast wherever you get podcasts or click one of the links below:Listen on SpotifyListen on Apple PodcastsListen on Google PodcastsIf you're new, see the podcast homepage for the most recent episodes to listen to, or browse the full guest list.

MLOps + BI? // Maxime Beauchemin // MLOps Coffee Sessions #104

MLOps.community

Play Episode Listen Later Jun 24, 2022 51:50

MLOps Coffee Sessions #104 with the creator of Apache Airflow and Apache Superset Maxime Beauchemin, Future of BI co-hosted by Vishnu Rachakonda. // Abstract // Bio Maxime Beauchemin is the founder and CEO of Preset. Original creator of Apache Superset. Max has worked at the leading edge of data and analytics his entire career, helping shape the discipline in influential roles at data-dependent companies like Yahoo!, Lyft, Airbnb, Facebook, and Ubisoft. // MLOps Jobs board https://mlops.pallet.xyz/jobs MLOps Swag/Merch https://www.printful.com/ // Related Links Website: https://www.rungalileo.io/ Trade-Off: Why Some Things Catch On, and Others book by Kevin Maney: https://www.amazon.com/Trade-Off-Some-Things-Catch-Others/dp/0385525958 --------------- ✌️Connect With Us ✌️ ------------- Join our slack community: https://go.mlops.community/slack Follow us on Twitter: @mlopscommunity Sign up for the next meetup: https://go.mlops.community/register Catch all episodes, blogs, newsletters, and more: https://mlops.community/ Connect with Demetrios on LinkedIn: https://www.linkedin.com/in/dpbrinkm/ Connect with Vishnu on LinkedIn: https://www.linkedin.com/in/vrachakonda/ Connect with Max on LinkedIn: https://www.linkedin.com/in/maximebeauchemin/ Timestamps: [00:00] Introduction to Maxime Beauchemin [01:28] Takeaways [03:42] Paradigm of data warehouse [06:38] Entity-centric data modeling [11:33] Metadata for metadata [14:24] Problem of data organization for a rapidly scaling organization [18:36] Machine Learning tooling as a subset or of its own [22:28] Airflow: The unsung hero of the data scientists [27:15] Analyzing Airflow [30:44] Disrupting the field [34:45] Solutions to the ladder problem of empowering exploratory work and mortals superpowers with data [38:04] What to watch out for when building for data scientists [41:47] Rapid fire questions [51:12] Wrap up

ceo future coffee wrap original airbnb takeaways yahoo rapid machine learning lyft bi ubisoft disrupting paradigm entity metadata vishnu preset demetrios kevin maney apache airflow connect with us join maxime beauchemin

Azure's Nightmare Year

AWS Morning Brief

Play Episode Listen Later Jun 9, 2022 5:00 Very Popular

Links: Nick Jones' review of the AWS Security Model I linked to previously. Microsoft Azure has seen 6 'nightmare' cloud security flaws over the past year. Unsecured Elasticsearch Data Replaced with Ransom Note AWS Systems Manager announces support for port forwarding to remote hosts using Session Manager When and where to use IAM permissions boundaries Security vulnerability in AWS's Managed Workflows for Apache Airflow

amazon cloud nightmare i am aws devops azure microsoft azure apache airflow last week in aws

Programmers Quickie

Play Episode Listen Later Jun 2, 2022 5:10

The Apache Airflow project is an orchestration project for learning jobs offline managing them it has a web UI it has a scheduler.

ui apache airflow

Episode 91: Collaborative Data Workspace, The Sharing Gap, and Engineering Management with Caitlin Colgrove

Play Episode Listen Later May 13, 2022 65:12

Show Notes(01:37) Caitlin went over her college experience studying Computer Science at Stanford University in the early 2010s.(03:55) Caitlin talked about her teaching experience for CS 106A and CS 103.(07:09) Caitlin shared valuable lessons from completing software engineering internships at Harvard University, Facebook, and Palantir.(10:06) Caitlin walked over technical and organizational challenges during her time at Palantir — building products for both government/commercial customers and working with designers/infrastructure engineers to deliver full-stack applications to the field.(12:01) Caitlin explained why Palantir is composed of “loosely individual startups.”(14:56) Caitlin recalled learning curves during her transition to a tech lead role at Palantir — becoming responsible for the technical architecture and code quality of the product, mentorship and growth of the engineers, and the product direction and prioritization of features.(18:31) Caitlin discussed her time as a Data Engineering Manager at Remix Technologies — leading the team that builds geospatial data pipelines on top of AWS, Postgres/PostGIS, and Apache Airflow.(24:45) Caitlin reflected on valuable leadership and people management lessons absorbed during her transition to growing and developing diverse and inclusive engineering teams.(29:05) Caitlin shared the founding story of Hex, the modern data workspace for teams, alongside her co-founders Barry and Glen.(32:58) Caitlin talked about Hex's ideal users (the “analytically technical” who need better tools to access and manage more sophisticated workflows) and introduced Hex's Logic View.(35:22) Caitlin examined the collaboration challenges in data teams and revealed Hex's Library to address some of the shortcomings.(39:59) Caitlin shared her thoughts on the evolution of data science notebooks.(42:14) Caitlin unpacked the nuanced problem of justifying data ROI to functional stakeholders and described Hex's interactive App Builder.(45:17) Caitlin shared exciting development in the horizon of Hex's product roadmap.(46:37) Caitlin shared valuable hiring lessons to attract the right people who are excited about Hex's mission.(52:10) Caitlin shared the hurdles to find the early design partners and lighthouse customers of Hex.(56:01) Caitlin shared upcoming go-to-market initiatives that she's most excited about for Hex.(58:24) Caitlin shared fundraising advice for founders currently seeking the right investors for their startups.(01:01:42) Closing segment.Caitlin's Contact InfoLinkedInTwitterHex's ResourcesWebsite | Twitter | LinkedInLogic View | App Builder | Knowledge LibraryDocs | Blog | GalleryCustomers | Careers | Integrations | PricingMentioned ContentArticles“Long Live Code” (June 2020)“Don't Tell Your Data Team's ROI Story” (Aug 2020)“The Sharing Gap” (Oct 2020)PeopleTristan Handy (Founder and CEO of dbt Labs)Claire Carroll (Product Manager of Hex, previous Community Manager of dbt Labs)Wes McKinney (Creator of Pandas and Arrow, Co-Founder and CTO of Voltron Data)DeVaris Brown (Co-Founder and CEO of Meroxa)Book“Mindset: The New Psychology of Success” (by Carol Dweck)NotesMy conversation with Caitlin was recorded back in Fall 2021. Since then, many things have happened at Hex. I'd recommend looking at:Caitlin's piece announcing Hex's SOC 2 Type II report to reflect Hex's commitment to securityCaitlin's recent talk at Data Council Austin about implementing reactive notebooks with iPythonThe release of Hex Knowledge Library, a new way to publish and discover data workHex's $16M Series A (led by Redpoint Ventures) and $52M Series B (led by a16z along with Snowflake, Databricks, and existing investors)Hex's increasing list of customers such as AngelList, Fivetran, Hightouch, Loom, Mixpanel, Notion, Ramp, Replicated, SeatGeek, etc.

Episode 87: Product Experimentation, ML Platforms, and Metrics Store with Nick Handel

Play Episode Listen Later Mar 25, 2022 87:28

Show Notes(01:51) Nick shared his formative experiences of her childhood — moving between different schools, becoming interested in Math, and graduating from UCLA at the age of 19.(05:45) Nick recalled working as a quant analyst focused on emerging market debt at BlackRock.(09:57) Nick went over his decision to join Airbnb as a data scientist on their growth team in 2014.(12:17) Nick discussed how data science could be used to drive community growth on the Airbnb platform.(16:35) Nick led the data architecture design and experimentation platform for Airbnb Trips, one of Airbnb's biggest product launches in 2016.(20:40) Nick provided insights on attributes of exceptional data science talent, given his time interviewing hundreds of candidates to build a data science team from 20 to 85+.(23:50) Nick went over his process of leveling up his product management skillset — leading Airbnb's Machine Learning teams and growing the data organization significantly.(26:56) Nick emphasized the importance of flexibility in his work routine.(29:27) Nick unpacked the technical and organizational challenges of designing and fostering the adoption of Bighead, Airbnb's internal framework-agnostic, end-to-end platform for machine learning.(34:54) Nick recalled his decision to leave Airbnb and become the Head of Data at Branch, which delivers world-class financial services to the mobile generation.(37:24) Nick unpacked key takeaways from his Bay Area AI meetup in 2019 called “ML Infrastructure at an Early Stage Startup” related to his work at Branch.(40:55) Nick discussed his decision to pursue a startup idea in the analytics space rather than the ML space.(43:36) Nick shared the founding story of Transform, whose mission is to make data accessible by way of a metrics store.(49:54) Nick walked through the four key capabilities of a metrics store: semantics, performance, governance, and interfaces + introduced Metrics Framework (Transform's capability to create company-wide alignment around key metrics that scale with an organization through a unified framework).(55:58) Nick unpacked Metrics Catalog — Transform's capability to eliminate repetitive tasks by giving everyone a single place to collaborate, annotate data charts, and view personalized data feeds.(59:57) Nick dissected Metrics API — Transform's capability to generate a set of APIs to integrate metrics into any other enterprise tools for enriched data, dimensional modeling, and increased flexibility.(01:02:41) Nick explained how metrics store fit into a modern data analytics stack(01:05:57) Nick shared valuable hiring lessons finding talents who fit with Transform's cultural values.(01:12:27) Nick shared the hurdles his team has to go through while finding early design partners for Transform.(01:15:38) Nick shared upcoming go-to-market initiatives that he's most excited about for Transform.(01:17:46) Nick shared fundraising advice for founders currently seeking the right investors for their startups.(01:20:45) Closing segment.Nick's Contact InfoLinkedInTwitterMediumTransform's ResourcesWebsiteBlogLinkedIn | TwitterMentioned ContentArticles + Talks“ML Infrastructure at an Early Stage” (March 2019)“Why We Founded Transform” (June 2021)“My Experience with Airbnb's Early Metrics Store” (June 2021)“The 4 Pillars of Our Workplace Culture” (Aug 2021)PeopleAirbnb's Metrics Repo Team (Paul Yang, James Mayfield, Will Moss, Jonathan Parks, and Aaron Keys)Maxime Beauchemin (Founder and CEO of Preset, Creator of Apache Airflow and Apache Superset)Emilie Schario (Data Strategist In Residence at Amplify Partners, Previously Head of Data at Netlify)Book“High-Output Management” (by Andy Grove)NotesMy conversation with Nick was recorded back in July 2021. Since then, many things have happened at Transform. I'd recommend:Registering for the Metrics Store Summit that will happen at the end of April 2022Reviewing the piece about 4 Pillars of Transform's Workplace CultureReading Nick's post on the brief history of the metrics storeExploring Transform's integrations with Mode, Hex, and Google SheetsAbout the showDatacast features long-form, in-depth conversations with practitioners and researchers in the data community to walk through their professional journeys and unpack the lessons learned along the way. I invite guests coming from a wide range of career paths — from scientists and analysts to founders and investors — to analyze the case for using data in the real world and extract their mental models (“the WHY and the HOW”) behind their pursuits. Hopefully, these conversations can serve as valuable tools for early-stage data professionals as they navigate their own careers in the exciting data universe.Datacast is produced and edited by James Le. Get in touch with feedback or guest suggestions by emailing khanhle.1013@gmail.com.Subscribe by searching for Datacast wherever you get podcasts or click one of the links below:Listen on SpotifyListen on Apple PodcastsListen on Google PodcastsIf you're new, see the podcast homepage for the most recent episodes to listen to, or browse the full guest list.

Episode 86: Risk Management, Open-Source Governance, and Negative Engineering with Jeremiah Lowin

head data astronomers inside story apache airflow

Play Episode Listen Later Mar 16, 2022 81:26

Show Notes(01:29) Jeremiah reflected on his academic interest studying Statistics and Economics at Harvard.(05:33) Jeremiah recalled his four years as a market risk manager at King Street Capital Management.(07:18) Jeremiah explained how the training in risk management has made a huge impact in his career as a startup founder.(09:48) Jeremiah then founded his own consultancy Lowin Data Company that designed and built ML systems for time series data.(12:38) Jeremiah mentioned his fascination with the rapid growth of machine learning in the past decade.(15:54) Jeremiah talked about his contribution to the Apache Airflow project and lessons learned about open-source development/governance.(21:48) Jeremiah unpacked the notion of negative engineering and shared the story behind the inception of Prefect.(27:24) Jeremiah dissected Prefect Core, the open-source framework that is stocked with all the necessary components for designing, building, testing, and running powerful data applications.(32:45) Jeremiah went over the advanced enterprise features of Prefect Cloud that complement users of Prefect Core.(36:04) Jeremiah discussed Prefect's product strategy (read his blog post "Toward Dataflow Automation," which distinguishes the difference between what a company makes and what a company sells).(40:44) Jeremiah explained how Prefect users can take advantage of the hybrid execution model.(47:08) Jeremiah walked through Prefect Server and Prefect UI that enable users to run parts of Prefect Cloud locally.(50:27) Jeremiah talked about how his team has gradually open-sourced the Prefect platform.(51:38) Jeremiah explained how Prefect settles into a "success-based pricing" model, where the cost is based entirely on the number of tasks users run successfully each month.(54:15) Jeremiah shared how to nurture a highly active community of open-source contributors to Prefect Core.(58:23) Jeremiah unpacked Prefect's hiring strategy, which emphasizes the importance of hiring a team diverse in thoughts, backgrounds, makeups, and experiences (read this fantastic guide to building a high-performance team on Prefect's website).(01:07:02) Jeremiah shared fundraising advice for founders currently seeking the right investors for their startups.(01:11:53) Jeremiah unpacked the two key pillars central to Prefect's hyper-adoption within the data world: expansion and product.(01:14:09) Closing segment.Jeremiah's Contact InfoLinkedInTwitterMediumGitHubPrefect's ResourcesWebsiteGitHub | Slack | Documentation | Twitter | MeetupCommunity UpdatesThe Prefect Guide to Building A High-Performance Team (April 2021)Prefect CloudPrefect CorePrefect's Hybrid ModelMentioned ContentArticles"Positive and Negative Engineering" (Oct 2018)"The Golden Spike" (Jan 2019)"Prefect is Open-Source!" (March 2019)"Towards Dataflow Automation" (June 2019)"The Prefect Hybrid Model" (Feb 2020)"Project Earth" (March 2020)"Open-Sourcing The Prefect Platform" (March 2020)"Your Code Will Fail (But That's Okay)" (May 2020)"Liftoff: Prefect's Series A" (Feb 2021)"Escape Velocity: Prefect's Series B" (June 2021)Talks and Podcasts"Invest Like The Best" (Jan 2017)"Task Failed Successfully" (PyData DC 2018)"Software Engineering Daily" (April 2020)"The OSS Startup Podcast" (Nov 2021)"The Sequel Show" (Jan 2022)PeopleVicki Boykis (ML Engineer at Tumblr, Newsletter Writer of Normcore Tech)Chris Riccomini (Software Engineer at WePay, Contributor of Airflow, Investor/Advisor at Prefect)Justin Gage (Newsletter Writer of Technically)Books"Creativity Inc." (by Ed Cadmull)"The Hitchhiker's Guide to the Galaxy" (by Douglas Adams, Eoin Colfer, and Thomas Tidholm)"Shoe Dog" (by Phil Knight)NotesMy conversation with Jeremiah was recorded back in July 2021. Since then, many things have happened at Prefect:The 2021 Growth ReportThe releases of Prefect Orion and Prefect Radar as part of the product roadmapThe announcement of Prefect's Premier Partnership Program for trusted partnersThe introduction of Prefect Discourse for data engineersThe latest drop of Prefect 2.0!About the showDatacast features long-form, in-depth conversations with practitioners and researchers in the data community to walk through their professional journeys and unpack the lessons learned along the way. I invite guests coming from a wide range of career paths — from scientists and analysts to founders and investors — to analyze the case for using data in the real world and extract their mental models (“the WHY and the HOW”) behind their pursuits. Hopefully, these conversations can serve as valuable tools for early-stage data professionals as they navigate their own careers in the exciting data universe.Datacast is produced and edited by James Le. Get in touch with feedback or guest suggestions by emailing khanhle.1013@gmail.com.Subscribe by searching for Datacast wherever you get podcasts or click one of the links below:Listen on SpotifyListen on Apple PodcastsListen on Google PodcastsIf you're new, see the podcast homepage for the most recent episodes to listen to, or browse the full guest list.

What's New in November and at re:Invent 2021

Melbourne AWS User Group

Play Episode Listen Later Jan 26, 2022 97:15

Pull your podcast player out of instant retrieval, because we're discussing re:Invent 2021 as well as the weeks before it. Lots of announcements; big, small, weird, awesome, and anything in between. We had fun with this episode and hope you do too. Find us at melb.awsug.org.au or as @AWSMelb on Twitter. News Finally in Sydney AWS Snowcone SSD is now available in the US East (Ohio), US West (San Francisco), Asia Pacific (Singapore), Asia Pacific (Sydney) and AWS Asia Pacific (Tokyo) regions Amazon EC2 M6i instances are now available in 5 additional regions Serverless Introducing Amazon EMR Serverless in preview Announcing Amazon Kinesis Data Streams On-Demand Announcing Amazon Redshift Serverless (Preview) Introducing Amazon MSK Serverless in public preview Introducing Amazon SageMaker Serverless Inference (preview) Simplify CI/CD Configuration for AWS Serverless Applications and your favorite CI/CD system – General Availability Amazon AppStream 2.0 launches Elastic fleets, a serverless fleet type AWS Chatbot now supports management of AWS resources in Slack (Preview) Lambda AWS Lambda now supports partial batch response for SQS as an event source AWS Lambda now supports cross-account container image pulling from Amazon Elastic Container Registry AWS Lambda now supports mTLS Authentication for Amazon MSK as an event source AWS Lambda now logs Hyperplane Elastic Network Interface (ENI) ID in AWS CloudTrail data events Step Functions AWS Step Functions Synchronous Express Workflows now supports AWS PrivateLink Amplify Introducing AWS Amplify Studio AWS Amplify announces the ability to override Amplify-generated resources using CDK AWS Amplify announces the ability to add custom AWS resources to Amplify-created backends using CDK and CloudFormation AWS Amplify UI launches new Authenticator component for React, Angular, and Vue AWS Amplify announces the ability to export Amplify backends as CDK stacks to integrate into CDK-based pipelines AWS Amplify expands its Notifications category to include in-app messaging (Developer Preview) AWS Amplify announces a redesigned, more extensible GraphQL Transformer for creating app backends quickly Containers Fargate Announcing AWS Fargate for Amazon ECS Powered by AWS Graviton2 Processors ECS Amazon ECS now adds container instance health information Amazon ECS has improved Capacity Providers to deliver faster Cluster Auto Scaling Amazon ECS-optimized AMI is now available as an open-source project Amazon ECS announces a new integration with AWS Distro for OpenTelemetry EKS Amazon EKS on AWS Fargate now Supports the Fluent Bit Kubernetes Filter Amazon EKS adds support for additional cluster configuration options using AWS CloudFormation Visualize all your Kubernetes clusters in one place with Amazon EKS Connector, now generally available AWS Karpenter v0.5 Now Generally Available AWS customers can now find, subscribe to, and deploy third-party applications that run in any Kubernetes environment from AWS Marketplace Other Amazon ECR announces pull through cache repositories AWS App Mesh now supports ARM64-based Envoy Images EC2 & VPC Instances New – EC2 Instances (G5) with NVIDIA A10G Tensor Core GPUs | AWS News Blog Announcing new Amazon EC2 G5g instances powered by AWS Graviton2 processors Introducing Amazon EC2 R6i instances Introducing two new Amazon EC2 bare metal instances Amazon EC2 Mac Instances now support hot attach and detach of EBS volumes Amazon EC2 Mac Instances now support macOS Monterey Announcing Amazon EC2 M1 Mac instances for macOS Announcing preview of Amazon Linux 2022 Elastic Beanstalk supports AWS Graviton-based Amazon EC2 instance types Announcing preview of Amazon EC2 Trn1 instances Announcing new Amazon EC2 C7g instances powered by AWS Graviton3 processors Announcing new Amazon EC2 Im4gn and Is4gen instances powered by AWS Graviton2 processors Introducing the AWS Graviton Ready Program Introducing Amazon EC2 M6a instances AWS Compute Optimizer now offers enhanced infrastructure metrics, a new feature for EC2 recommendations AWS Compute Optimizer now offers resource efficiency metrics Networking AWS price reduction for data transfers out to the internet Amazon Virtual Private Cloud (VPC) customers can now create IPv6-only subnets and EC2 instances Application Load Balancer and Network Load Balancer end-to-end IPv6 support AWS Transit Gateway introduces intra-region peering for simplified cloud operations and network connectivity Amazon Virtual Private Cloud (VPC) announces IP Address Manager (IPAM) to help simplify IP address management on AWS Amazon Virtual Private Cloud (VPC) announces Network Access Analyzer to help you easily identify unintended network access Introducing AWS Cloud WAN Preview Introducing AWS Direct Connect SiteLink Other Recover from accidental deletions of your snapshots using Recycle Bin Amazon EBS Snapshots introduces a new tier, Amazon EBS Snapshots Archive, to reduce the cost of long-term retention of EBS Snapshots by up to 75% Amazon CloudFront now supports configurable CORS, security, and custom HTTP response headers Amazon EC2 now supports access to Red Hat Knowledgebase Amazon EC2 Fleet and Spot Fleet now support automatic instance termination with Capacity Rebalancing AWS announces a new capability to switch license types for Windows Server and SQL Server applications on Amazon EC2 AWS Batch introduces fair-share scheduling Amazon EC2 Auto Scaling Now Supports Predictive Scaling with Custom Metrics Dev & Ops New services Measure and Improve Your Application Resilience with AWS Resilience Hub | AWS News Blog Scalable, Cost-Effective Disaster Recovery in the Cloud | AWS News Blog Announcing general availability of AWS Elastic Disaster Recovery AWS announces the launch of AWS AppConfig Feature Flags in preview Announcing Amazon DevOps Guru for RDS, an ML-powered capability that automatically detects and diagnoses performance and operational issues within Amazon Aurora Introducing Amazon CloudWatch Metrics Insights (Preview) Introducing Amazon CloudWatch RUM for monitoring applications' client-side performance IaC AWS announces Construct Hub general availability AWS Cloud Development Kit (AWS CDK) v2 is now generally available You can now import your AWS CloudFormation stacks into a CloudFormation stack set You can now submit multiple operations for simultaneous execution with AWS CloudFormation StackSets AWS CDK releases v1.126.0 - v1.130.0 with high-level APIs for AWS App Runner and hotswap support for Amazon ECS and AWS Step Functions SDKs AWS SDK for Swift (Developer Preview) AWS SDK for Kotlin (Developer Preview) AWS SDK for Rust (Developer Preview) CICD AWS Proton now supports Terraform Open Source for infrastructure provisioning AWS Proton introduces Git management of infrastructure as code templates AWS App2Container now supports Jenkins for setting up a CI/CD pipeline Other Amazon CodeGuru Reviewer now detects hardcoded secrets in Java and Python repositories EC2 Image Builder enables sharing Amazon Machine Images (AMIs) with AWS Organizations and Organization Units Amazon Corretto 17 Support Roadmap Announced Amazon DevOps Guru now Supports Multi-Account Insight Aggregation with AWS Organizations AWS Toolkits for Cloud9, JetBrains and VS Code now support interaction with over 200 new resource types AWS Fault Injection Simulator now supports Amazon CloudWatch Alarms and AWS Systems Manager Automation Runbooks. AWS Device Farm announces support for testing web applications hosted in an Amazon VPC Amazon CloudWatch now supports anomaly detection on metric math expressions Introducing Amazon CloudWatch Evidently for feature experimentation and safer launches New – Amazon CloudWatch Evidently – Experiments and Feature Management | AWS News Blog Introducing AWS Microservice Extractor for .NET Security AWS Secrets Manager increases secrets limit to 500K per account AWS CloudTrail announces ErrorRate Insights AWS announces the new Amazon Inspector for continual vulnerability management Amazon SQS Announces Server-Side Encryption with Amazon SQS-managed encryption keys (SSE-SQS) AWS WAF adds support for Captcha AWS Shield Advanced introduces automatic application-layer DDoS mitigation Security Hub AWS Security Hub adds support for AWS PrivateLink for private access to Security Hub APIs AWS Security Hub adds three new FSBP controls and three new partners SSO Manage Access Centrally for CyberArk Users with AWS Single Sign-On Manage Access Centrally for JumpCloud Users with AWS Single Sign-On AWS Single Sign-On now provides one-click login to Amazon EC2 instances running Microsoft Windows AWS Single Sign-On is now in scope for AWS SOC reporting Control Tower AWS Control Tower now supports concurrent operations for detective guardrails AWS Control Tower now supports nested organizational units AWS Control Tower now provides controls to meet data residency requirements Deny services and operations for AWS Regions of your choice with AWS Control Tower AWS Control Tower introduces Terraform account provisioning and customization Data Storage & Processing Databases Relational databases Announcing Amazon RDS Custom for SQL Server New Multi-AZ deployment option for Amazon RDS for PostgreSQL and for MySQL; increased read capacity, lower and more consistent write transaction latency, and shorter failover time (Preview) Amazon RDS now supports cross account KMS keys for exporting RDS Snapshots Amazon Aurora supports MySQL 8.0 Amazon RDS on AWS Outposts now supports backups on AWS Outposts Athena Amazon Athena adds cost details to query execution plans Amazon Athena announces cross-account federated query New and improved Amazon Athena console is now generally available Amazon Athena now supports new Lake Formation fine-grained security and reliable table features Announcing Amazon Athena ACID transactions, powered by Apache Iceberg (Preview) Redshift Announcing preview for write queries with Amazon Redshift Concurrency Scaling Amazon Redshift announces native support for SQLAlchemy and Apache Airflow open-source frameworks Amazon Redshift simplifies the use of other AWS services by introducing the default IAM role Announcing Amazon Redshift cross-region data sharing (preview) Announcing preview of SQL Notebooks support in Amazon Redshift Query Editor V2 Neptune Announcing AWS Graviton2-based instances for Amazon Neptune AWS releases open source JDBC driver to connect to Amazon Neptune MemoryDB Amazon MemoryDB for Redis now supports AWS Graviton2-based T4g instances and a 2-month Free Trial Database Migration Service AWS Database Migration Service now supports parallel load for partitioned data to S3 AWS Database Migration Service now supports Kafka multi-topic AWS Database Migration Service now supports Azure SQL Managed Instance as a source AWS Database Migration Service now supports Google Cloud SQL for MySQL as a source Introducing AWS DMS Fleet Advisor for automated discovery and analysis of database and analytics workloads (Preview) AWS Database Migration Service now offers a new console experience, AWS DMS Studio AWS Database Migration Service now supports Time Travel, an improved logging mechanism Other Database Activity Streams now supports Graviton2-based instances Amazon Timestream now offers faster and more cost-effective time series data processing through scheduled queries, multi-measure records, and magnetic storage writes Amazon DynamoDB announces the new Amazon DynamoDB Standard-Infrequent Access table class, which helps you reduce your DynamoDB costs by up to 60 percent Achieve up to 30% better performance with Amazon DocumentDB (with MongoDB compatibility) using new Graviton2 instances S3 Amazon S3 on Outposts now delivers strong consistency automatically for all applications Amazon S3 Lifecycle further optimizes storage cost savings with new actions and filters Announcing the new Amazon S3 Glacier Instant Retrieval storage class - the lowest cost archive storage with milliseconds retrieval Amazon S3 Object Ownership can now disable access control lists to simplify access management for data in S3 Amazon S3 Glacier storage class is now Amazon S3 Glacier Flexible Retrieval; storage price reduced by 10% and bulk retrievals are now free Announcing the new S3 Intelligent-Tiering Archive Instant Access tier - Automatically save up to 68% on storage costs Amazon S3 Event Notifications with Amazon EventBridge help you build advanced serverless applications faster Amazon S3 console now reports security warnings, errors, and suggestions from IAM Access Analyzer as you author your S3 policies Amazon S3 adds new S3 Event Notifications for S3 Lifecycle, S3 Intelligent-Tiering, object tags, and object access control lists Glue AWS Glue DataBrew announces native console integration with Amazon AppFlow AWS Glue DataBrew now supports custom SQL statements to retrieve data from Amazon Redshift and Snowflake AWS Glue DataBrew now allows customers to create data quality rules to define and validate their business requirements FSx Introducing Amazon FSx for OpenZFS Amazon FSx for Lustre now supports linking multiple Amazon S3 buckets to a file system Amazon FSx for Lustre can now automatically update file system contents as data is deleted and moved in Amazon S3 Announcing the next generation of Amazon FSx for Lustre file systems Backup Announcing preview of AWS Backup for Amazon S3 AWS Backup adds support for Amazon Neptune AWS Backup adds support for Amazon DocumentDB (with MongoDB compatibility) AWS Backup provides new resource assignment rules for your data protection policies AWS Backup adds support for VMware workloads Other AWS Lake Formation now supports AWS PrivateLink AWS Transfer Family adds identity provider options and enhanced monitoring capabilities Introducing ability to connect to EMR clusters in different subnets in EMR Studio AWS Snow Family now supports external NTP server configuration Announcing data tiering for Amazon ElastiCache for Redis Now execute python files and notebooks from another notebook in EMR Studio AWS Snow Family launches offline tape data migration capability AI & ML SageMaker Introducing Amazon SageMaker Canvas - a visual, no-code interface to build accurate machine learning models Announcing Fully Managed RStudio on Amazon SageMaker for Data Scientists | AWS News Blog Amazon SageMaker now supports inference testing with custom domains and headers from SageMaker Studio Amazon SageMaker Pipelines now supports retry policies and resume Announcing new deployment guardrails for Amazon SageMaker Inference endpoints Amazon announces new NVIDIA Triton Inference Server on Amazon SageMaker Amazon SageMaker Pipelines now integrates with SageMaker Model Monitor and SageMaker Clarify Amazon SageMaker now supports cross-account lineage tracking and multi-hop lineage querying Introducing Amazon SageMaker Inference Recommender Introducing Amazon SageMaker Ground Truth Plus: Create high-quality training datasets without having to build labeling applications or manage the labeling workforce on your own Amazon SageMaker Studio Lab (currently in preview), a free, no-configuration ML service Amazon SageMaker Studio now enables interactive data preparation and machine learning at scale within a single universal notebook through built-in integration with Amazon EMR Other General Availability of Syne Tune, an open-source library for distributed hyperparameter and neural architecture optimization Amazon Translate now supports AWS KMS Encryption Amazon Kendra releases AWS Single Sign-On integration for secure search Amazon Transcribe now supports automatic language identification for streaming transcriptions AWS AI for data analytics (AIDA) partner solutions Introducing Amazon Lex Automated Chatbot Designer (Preview) Amazon Kendra launches Experience Builder, Search Analytics Dashboard, and Custom Document Enrichment Other Cool Stuff In The Works – AWS Canada West (Calgary) Region | AWS News Blog Unified Search in the AWS Management Console now includes blogs, knowledge articles, events, and tutorials AWS DeepRacer introduces multi-user account management Amazon Pinpoint launches in-app messaging as a new communications channel Amazon AppStream 2.0 Introduces Linux Application Streaming Amazon SNS now supports publishing batches of up to 10 messages in a single API request Announcing usability improvements in the navigation bar of the AWS Management Console Announcing General Availability of Enterprise On-Ramp Announcing preview of AWS Private 5G AWS Outposts is Now Available in Two Smaller Form Factors Introducing AWS Mainframe Modernization - Preview Introducing the AWS Migration and Modernization Competency Announcing AWS Data Exchange for APIs Amazon WorkSpaces introduces Amazon WorkSpaces Web Amazon SQS Enhances Dead-letter Queue Management Experience For Standard Queues Introducing AWS re:Post, a new, community-driven, questions-and-answers service AWS Resource Access Manager enables support for global resource types AWS Ground Station launches expanded support for Software Defined Radios in Preview Announcing Amazon Braket Hybrid Jobs for running hybrid quantum-classical workloads on Amazon Braket Introducing AWS Migration Hub Refactor Spaces - Preview Well-Architected Framework Customize your AWS Well-Architected Review using Custom Lenses New Sustainability Pillar for the AWS Well-Architected Framework IoT Announcing AWS IoT RoboRunner, Now Available in Preview AWS IoT Greengrass now supports Microsoft Windows devices AWS IoT Core now supports Multi-Account Registration certificates on IoT Credential Provider endpoint Announcing AWS IoT FleetWise (Preview), a new service for transferring vehicle data to the cloud more efficiently Announcing AWS IoT TwinMaker (Preview), a service that makes it easier to build digital twins AWS IoT SiteWise now supports hot and cold storage tiers for industrial data New connectivity software, AWS IoT ExpressLink, accelerates IoT development (Preview) AWS IoT Device Management Fleet Indexing now supports two additional data sources (Preview) Connect Amazon Connect now enables you to create and orchestrate tasks directly from Flows Amazon Connect launches scheduled tasks Amazon Connect launches Contact APIs to fetch and update contact details programmatically Amazon Connect launches API to configure security profiles programmatically Amazon Connect launches APIs to archive and delete contact flows Amazon Connect now supports contact flow modules to simplify repeatable logic Sponsors CMD Solutions Silver Sponsors Cevo Versent

ip react time travel jenkins iot api aws ml java amplify reinvent apis 500k kafka invent sql notifications git kubernetes automatically elastic angular emr rds mongodb ci cd mysql terraform cloud9 vs code ipv6 redis postgresql ntp sql server cors windows server kms jetbrains cdk amazon s3 lustre outposts authenticator dynamodb amazon ec2 arm64 cloudformation software defined radio t4g amazon sagemaker amazon rds aws fargate sqs apache airflow aws outposts amazon redshift aws cloudformation jdbc amazon ecs sqlalchemy amazon athena amazon cloudfront aws organizations amazon sqs aws cloudtrail aws management console amazon inspector amazon elasticache amazon documentdb aws regions aws privatelink network load balancer amazon fsx amazon msk aws single sign on google cloud sql

#262 So many bots up in your documentation

Python Bytes

Play Episode Listen Later Dec 9, 2021 43:06

Watch the live stream: Watch on YouTube About the show Sponsored by us: Check out the courses over at Talk Python And Brian's book too! Special guest: Leah Cole Brian #1: pytest 7.0.0rc1 Question: Does the new pytest book work with pytest 7? Answer: Yes! I've been working with pytest 7 during final review of all code, and many pytest core developers have been technical reviewers of the book. A few changes in pytest 7 are also the result of me writing the 2nd edition and suggesting (and in one case implementing) improvements. Florian Bruhin's announcement on Twitter “I'm happy to announce that I just released #pytest 7.0.0rc1! After many tricky deprecations, some internal changes, and months of delay due to various issues, it looks like we could finally get a new non-bugfix release this year! (6.2.0 was released in December 2020).” “We invite everyone to test the #pytest prerelease and report any issues - there is a lot that happened, and chances are we broke something we didn't find yet (we broke a lot of stuff we already fixed Smiling face with open mouth and cold sweat). See the release announcement for details: https://docs.pytest.org/en/7.0.x/announce/release-7.0.0rc1.html” Try it out with pip install pytest==7.0.0rc1 For those of you following along at home (we covered pip index briefly in episode 259) to see rc releases with pip index versions, add --pre ex: pip index versions --``pre pytest will include Available versions: 7.0.0rc1, 6.2.5, 6.2.4, and let you know if there's a newer rc available. Highlights from the 7.0.0rc1 changelog pytest.approx() now works on Decimal within mappings/dicts and sequences/lists. Improvements to approx() with sequences of numbers. Example: > assert [1, 2, 3, 4] == pytest.approx([1, 3, 3, 5]) E assert comparison failed for 2 values: E Index | Obtained | Expected E 1 | 2 | 3 +- 3.0e-06 E 3 | 4 | 5 +- 5.0e-06 pytest invocations with --fixtures-per-test and --fixtures have been enriched with: Fixture location path printed with the fixture name. First section of the fixture's docstring printed under the fixture name. Whole of fixture's docstring printed under the fixture name using --verbose option. Never again wonder where a fixture's definition is RunResult method assert_outcomes now accepts a warnings and deselected argument to assert the total number of warnings captured. Helpful for plugin testing. Added pythonpath setting that adds listed paths to sys.path for the duration of the test session. Nice for using pytest for applications, and for including test helper libraries. Improved documentation, including an auto-generated list of plugins. There were 963 this morning. Michael #2: PandasTutor via David Smit Why use this tool? Let's say you're trying to explain what this pandas code does: (dogs[dogs['size'] == 'medium'] .sort_values('type') .groupby('type').median() ) But this doesn't tell you what's going on behind the scenes. What did this code just do? This single code expression has 4 steps (filtering, sorting, grouping, and aggregating), but only the final output is shown. Where were the medium-sized dogs? This code filters for dogs with size "medium", but none of those dogs appear in the original table display (on the left) because they were buried in the middle rows. How were the rows grouped? The output doesn't show which rows were grouped and aggregated together. (Note that printing a pandas.GroupBy object won't display this information either.) If you ran this same code in Pandas Tutor, you can teach students exactly what's going on step-by-step Leah #3: Apache Airflow Workflow orchestration tool the originated at Airbnb and is now part of the Apache Software Foundation author workflows as directed acyclic graphs (DAGs) of tasks Airflow works best with workflows that are mostly static and slowly changing. When DAG structure is similar from one run to the next, it allows for clarity around unit of work and continuity. Typical data analytics workflow is the Extract, Transform, Load (ETL) workflow - I have data somewhere that I need to get (extract), I do something to it (Transform) and I put that result somewhere else (load) Airflow has "Operators" and connectors which enable you to perform common tasks in popular libraries and Cloud providers Let's talk about a sample - I work on GCP so my sample will be GCP based because that's what I use most. One common workflow I see is running Spark jobs in ephemeral Dataproc clusters. I'm actually writing a tutorial demonstrating this now - literally in progress in another tab BigQuery -> Create Dataproc cluster -> Run PySpark Dataproc job -> Store results in GCS -> delete Dataproc cluster Airflow has a really wonderful, active community. Please join us. Brian #4: textwrap.dedent Suggested by Michel Rogers-Vallée Small utility but super useful. Also, built in to Python standard library. BTW, textwrap package has other cool tools you probably didn't know Python could do right out of the box. It's worth reading the docs. dedent akes a multiline string (the ones with tripple quotes). Removes all common whitespace. This allows you to have multi-line strings defined in functions without mucking up your indenting. Example from docs: def test(): # end first line with to avoid the empty line! s = ''' hello world ''' print(repr(s)) # prints ' hellon worldn ' print(repr(dedent(s))) # prints 'hellon worldn' Better example: from textwrap import dedent def multiline_hello_world(): print("hello") print(" world") def test_multiline_hello_world(capsys): expected = dedent(''' hello world ''') multiline_hello_world() actual = capsys.readouterr().out assert actual == expected Michael #5: pip-audit via Dan Bader (from Real Python) Audits Python environments and dependency trees for known vulnerabilities Are your dependencies containing security issues? What about their dependencies, the ones you forgot to list in your requirements files or pin? Just run pip-audit on your requirements file(s) Perfect candidate for pipx Leah #6 - Using bots to manage samples Another part of my job is working with other software engineers in GCP to oversee the maintenance our Python samples We have thousands of samples in hundreds of repos that are part of GCP documentation To ensure consistency and that this wonderful group of Devrel Engineers has time to get their work done and also function as a human, we use a lot of automation Bots do things like keep our dependencies up to date, check for license headers, auto-assign PRs and issues to code-owners, sync repositories with a centralized config, and more the GCP DevRel github automation team has an open source repo with some of the bots they have developed that we use every day and we use whitesource renovatebot to manage our dependencies and keep them up to date Extras Michael: Github CMD/CTRL+K command palette Python 3.10.1 is out Joke: HTTP status code meanings http.cat

The Inside Story of Apache Airflow with Steven Hillion

The Data Wranglers

Play Episode Listen Later Nov 18, 2021 39:03

What data orchestration platform is downloaded more than 10,000 times a day? Data scientist Steven Hillion joins The Data Wranglers Joe Hellerstein and Jeffrey Heer to give the inside story on Apache Airflow, used by data scientists and data engineers around the world. Apache Airflow is managed commercially by Astronomer.io, where Hillion is Head of Data and in his spare time, is writing a book of poems from mathematic formulas. #TheDataWranglers

05 October 2021 Cyber and Tech News

Cyber and Technology with Mike

Play Episode Listen Later Oct 5, 2021 10:56

In today's podcast we cover four crucial cyber and technology topics, including: 1. Facebook services down for six hours due to misconfiguration 2. Apache Airflow legacy products riddled with flaws 3. Kansas county pays ransom following attack 4. Syniverse says threatactor spied on customers for six years I'd love feedback, feel free to send your comments and feedback to | cyberandtechwithmike@gmail.com

kansas whatsapp spies cyber hacking verizon ransom t mobile apache vodafone daily news tech news telefonica twi airflow technology news china mobile apache airflow cyber news syniverse

#330: Apache Airflow Open-Source Workflow with Python

Talk Python To Me - Python conversations for passionate developers

Play Episode Listen Later Aug 20, 2021 67:50

If you are working with data pipelines, you definitely need to give Apache Airflow a look. This pure-Python workflow framework is one of the most popular and capable out there. You create your workflows by writing Python code using clever language operators and then you can monitor them and even debug them visually once they get started. Stop writing manual code or cron-job based code to create data pipelines check out Airflow. We're joined by three excellent guests from the Airflow community: Jarek Potiuk, Kaxil Naik, and Leah Cole. Links from the show Jarek Potiuk: linkedin.com Kaxil Naik: @kaxil Leah Cole: @leahecole Airflow site: airflow.apache.org Airflow on GitHub: github.com Airflow community: airflow.apache.org UI: github.com Helm Chart for Apache Airflow: airflow.apache.org Airflow Summit: airflowsummit.org Astronomer: astronomer.io Astronomer Registry (Easy to search for official and community Providers): registry.astronomer.io REST API: airflow.apache.org Contributing: github.com Airflow Loves Kubernetes talk: airflowsummit.org Episode transcripts: talkpython.fm Sponsors Talk Python Training AssemblyAI

online training web software developers programming open source python ui data science github workflow online courses providers contributing astronomers cloud computing ide software developers web development mongodb nosql rest apis airflow pycharm apache airflow assemblyai talk python training python3 python2

What's New in May 2021

Melbourne AWS User Group

Play Episode Listen Later Aug 12, 2021 69:25

Once again Arjen, Jean-Manuel, and Guy discuss the latest and greatest announcements from AWS in this roundup of the news of May. Also once again, this was recorded 2 months before it went up, but luckily it's all still relevant. Even the comments about being in lockdown. News Finally in Sydney

miami ios spark lift oracle analytics iot dimension aws java scalable on demand kubernetes 3x cdi 5x kantar ssr arjen postgresql sql server visual studio code ec2 lustre 20x savings plan apache kafka nist cybersecurity framework access management iam auto scaling amazon rds aws fargate apache airflow amazon eks aws sdk aws control tower crls amazon emr amazon fsx aws iot device management

#464: Diving deep into Amazon MWAA

AWS Podcast

Play Episode Listen Later Aug 8, 2021 17:33

Amazon Managed Workflows for Apache Airflow (MWAA) is a managed orchestration service for Apache Airflow that makes it easier to set up and operate data pipelines in the cloud. Today Simon is joined by John Jackson, Senior Product Manager at AWS, to learn all about this recently launched service. They dig into the benefits of Amazon MWAA, why the team built the service, how it fits in with other AWS services, and the commitment to open source. Learn more: https://aws.amazon.com/managed-workflows-for-apache-airflow/ Leave audio feedback: https://bit.ly/2MvZPOL

amazon aws diving deep senior product manager john jackson apache airflow

In this episode, our managing editor Jenn Webb and I speak with Chris White, CTO of Prefect, a startup building tools to help companies build, monitor, and manage dataflows. Prefect originated from lessons Chris and his co-founder learned while they were at Capital One, where they were early users and contributors to related projects like Apache Airflow.Subscribe: Apple • Android • Spotify • Stitcher • Google • RSS.Detailed show notes can be found on The Data Exchange web site.Subscribe to The Gradient Flow Newsletter.

next generation cto detailed capital one orchestration chris white prefect dataflow apache airflow automation system

The Rise of Apache Airflow for Data Pipeline Orchestration with Marc Lamberti

data code analytics nesse pipeline big data ml produ deployment astronomers recomenda orchestration etl data engineers orchestrator scheduler apache spark lamberti apache airflow

Play Episode Listen Later Apr 28, 2021 72:44

Para marcarmos história no podcast, trouxemos um convidado mais que especial para falar sobre Apache Airflow. Marc Lamberti é o head de treinamentos da Astronomer, empresa que acelera a mais famosa ferramenta de orquestração open-source do mundo.Nesse bate papo falamos sobre os seguintes tópicos:* Empresas Utilizando Workflow-as-Code para ETL e ELT * Use-Cases e Melhores Práticas do Apache Airflow para Big Data* Apache Airflow 2.0 e Recomendações para Deployment em Produção* Astronomer e Apache Airflow* Astronomer Registry* Dicas e Recomendações para Iniciantes Luan Moreno = https://www.linkedin.com/in/luanmoreno/

Python en español #9: Tertulia 2020-12-01

Python en español

Play Episode Listen Later Apr 21, 2021 128:09

Persistencia de datos en Python https://podcast.jcea.es/python/9 Escucharme (Jesús Cea) es agotador. ¡Persistencia! Participantes: Eduardo Castro info@ecdesign.es. Jesús Cea, email: jcea@jcea.es, twitter: @jcea, https://blog.jcea.es/, https://www.jcea.es/. Sergio, desde Moaña. Adrián, desde Vigo. Juan Carlos, desde Bilbao. Javier, desde Madrid. Audio editado por Pablo Gómez, twitter: @julebek. La música de la entrada y la salida es "Lightning Bugs", de Jason Shaw. Publicada en https://audionautix.com/ con licencia - Creative Commons Attribution 4.0 International License. [00:52] Amplia comunidad tecnológica en Vigo. [05:22] Contexto y estilo de las tertulias. [08:52] Proyectos importantes e interesantes que pasan desapercibidos. Apache Airflow: https://airflow.apache.org/. Tryton: https://www.tryton.org/. El mundo es muy grande... [12:52] Antes de empezar un proyecto nuevo hay que investigar el estado del ecosistema. [14:12] La mayoría de las charlas Python se centran en librerías concretas. A mí me interesa el lenguaje concreto o técnicas útiles para cualquier programador Python. [16:37] ¿Compatibilidad hacia atrás? Poner límites a la compatibilidad, por salud mental. La compatibilidad hacia atrás te limita a la hora de adoptar novedades en el lenguaje o hacer limpieza en el código. Soportar solo las versiones soportadas de Python. [23:22] ¿Qué pasa si la versión nueva de la librería solo funciona en Python 3 pero la está instalando en Python 2? PIP no permite imprimir nada en pantalla, a menos que sea un error. Las versiones modernas de PIP permiten especificar la compatibilidad con versiones de Python. [27:52] Interfaces de usuario en Python. wxWidgets: https://wxwidgets.org/. Kivi: https://kivy.org/. Usar directamente HTML/JS/CSS con un microservidor en 127.0.0.1. Una ventaja adicional es que permite el acceso remoto. [31:40] Compilar y distribuir módulos binarios Python para MS Windows. ¿Por qué nadie ha sacado un generador de instaladores mutiplataforma? Poder generar un instalador para MS Windows desde Linux? ¿Algún servicio al que mandas código fuente y te devuelva una versión compilada para MS Windows? [38:32] ¡Persistencia! Persistencia de objetos nativos de Python comparado con los ORM. Adaptación de impedancias entre lenguajes: Python/SQL. ZODB: http://www.zodb.org/en/latest/. Durus: https://www.mems-exchange.org/software/DurusWorks/. Ecosistemas pequeños. Migración de versiones. [56:22] Nuevo "resolver" de PIP: https://pyfound.blogspot.com/2020/11/pip-20-3-new-resolver.html. [01:00:52] Diferencia entre "file.readlines()" y "string.splitlines()". CBOR: https://tools.ietf.org/html/rfc7049. JSON no mola. [01:12:07] ¿Ya habeis migrado a Python 3.9? Mejoras. ¿Cual es la versión más antigua que estáis usando?. Python 3.6 es la versión más antigua aún soportado. "Async" pasó a ser una palabra reservada. Mantener la compatibilidad impide usar las novedades del lenguaje, por ejemplo, f-strings o "dataclasses" https://docs.python.org/3/library/dataclasses.html. Paquete externo "dataclasses" para versiones antiguas de Python: https://pypi.org/project/dataclasses/. [01:19:12] Cacheo de números -5..256. En CPython los destructores se invocan inmediatamente. Deuda técnica que hay que pagarla... o no. [01:21:42] Volvemos a persistencia / SQL. Abstracciones. ¿Qué pasa cuando actualizas Python? Actualizaciones de tu programa. Migraciones. [01:34:52] Profundizamos en cómo funciona la persistencia. [01:48:17] Profiling de memoria. memory-profiler https://pypi.org/project/memory-profiler/. tracemalloc https://docs.python.org/3/library/tracemalloc.html. Algunos trucos para ayudar, por ejemplo, etiquetar las estructura de datos. Manhole: https://pypi.org/project/manhole/. Volcar la memoria de un proceso sin matar el proceso: gcore https://www.linux.org/docs/man1/gcore.html. Top 5 Python Memory Profilers https://stackify.com/top-5-python-memory-profilers/. [01:59:22] Cierre de la tertulia y administratrivia. [02:03:37] Arggg, ¡otra vez persistencia! ¡Menudo rollo! Pyramid https://trypyramid.com/. ZODB: http://www.zodb.org/en/latest/. Durus: https://www.mems-exchange.org/software/DurusWorks/.

Drill to Detail Ep.88 'Superset, Preset and the Future of Business Intelligence' with Special Guest Maxime Beauchemin

Play Episode Listen Later Apr 12, 2021 43:55

Maxime Beauchemin returns to the Drill to Detail Podcast and joins Mark Rittman to talk about what's new with Apache Airflow 2.0, the origin story for Apache Superset and now Preset.io, why the future of business intelligence is open source and news on Marquez, a reference implementation of the OpenLineage open source metadata service for the collection, aggregation, and visualization of a data ecosystem's metadata sponsored by WeWork.The Rise of the Data EngineerDrill to Detail Ep.26 'Airflow, Superset & The Rise of the Data Engineer' with Special Guest Maxime BeaucheminApache Airflow 2.0 is here!Apache Superset is a modern data exploration and visualization platformThe Future of Business Intelligence is Open SourcePowerful, easy to use data exploration and visualization platform, powered by Apache Superset™Admunsen: Open source data discovery and metadata engineOpenLineageMarquez: Collect, aggregate, and visualize a data ecosystem's metadata

detail drill wework marquez business intelligence data engineers supersets preset apache airflow maxime beauchemin mark rittman

Episode #96: Serverless and Machine Learning with Alexandra Abbas

Serverless Chats

Play Episode Listen Later Apr 12, 2021 44:19

About Alexa AbbasAlexandra Abbas is a Google Cloud Certified Data Engineer & Architect and Apache Airflow Contributor. She currently works as a Machine Learning Engineer at Wise. She has experience with large-scale data science and engineering projects. She spends her time building data pipelines using Apache Airflow and Apache Beam and creating production-ready Machine Learning pipelines with Tensorflow.Alexandra was a speaker at Serverless Days London 2019 and presented at the Tensorflow London meetup.Personal linksTwitter: https://twitter.com/alexandraabbasLinkedIn: https://www.linkedin.com/in/alexandraabbasGitHub: https://github.com/alexandraabbasdatastack.tv's linksWeb: https://datastack.tvTwitter: https://twitter.com/datastacktvYouTube: https://www.youtube.com/c/datastacktvLinkedIn: https://www.linkedin.com/company/datastacktvGitHub: https://github.com/datastacktvLink to the Data Engineer Roadmap: https://github.com/datastacktv/data-engineer-roadmapThis episode is sponsored by CBT Nuggets: cbtnuggets.com/serverless andStackery: https://www.stackery.io/Watch this video on YouTube: https://youtu.be/SLJZPwfRLb8TranscriptJeremy: Hi, everyone. I'm Jeremy Daly, and this is Serverless Chats. Today I'm joined by Alexa Abbas. Hey, Alexa, thanks for joining me.Alexa: Hey, everyone. Thanks for having me.Jeremy: So you are a machine learning engineer at Wise and also the founder of datastack.tv. So I'd love it if you could tell the listeners a little bit about your background and what you do at Wise and what datastack.tv is all about.Alexa: Yeah. So as you said, I'm a machine learning engineer at Wise. So Wise is an international money transfer service. We are aiming for very transparent fees and very low fees compared to banks. So at Wise, basically, designing, maintaining, and developing the machine learning platform, which serves data scientists and analysts, so they can train their models and deploy their models, easily.Datastack.tv is, basically, it's a video service or a video platform for data engineers. So we create bite-sized videos, educational videos, for data engineers. We mostly cover open source topics, because we noticed that some of the open source tools in the data engineering world are quite underserved in terms of educational content. So we create videos about those.Jeremy: Awesome. And then, what about your background?Alexa: So I actually worked as a data engineer and machine learning engineer, so I've always been a data engineer or machine learning engineer in terms of roles. I also worked, for a small amount of time, I worked as a data scientist as well. In terms of education, I did a big data engineering Master's, but actually my Bachelor is economics, so quite a mix.Jeremy: Well, it's always good to have a ton of experience and that diverse perspective. Well, listen, I'm super excited to have you here, because machine learning is one of those things where it probably is more of a buzzword, I think, to a lot of people where every startup puts it in their pitch deck, like, "Oh, we're doing machine learning and artificial intelligence ..." stuff like that. But I think it's important to understand, one, what exactly it is, because I think there's a huge confusion there in terms of what we think of as machine learning, and maybe we think it's more advanced than it is sometimes, as I think there's lower versions of machine learning that can be very helpful.And obviously, this being a serverless podcast, I've heard you speak a number of times about the work that you've done with machine learning and some experiments you've done with serverless there. So I'd love to just pick your brain about that and just see if we can educate the users here on what exactly machine learning is, how people are using it, and where it fits in with serverless and some of the use cases and things like that. So first of all, I think one of the important things to start with anyways is this idea of MLOps. So can you explain what MLOps is?Alexa: Yeah, sure. So really short, MLOps is DevOps for machine learning. So I guess the traditional software engineering projects, you have a streamlined process you can release, really often, really quickly, because you already have all these best practices that all these traditional software engineering projects implement. Machine learning, this is still in a quite early stage and MLOps is in a quite early stage. But what we try to do in MLOps is we try to streamline machine learning projects, as well as traditional software engineering projects are streamlined. So data scientists can train models really easily, and they can release models really frequently and really easily into production. So MLOps is all about streamlining the whole data science workflow, basically.And I guess it's good to understand what the data science workflow is. So I talk a bit about that as well. So before actually starting any machine learning project, the first phase is an experimentation phase. It's a really iterative process when data scientists are looking at the data, they are trying to find features and they are also training many different models; they are doing architecture search, trying different architecture, trying different hyperparameter settings with those models. So it's a really iterative process of trying many models, many features.And then by the end, they probably find a model that they like and that hit the benchmark that they were looking for, and then they are ready to release that model into production. And this usually looks like ... so sometimes they use shadow models, in the beginning, to check if the results are as expected in production as well, and then they actually release into production. So basically MLOps tries to create the infrastructure and the processes that streamline this whole process, the whole life cycle.Jeremy: Right. So the question I have is, so if you're an ML engineer or you're working on these models and you're going through these iterations and stuff, so now you have this, you're ready to release it to production, so why do you need something like an MLOps pipeline? Why can't you just move that into production? Where's the barrier?Alexa: Well, I guess ... I mean, to be honest, the thing is there shouldn't be a barrier. Right now, that's the whole goal of MLOps. They shouldn't feel that they need to do any manual model artifact copying or anything like that. They just, I don't know, press a button and they can release to production. So that's what MLOps is about really and we can version models, we can version the data, things like that. And we can create reproducible experiments. So I guess right now, I think many bits in this whole lifecycle is really manual, and that could be automated. For example, releasing to production, sometimes it's a manual thing. You just copy a model artifact to a production bucket or whatever. So sometimes we would like to automate all these things.Jeremy: Which makes a lot of sense. So then, in terms of actually implementing this stuff, because we hear all the time about CI/CD. If we're talking about DevOps, we know that there's all these tools that are being built and services that are being launched that allow us to quickly move code through some process and get into production. So are there similar tools for deploying models and things like that?Alexa: Well, I think this space is quite crowded. It's getting more and more crowded. I think there are many ... So there are the cloud providers, who are trying to create tools that help these processes, and there are also many third-party platforms that are trying to create the ML platform that everybody uses. So I think there is no go-to thing that everybody uses, so I think there is many tools that we can use.Some examples, for example, TensorFlow is a really popular machine learning library, But TensorFlow, they created a package on top of TensorFlow, which is called TFX, TensorFlow Extended, which is exactly for streamlining this process and serving models easily, So I would say it TFX is a really good example. There is Kubeflow, which is a machine learning toolkit for Kubernetes. I think there are many custom implementations in-house in many companies, they create their own machine learning platforms, their own model serving API, things like that. And like the cloud providers on AWS, we have SageMaker. They are trying to cover many parts of the tech science lifecycle. And on Google Cloud, we have AI Platform, which is really similar to SageMaker.Jeremy: Right. And what are you doing at Wise? Are you using one of those tools? Are you building something custom?Alexa: Yeah, it's a mix actually. We have some custom bits. We have a custom API, serving API, for serving models. But for model training, we are using many things. We are using SageMaker, Notebooks. And we are also experimenting with SageMaker endpoints, which are actually serverless model serving endpoints. And we are also using EMR for model training and data preparation, so some Spark-based things, a bit more traditional type of model training. So it's quite a mix.Jeremy: Right. Right. So I am not well-versed in machine learning. I know just enough to be dangerous. And so I think that what would be really interesting, at least for me, and hopefully be interesting to listeners as well, is just talk about some of these standard tools. So you mentioned things like TensorFlow and then Kubeflow, which I guess is that end-to-end piece of it, but if you're ... Just how do you start? How do you go from, I guess, building and training a model to then productizing it and getting that out? What's that whole workflow look like?Alexa: So, actually, the data science workflow I mentioned, the first bit is that experimentation, which is really iterative, really free, so you just try to find a good model. And then, when you found a good model architecture and you know that you are going to receive new data, let's say, I don't know, I have a day, or whatever, I have a week, then you need to build out a retraining pipeline. And that is, I think, what the productionization of a model really means, that you can build a retraining pipeline, which can automatically pick up new data and then prepare that new data, retrain the model on that data, and release that model into production automatically. So I think that means productionization really.Jeremy: Right. Yeah. And so by being able to build and train a model and then having that process where you're getting that feedback back in, is that something where you're just taking that data and assuming that that is right and fits in the model or is there an ongoing testing process? Is there supervised learning? I know that's a buzzword. I'm not even sure what it means. But those ... I mean, what types of things go into that retraining of the models? Is it something that is just automatic or is it something where you need constant, babysitting's probably the wrong word, but somebody to be monitoring that on a regular basis?Alexa: So monitoring is definitely necessary, especially, I think when you trained your model and you shouldn't release automatically in production just because you've trained a new data. I mentioned this shadow model thing a bit. Usually, after you retrained the model and this retraining pipeline, then you release that model into shadow mode; and then you will serve that model in parallel to your actual product production model, and then you will check the results from your new model against your production model. And that's a manual thing, you need to ... or maybe you can automate it as well, actually. So if it performs like ... If it is comparable with your production model or if it's even better, then you will replace it.And also, in terms of the data quality in the beginning, you should definitely monitor that. And I think that's quite custom, really depends on what kind of data you work with. So it's really important to test your data. I mean, there are many ... This space is also quite crowded. There are many tools that you can use to monitor your distribution of your data and see that the new data is actually corresponds to your already existing data set. So there are many bits that you can monitor in this whole retraining pipeline, and you should monitor.Jeremy: Right. Yeah. And so, I think of some machine learning like use cases of like sentiment analysis, for example... looking at tweets or looking at customer service conversations and trying to rate those things. So when you say monitoring or running them against a shadow model, is that something where ... I mean, how do you gauge what's better, right? if you've got a shadow... I mean, what's the success metric there as to say X number were classified as positive versus negative sentiment? Is that something that requires human review or some sampling for you to kind of figure out the quality of the success of those models?Alexa: Yeah. So actually, I think that really depends on the use case. For example, when you are trying to catch fraudsters, your false positive rate and true positive rate, these are really important. If your true positive rate is higher that means, oh, you are catching more fraudsters. But let's say your new model, with your model, also the false positive rate is higher, which means that you are catching more people who are actually not fraudsters, but you have more work because I guess that's a manual process to actually check those people. So I think it really depends on the use case.Jeremy: Right. Right. And you also said that the markets a little bit flooded and, I mean, I know of SageMaker and then, of course, there's all these tools like, what's it called, Recognition, a bunch of things at AWS, and then Google has a whole bunch of the Vision API and some of these things and Watson's Natural Language Processing over at IBM and some of these things. So there's all these different tools that are just available via an API, which is super simple and great for people like me that don't want to get into building TensorFlow models and things like that. So is there an advantage to building your own models beyond those things, or are we getting to a point where with things like ... I mean, again, I know SageMaker has a whole library of models that are already built for you and things like that. So are we getting to a point where some of these models are just good enough off the shelf or do we really still need ... And I know there are probably some custom things. But do we still really need to be building our own models around that stuff?Alexa: So to be honest, I think most of the data scientists, they are using off-the-shelf models, maybe not the serverless API type of models that Google has, but just off-the-shelf TensorFlow models or SageMaker, they have these built-in containers for some really popular model architectures like XGBoost, and I think most of the people they don't tweak these, I mean, as far as I know. I think they just use them out of the box, and they really try to tweak the data instead, the data that they have, and try to have these off-the-shelf models with higher and higher quality data.Jeremy: So shape the data to fit the model as opposed to the model to fit the data.Alexa: Yeah, exactly. Yeah. So you don't actually have to know ... You don't have to know how those models work exactly. As long as you know what the input should be and what output you expect, then I think you're good to go.Jeremy: Yeah, yeah. Well, I still think that there's probably a lot of value in tuning the models though against your particular data sets.Alexa: Yeah, right. But also there are services for hyperparameter tuning. There are services even for neural architecture search, where they try a lot of different architectures for your data specifically and then they will tell you what is the best model architecture that you should use and same for the hyperparameter search. So these can be automated as well.Jeremy: Yeah. Very cool. So if you are hosting your own version of this ... I mean, maybe you'll go back to the MLOps piece of this. So I would assume that a data scientist doesn't want to be responsible for maintaining the servers or the virtual machines or whatever it is that it's running on. So you want to have this workflow where you can get your models trained, you can get them into production, and then you can run them through this loop you talked about and be able to tweak them and continue to retrain them as things go through. So on the other side of that wall, if we want to put it that way, you have your ops people that are running this stuff. Is there something specific that ops people need to know? How much do they need to know about ML, as opposed to ... I mean, the data scientists, hopefully, they know more. But in terms of running it, what do they need to know about it, or is it just a matter of keeping a server up and running?Alexa: Well, I think ... So I think the machine learning pipelines are not yet as standardized as a traditional software engineering pipeline. So I would say that you have to have some knowledge of machine learning or at least some understanding of how this lifecycle works. You don't actually need to know about research and things like that, but you need to know how this whole lifecycle works in order to work as an ops person who can automate this. But I think the software engineering skills and DevOps skills are the base, and then you can just build this knowledge on top of that. So I think it's actually quite easy to pick this up.Jeremy: Yeah. Okay. And what about, I mean, you mentioned this idea of a lot of data scientists aren't actually writing the models, they're just using the preconfigured model. So I guess that begs the question: How much does just a regular person ... So let's say I'm just a regular developer, and I say, "I want to start building machine learning tools." Is it as easy as just pulling a model off the shelf and then just learning a little bit more about it? How much can the average person do with some of these tools out of the box?Alexa: So I think most of the time, it's that easy, because usually the use cases that someone tries to tackle, those are not super edge cases. So for those use cases, there are already models which perform really well. Especially if you are talking about, I don't know, supervised learning on tabular data, I think you can definitely find models that will perform really well off the shelf on those type of datasets.Jeremy: Right. And if you were advising somebody who wanted to get started... I mean, because I think that I think where it might come down to is going to be things like pricing. If you're using Vision API and you're maybe limited on your quota, and then you can ... if you're paying however many cents per, I guess, lookup or inference, then that can get really expensive as opposed to potentially running your own model on something else. But how would you suggest that somebody get started? Would you point them at the APIs or would you want to get them up and running on TensorFlow or something like that?Alexa: So I think, actually, for a developer, just using an API would be super easy. Those APIs are, I think ... So getting started with those APIs just to understand the concepts are very useful, but I think getting started with Tensorflow itself or just Keras, I definitely I would recommend that, or just use scikit-learn, which is a more basic package for more basic machine learning. So those are really good starting points. And there are so many tutorials to get started with, and if you have an idea of what you would like to build, then I think you will definitely find tutorials which are similar to your own use case and you can just use those to build your custom pipeline or model. So I would say, for developers, I would definitely recommend jumping into TensorFlow or scikit-learn or XGBoost or things like that.Jeremy: Right, right. And how many of these models exist? I mean, are we talking there's 20 different models or are we talking there's 20,000 models?Alexa: Well, I think ... Wow. Good question. I think we are more towards today maybe not 20,000, but definitely many thousands, I think. But there are popular models that most of the people use, and I think there are maybe 50 or 100 models that are the most popular and most companies use them and you are probably fine just using those for any use case or most of the use cases.Jeremy: Right. Now, and speaking of use cases, so, again, I try to think of use cases or machine learning and whether it's classifying movies into genres or sentiment analysis, like I said, or maybe trying to classify news stories, things like that. Fraud detection, you mentioned. Those are all great use cases, but what are ... I know you've worked on a bunch of projects. So what are some of the projects that you've done and what were the use cases that were being solved there, because I find these to be really interesting?Alexa: Yeah. So I think a nice project that I worked on was a project with Lush, which is a cosmetics company. They manufacture like soaps and bath bombs. And they have this nice mission that they would like to eliminate packaging from their shops. So they asked us, when I worked at Datatonic, we worked on a small project with them. They asked us to create an image recognition model, to train one, and then create a retraining pipeline that they can use afterwards. So they provided us with many hundred thousand images of their products, and they made photos from different angles with different lightings and all of that, so really high-quality image data set of all their products.And then, we used a mobile net model, because they wanted this model to be built-in into their mobile application. So when users actually use this model, they download it with their mobile application. And then, they created a service called Lush [inaudible], which you can use from within their app. And then, people can just scan the products and they can see the ingredients and how-to-use guides and things like that. So this is how they are trying to eliminate all kinds of packaging from their shops, that they don't actually need to put the papers there or put packaging with ingredients and things like that.And in terms of what we did on the technical side, so as I mentioned, we used a mobile net model, because we needed to quantize the model in order to put it on a mobile device. And we used TF Lite to do this. TF Lite is specifically for models that you want to run on an edge device, like a mobile phone. So that was already a constraint. So this is how we picked a model. I think, back then, like there were only a few model architectures supported by TF Lite, and I think there were only two, maybe. So we picked MobileNet, because it had a smaller size.And then, in terms of the retraining, so we automated the whole workflow with Cloud Composer on Google Cloud, which is a managed version of Apache Airflow, the open source scheduling package. The training happened on AI Platform, which is Google Cloud's SageMaker.Jeremy: Yeah.Alexa: Yeah. And what else? We also had an image pre-processing step just before the training, which happened on Dataflow, which is an auto-scaling processing service on Google Cloud. And after we trained the model, we just saved the model active artifact in a bucket, and then ... I think we also monitored the performance of the model, and if it was good enough, then we just shipped the model to developers who actually they manually updated the model file that went into the application that people can download. So we didn't really see if they use any shadow model thing or anything like that.Jeremy: Right. Right. And I think that is such a cool use case, because, if I'm hearing you right, there were just like a bar soap or something like that with no packaging, no nothing, and you just hold your mobile phone camera up to it or it looks at it, determines which particular product is, gives you all that ... so no QR codes, no bar codes, none of that stuff. How did they ring them up though? Do you know how that process worked? Did the employees just have to know what they were or did the employees use the app as well to figure out what they were billing people for?Alexa: Good question. So I think they wanted the employees as well to use the app.Jeremy: Nice.Alexa: Yeah. But when the app was wrong, then I don't know what happened.Jeremy: Just give them a discount on it or something like that. That's awesome. And that's the thing you mentioned there about ... Was it Tensor Lite, was it called?Alexa: TF Lite. Yeah.Jeremy: TF Lite. Yes. TensorFlow Lite or TF Lite. But, basically, that idea of being able to really package a model and get it to be super small like you said. You said edge devices, and I'm thinking serverless compute at the edge, I'm thinking Lambda functions. I'm thinking other ways that if you could get your models small enough in package, that you could run it. But that'd be a pretty cool way to do inference, right? Because, again, even if you're using edge devices, if you're on an edge network or something like that, if you could do that at the edge, that'd be a pretty fast response time.Alexa: Yeah, definitely. Yeah.Jeremy: Awesome. All right. So what about some other stuff that you've done? You've mentioned some things about fraud detection and things like that.Alexa: Yeah. So fraud detection is a use case for Wise. As I mentioned, Wise services international money transfer, one of its services. So, obviously, if you are doing anything with money, then a full use case is for sure that you will have. So, I mean, in terms of ... I don't actually develop models at Wise, so I don't know actually what models they use. I know that they use H2O, which is a Spark-based library that you can use for model training. I think it's quite an advanced library, but I haven't used it myself too much, so I cannot talk about that too much.But in terms of the workflow, it's quite similar. We also have Airflow to schedule the retraining of the models. And they use EMR for data preparation, so quite similar to Dataflow, in a sense. A Spark-based auto-scaling cluster that processes the data and then, they train the models on EMR as well but using this H2O library. And then in the end, when they are happy with the model, we have this tool that they can use for releasing shadow models in production. And then, if they are satisfied with the performance of the model that they can actually release into production. And at Wise, we have a custom micro service, a custom API, for serving models.Jeremy: Right. Right. And that sounds like you need a really good MLOps flow to make all that stuff work, because you just have a lot of moving parts there, right?Alexa: Yeah, definitely. Also, I think we have many bits that could be improved. I think there are many bits that still a bit manual and not streamlined enough. But I think most of the companies struggle with the same thing. It's just we don't yet have those best practices that we can implement, so many people try many different things, and then ... Yeah, so I think it's still a work in progress.Jeremy: Right. Right. And I'm curious if your economics background helps at all with the fraud and the money laundering stuff at all?Alexa: No.Jeremy: No. All right. So what about you worked in another data engineering project for Vodafone, right?Alexa: Yeah. Yeah, so that was a data engineering project purely, so we didn't do any machine learning. Well, Vodafone has their own Google Analytics library that they use in all their websites and mobile apps and things like that and that sense Clickstream data to a server in a Google Cloud Platform Project, and we consume that data in a streaming manner from data flows. So, basically, the project was really about processing this data by writing an Apache Beam pipeline, which was always on and always expected messages to come in. And then, we dumped all the data into BigQuery tables, which is data warehouse in Google Cloud. And then, these BigQuery tables powered some of the dashboards that they use to monitor the uptime and, I don't know, different metrics for their websites and mobile apps.Jeremy: Right. But collecting all of that data is a good source for doing machine learning on top of that, right?Alexa: Yeah, exactly. Yeah. I think they already had some use cases in mind. I'm not sure if they actually done those or not, but it's a really good base for machine learning, what we collected the data there in BigQuery, because that is an analytical data warehouse, so some analysts can already start and explore the data as a first step of the machine learning process.Jeremy: Right. I would think anomaly detection and things like that, right?Alexa: Yeah, exactly.Jeremy: Right. All right. Well, so let's go on and talk about serverless a little bit more, because I know I saw you do a talk where you were you ran some experiments with serverless. And so, I'm just kind of curious, where are the limitations that you see? And I know that there continues ... I mean, we now have EFS integration, and we've got 10 gigs of memory for lambda functions, you've even got Cloud Run, which I don't know how much you could do with that, but where's still some of the limitations for running machine learning in a serverless way, I guess?Alexa: So I think, actually, from this data science lifecycle, many bits, there are Cloud providers offer a lot of serverless options. For data preparation, there is Dataflow, which is, I think, kind of like serverless data processing service, so you can use that for data processing. For model training, there is ... Or the SageMaker and AI Platform, which are kind of serverless, because you don't actually need to provision these clusters that you train your models on. And for model serving, in SageMaker, there are the serverless model endpoints that you can deploy. So there are many options, I think, for serverless in the machine learning lifecycle.In my experience, many times, it's a cost thing. For example, at Wise, we have this custom model serving API, where we serve all our models. And if they would use SageMaker endpoints, I think, a single SageMaker endpoint is about $50 per month, that's the minimum price, and that's for a single model and a single endpoint. And if you have thousands of models, then your price can go up pretty quickly, or maybe not thousands, but hundreds of models, then your price can go up pretty quickly. So I think, in my experience, limitation could be just price.But in terms of ... So I think, for example, if I compare Dataflow with a spark cluster that you program yourself, then I would definitely go with Dataflow. I think it's just much easier and maybe cost-wise as well, you might be better off, I'm not sure. But in terms of comfort and developer experience, it's a much better experience.Jeremy: Right. Right. And so, we talked a little bit about TF Lite there. Is that something possible where maybe the training piece of it, running that on Functions as a Service or something like that maybe isn't the most efficient or cost-effective way to do that, but what about running models or running inference on something like a Lambda function or a Google Cloud function or an Azure function or something like that? Is it possible to package those models in a way that's small enough that you could do that type of workload?Alexa: I think so. Yeah. I think you can definitely make inference using a Lambda function. But in terms of model training, I think that's not a ... Maybe there were already experiments for, I'm sure there were. But I think it's not the kind of workload that would fit for Lambda functions. That's a typical parallelizable, really large-scale workloads for ... You know the MapReduce type of data processing workloads? I think those are not necessarily fit for Lambda functions. So I think for model training and data preparation, maybe those are not the best options, but for model inference, definitely. And I think there are many examples using Lambda functions for inference.Jeremy: Right. Now, do you think that ... because this is always something where I find with serverless, and I know you're more of a data scientist, ML expert, but I look at serverless and I question whether or not it needs to handle some of these things. Especially with some of the endpoints that are out there now, we talked about the Vision API and some of the other NLP things, are we putting in too much effort maybe to try to make serverless be able to handle these things, or is it just something where there's a really good way to handle these by hosting your ... I mean, even if you're doing SageMaker, maybe not SageMaker endpoints, but just running SageMaker machines to do it or whatever, are we trying too hard to squeeze some of these things into a serverless environment?Alexa: Well, I don't know. I think, as a developer, I definitely prefer the more managed versions of these products. So the less I need to bother with, "Oh, my cluster died and now we need to rebuild a cluster of things," and I think serverless can definitely solve that. I would definitely prefer the more managed version. Maybe not serverless, because, for some of the use cases or some of the bits from the lifecycle, serverless is not the best fit, but a managed product is definitely something that I prefer over a non-managed product.Jeremy: Right. And so, I guess one last question for you here, because this is something that always interests me. Just there are relevant things that we need machine learning for. I mean, I think the fraud detection is a hugely important one. Sentiment analysis, again. Some of those other things are maybe, I don't know, I shouldn't call them toy things, but personalization and some of the things, they're all really great things to have, and it seems like you can't build an application now without somebody wanting some piece of that machine learning in there. So do you see that as where we are going where in the future, we're just going to have more of these APIs?I mean, out of AWS, because I'm more familiar with the AWS ecosystem, but they have Personalize and they have Connect and they have all these other services, they have the recommendation engine thing, all these different services ... Lex, or whatever, that will read text, natural language processing and all that kind of stuff. Is that where we're moving to just all these pre-trained, canned products that I can just access via an API or do you think that if you're somebody getting started and you really want to get into the ML world that you should start diving into the TensorFlows and some of those other things?Alexa: So I think if you are building an app and your goal is not to become an ML engineer or a data scientist, then these canned models are really useful because you can have a really good recommendation engine in your product, you could have really good personalization engine in your product, things like that. And so, those are, I think, really useful and you don't need to know any machine learning in order to use them. So I think we definitely go into that direction, because most of the companies won't hire data scientists just to train a recommender model. I think it's just easier to use an API endpoint that is already really good.So I think, yeah, we are definitely heading into that direction. But if you are someone who wants to become a data scientist or wants to be more involved with MLOps or machine learning engineering, then I think jumping into TensorFlow and understanding, maybe not, as we discussed, not getting into the model architectures and things like that, but just understanding the workflow and being able to program a machine learning pipeline from end to end, I think that's definitely recommended.Jeremy: All right. So one last question: If you've ever used the Watson NLP API or the Google Vision API, can you put on your resume that you're a machine learning expert?Alexa: Well, if you really want to do that, I would give it a go. Why not?Jeremy: All right. Good. Good to know. Well, Alexa, thank you so much for sharing all this information. Again, I find the use cases here to be much more complex than maybe some of the surface ones that you sometimes hear about. So, obviously, machine learning is here to stay. It sounds like there's a lot of really good opportunities for people to start kind of dabbling in it and using that without having to become a machine learning expert. But, again, I appreciate your expertise. So if people want to find out more about you or more about the things you're working on and datastack.tv, things like that, how do they do that?Alexa: So we have a Twitter page for datastack.tv, so feel free to follow that. I also have a Twitter page, feel free to follow me, account, not page. There is a datastack.tv website, so it's just datastack.tv. You can go there, and you can check out the courses. And also, we have created a roadmap for data engineers specifically, because there was no good roadmap for data engineers. I definitely recommend checking that out, because we listed most of the tools that a data engineer and also machine learning engineer should know about. So if you're interested in this career path, then I would definitely recommend checking that out. So under datastack.tv's GitHub, there is a roadmap that you can find.Jeremy: Awesome. All right. And that's just, like you said, datastack.tv.Alexa: Yes.Jeremy: I will make sure that we get your Twitter and LinkedIn and GitHub and all that stuff in there. Alexa, thank you so much.Alexa: Thanks. Thank you.

Drill to Detail Ep.88 'Superset, Preset and the Future of Business Intelligence' with Special Guest Maxime Beauchemin

Talk Python To Me - Python conversations for passionate developers

Play Episode Listen Later Apr 12, 2021 43:55

Maxime Beauchemin returns to the Drill to Detail Podcast and joins Mark Rittman to talk about what's new with Apache Airflow 2.0, the origin story for Apache Superset and now Preset.io, why the future of business intelligence is open source and news on Marquez, a reference implementation of the OpenLineage open source metadata service for the collection, aggregation, and visualization of a data ecosystem's metadata sponsored by WeWork.The Rise of the Data EngineerDrill to Detail Ep.26 'Airflow, Superset & The Rise of the Data Engineer' with Special Guest Maxime BeaucheminApache Airflow 2.0 is here!Apache Superset is a modern data exploration and visualization platformThe Future of Business Intelligence is Open SourcePowerful, easy to use data exploration and visualization platform, powered by Apache Superset™Admunsen: Open source data discovery and metadata engineOpenLineageMarquez: Collect, aggregate, and visualize a data ecosystem's metadata

detail drill wework marquez business intelligence data engineers supersets preset apache airflow maxime beauchemin mark rittman

E21 - Grit, Passion, and Intellectual Curiosity with Andrew Ettinger, Astronomer

Tech Sales Insights

Play Episode Listen Later Mar 17, 2021 33:03

He is currently the Chief Revenue Officer of Astronomer which is the commercial developer behind the popular open-source project Apache Airflow. With a deep passion for helping emerging and disruptive technology companies build sustainable teams that thrive and deliver significant value to the marketplace, he has a long track record of sales success. He most recently spent 10 years at Pivotal Software leading them from $0-500M in ARR in 4 years to an IPO. He is very active in the start-up community both investing and advising companies on the best Go-To-Market strategies and plans to effectively build, operate, and scale organizations. Join Randy Seidl and David Nour on this episode of the Sales Community's #TechSalesInsights podcast with Andrew Ettinger. BTW, three quick points: Andrew will be our guest on LinkedIn Live today at Noon ET. Join us for a live "Ask Me Anything" session. We turn these podcast interviews into more in-depth articles, so check them out at SalesCommunity.com We have some fabulous guests joining us in the coming weeks, so hope you'll subscribe wherever you consume podcasts or at SalesCommunity.com/Events. Send in a voice message: https://anchor.fm/salescommunity/message

passion events curiosity grit ipo btw intellectual ask me anything arr chief revenue officer astronomers 500m go to market ettinger intellectual curiosity apache airflow pivotal software

#302 The Data Engineering Landscape in 2021

Play Episode Listen Later Feb 4, 2021 64:33

I'm sure you're familiar with data science. But what about data engineering? Are these the same or how are they related? Data engineering is dedicated to overcoming data-processing bottlenecks, data cleanup, data flow and data-handling problems for applications that utilize lots of data. On this episode, we welcome back Tobias Macey to give us the 30,000 ft view of the data engineering landscape in 2021. Links from the show Live Stream Recordings: YouTube: youtube.com Tobias Macey: boundlessnotions.com Podcast.__init__: pythonpodcast.com Data Engineering podcast: dataengineeringpodcast.com Designing Data-Intensive Applications Book: amazon.com wally: github.com lakeFS: lakefs.io A Beginner’s Guide to Data Engineering: medium.com Apache Airflow: airflow.apache.org Dagster: dagster.io Prefect: prefect.io #68 Crossing the streams with Podcast.__init__: talkpython.fm/68 dbt: getdbt.com Great Expectations: github.com Dask: dask.org Meltano: meltano.com Languages trends on StackOverflow: insights.stackoverflow.com DVC: dvc.org Pandas: pandas.pydata.org Sponsors Datadog Retool Talk Python Training

Episode 96 – re:Invent is here with presents for everyone!

The Cloud Pod

Play Episode Listen Later Dec 8, 2020 69:29

Santa arrived early and he brought all the goods with him to The Cloud Pod this week. The team dives into all the big announcements from AWS re:invent 2020. A big thanks to this week's sponsor: Foghorn Consulting, which provides full-stack cloud solutions with a focus on strategy, planning and execution for enterprises seeking to take advantage of the transformative capabilities of AWS, Google Cloud and Azure. This week's highlights Amazon flips the bird at Microsoft with its Babelfish announcement. AWS is angling for a free Jeep Wrangler with its new service. AWS is helping customers get out of the sticky situation they're in and don't know it. Amazon Web Services: Thankfully They Didn't Ruin Our Predictions Amazon launches managed workflows for Apache Airflow to simplify data processing pipelines. Interesting to see it giving some alternative options. AWS Lambda now has Code Signing, a trust and integrity control to confirm code is unaltered and from a trusted publisher. Not a nice way to start Thanksgiving if you are Palo Alto. Amazon announces centr

amazon thanksgiving microsoft santa aws reinvent palo alto azure google cloud jeep wrangler aws lambda babel fish apache airflow cloud pod foghorn consulting

Introducing Apache Airflow 2.0 - talking about the major new features with Ash Berlin-Taylor and Kaxil Naik

The Tech Trek

Play Episode Listen Later Dec 4, 2020 30:13

Meet: Ash Berlin-Taylor is an active committer and member of the PMC for Apache Airflow and Director of Airflow Engineering team at astronomer.io. Before getting involved with Airflow he spent many years working in (web) infrastructure Kaxil Naik is currently the Committer, PMC Member, and Release Manager of Apache Airflow. He is also the Manager of Airflow Engineering team @ Astronomer. He did his Masters in Data Science & Analytics from Royal Holloway, University of London. What you'll learn: Major features in the upcoming Airflow 2 release Upgrade path and compatibility What the team might be working on in future releases Learn more about the upcoming release: https://www.astronomer.io/blog/introducing-airflow-2-0 If you would like to reach out to Ash or Kaxil about anything they discussed on the podcast, please reach out to them via: https://twitter.com/kaxil https://www.linkedin.com/in/kaxil/ https://twitter.com/ashberlin

director masters berlin analytics ash data science university of london astronomers naik pmc royal holloway airflow apache airflow release manager committer

【毎日AWS #110】新しいワークフローサービス AWS Managed Workflow for Apache Airflows が登場他14件 #サバワ

サーバーワークスが送るAWS情報番組「さばラジ！」

Play Episode Listen Later Nov 26, 2020 13:08

※配信プラットフォームが停止しており配信開始遅れました、、、！最新情報を "ながら" でキャッチアップ！ラジオ感覚放送「毎日AWS！」おはようございます、サーバーワークスの加藤です。今日は 11/24 に出たアップデート15件をご紹介。感想は Twitter にて「#サバワ」をつけて投稿してください！ ■ UPDATE ラインナップ Amazon Managed Workflows for Apache Airflow が登場 Amazon CloudWatch Application Insights が自動アプリケーション検出をサポート Amazon Braket が手動キュービット割当をサポート Amazon CloudWatch Synthetics が Python と Selenium に対応 Amazon Elasticsearch Service が Elasticsearch のバージョン 7.9 に対応 Amazon Elasticsearch Service が Remote Reindex をサポート AWS Storage Gateway Tape Gateway が IBM Spectrum Protect 8.1.10 をサポート AWS Lambda が Advanced Vector Extensiosn 2 をサポート AWS Lambda が Amazon SQS のバッチ処理の待機に対応 Amazon FSx for Lustre がファイルシステムストレージの拡張に対応 AWS Glue がワークロードパーティショニングをサポート AWS Secrets Manager が秒間リクエスト数を 5000 に拡張 Amazon Comprehend Events を発表 Amazon ECS Cluster Auto Scaling がより応答性の高いスケーリングを提供 Amazon RDS for SQL Server がビジネスインテリジェンススイートをサポート ■ サーバーワークスSNS Twitter / Facebook ■ サーバーワークスブログサーバーワークスエンジニアブログ

aws workflow managed apache sql server lustre amazon rds apache airflow aws glue aws secrets manager

Apache Airflow with Maxime Beauchemin, Vikram Koka, and Ash Berlin-Taylor

Podcast – Software Engineering Daily

Play Episode Listen Later Jun 10, 2020 64:22

Apache Airflow was released in 2015, introducing the first popular open source solution to data pipeline orchestration. Since that time, Airflow has been widely adopted for dependency-based data workflows. A developer might orchestrate a pipeline with hundreds of tasks, with dependencies between jobs in Spark, Hadoop, and Snowflake. Since Airflow’s creation, it has powered the The post Apache Airflow with Maxime Beauchemin, Vikram Koka, and Ash Berlin-Taylor appeared first on Software Engineering Daily.

berlin spark snowflakes vikram hadoop koka airflow software engineering daily apache airflow maxime beauchemin

Apache Airflow with Maxime Beauchemin, Vikram Koka, and Ash Berlin-Taylor

Data – Software Engineering Daily

Play Episode Listen Later Jun 10, 2020 64:22

Apache Airflow was released in 2015, introducing the first popular open source solution to data pipeline orchestration. Since that time, Airflow has been widely adopted for dependency-based data workflows. A developer might orchestrate a pipeline with hundreds of tasks, with dependencies between jobs in Spark, Hadoop, and Snowflake. Since Airflow’s creation, it has powered the The post Apache Airflow with Maxime Beauchemin, Vikram Koka, and Ash Berlin-Taylor appeared first on Software Engineering Daily.

berlin spark snowflakes vikram hadoop koka airflow software engineering daily apache airflow maxime beauchemin

Apache Airflow with Maxime Beauchemin, Vikram Koka, and Ash Berlin-Taylor

Software Engineering Daily

Play Episode Listen Later Jun 10, 2020 64:22

Apache Airflow was released in 2015, introducing the first popular open source solution to data pipeline orchestration. Since that time, Airflow has been widely adopted for dependency-based data workflows. A developer might orchestrate a pipeline with hundreds of tasks, with dependencies between jobs in Spark, Hadoop, and Snowflake. Since Airflow’s creation, it has powered the The post Apache Airflow with Maxime Beauchemin, Vikram Koka, and Ash Berlin-Taylor appeared first on Software Engineering Daily.

berlin spark snowflakes vikram hadoop koka airflow software engineering daily apache airflow maxime beauchemin

Apache Airflow with Maxime Beauchemin, Vikram Koka, and Ash Berlin-Taylor

Software Daily

Play Episode Listen Later Jun 10, 2020

Apache Airflow was released in 2015, introducing the first popular open source solution to data pipeline orchestration. Since that time, Airflow has been widely adopted for dependency-based data workflows. A developer might orchestrate a pipeline with hundreds of tasks, with dependencies between jobs in Spark, Hadoop, and Snowflake.Since Airflow's creation, it has powered the data infrastructure at companies like Airbnb, Netflix, and Lyft. It has also been at the center of Astronomer, a startup that helps enterprises build infrastructure around Airflow. Airflow is used to construct DAGs–directed acyclic graphs for managing data workflows.Maxime Beauchemin is the creator of Airflow. Vikram Koka and Ash Berlin-Taylor work at Astronomer. They join the show to talk about the state of Airflow–the purpose of the project, its use cases, and open source ecosystem.

netflix berlin airbnb spark lyft snowflakes astronomers dags vikram hadoop koka airflow apache airflow maxime beauchemin

Drill to Detail Ep.80 'Data Architecture and Data Teams at Hubspot' with Special Guest James Densmore

Play Episode Listen Later May 5, 2020 53:32

Mark Rittman is joined in this episode by Hubspot's Director of Data Infrastructure James Densmore to talk about distributed and remote-friendly data teams, DevOps with dbt and Apache Airflow and career path options for data engineers.How should I structure my data team? A look inside HubSpot, Away, M.M. LaFleur, and more (dbt Blog)Software Engineers do not Need to Become Managers to Thrive (Data Liftoff Blog)The Misunderstood Data Engineer (Data Liftoff Blog)Modular ELT (Data Liftoff Blog)Test SQL Pipelines against Production Clones using DBT and Snowflake (Dan Gooden)HubSpot Data Actions, Harvest Analytical Workflows and Looker Data Platform (Rittman Analytics Blog)

director data architecture detail drill hubspot devops dbt lafleur densmore data teams apache airflow mark rittman

Drill to Detail Ep.80 'Data Architecture and Data Teams at Hubspot' with Special Guest James Densmore

redux la f ganadora dan abramov apache airflow

Play Episode Listen Later May 5, 2020 53:32

Mark Rittman is joined in this episode by Hubspot's Director of Data Infrastructure James Densmore to talk about distributed and remote-friendly data teams, DevOps with dbt and Apache Airflow and career path options for data engineers.How should I structure my data team? A look inside HubSpot, Away, M.M. LaFleur, and more (dbt Blog)Software Engineers do not Need to Become Managers to Thrive (Data Liftoff Blog)The Misunderstood Data Engineer (Data Liftoff Blog)Modular ELT (Data Liftoff Blog)Test SQL Pipelines against Production Clones using DBT and Snowflake (Dan Gooden)HubSpot Data Actions, Harvest Analytical Workflows and Looker Data Platform (Rittman Analytics Blog)

director data architecture detail drill hubspot devops dbt lafleur densmore data teams apache airflow mark rittman

La Fórmula Ganadora

DevNights Podcast

Play Episode Listen Later Nov 8, 2019

En este episodio que transmitimos por primera ves En vivo! tuvimos una plática muy interesante y amena con nuestra invitada de lujo Yesi Díaz; platicamos un poco del tuit de Dan Abramov sobre Redux, Apache Airflow, nuevos planes disponibles en Patreon para apoyar a tu podcast favorito 😁 y en los temas principales platicamos extensamente sobre como establecer tus precios para cobrar a tus clientes, la formula ganadora de Mike, diferentes tipos de (posibles) clientes, como calcular tus tiempos de entrega y como (y donde) conseguir un contrato para que tengas una excelente relación laboral.

Podcast #4: Apache Airflow - Zuverlässige Automatisierung geschäftskritischer Workflows

Skillbyte Technologie Podcast

Play Episode Listen Later Oct 23, 2019 37:31

In diesem Podcast geht es um das Thema: Apache Airflow - Zuverlässige Automatisierung von geschäftskritischen Workflows // Inhalt // 1. Was ist Apache Airflow? 2. Wer hat Airflow entwickelt? 3. Welche Unternehmen setzen Airflow ein? 4. Wie funktioniert Airflow im Detail? 5. Welche Erfahrungen hat Skillbyte mit Apache Airflow gemacht? Abonnieren Sie diesen Podcast und besuchen Sie uns auf https://www.skillbyte.de Feedback und Fragen gerne an podcast@skillbyte.de

gesch detail workflows automatisierung welche erfahrungen zuverl airflow abonnieren sie welche unternehmen apache airflow

Airflow Breeze

pr breeze jarek airflow principal software engineer apache airflow

Play Episode Listen Later Oct 17, 2019 46:57

This week, we had the pleasure of meeting up with Jarek Potiuk, Principal Software Engineer at Polidea and Apache Airflow committer, to discuss his most recent contribution to the community, Airflow Breeze. Jarek deeply values developer productivity and realized while building a team of Airflow committers that, in order to open a PR on the project, passing unit tests and waiting for the CI build was a cumbersome process that could take up to a few hours. Breeze seeks to improve that experience for Airflow committers and lower the barrier-to-entry of contribution for folks that are new to the open-source community. You can read more about Airflow Breeze here: https://www.polidea.com/blog/its-a-breeze-to-develop-apache-airflow/#the-apache-airflow-projects-setup

RNR 138: Startup Mindset with Calvin Yu

React Native Radio

Play Episode Listen Later Oct 6, 2019 51:25

In this episode of React Native Radio, Josh Justice interviews Calvin Yu. Calvin is a consultant mostly working with Ruby on Rails but also works with React Native and mobile development. He has quite the history of working with startups, all varying in size. Calvin shares what it was like working with startup companies. Calvin explains what you have to change mentally to work in a startup. First, you have to realize that you don’t have all the answers and that it takes a commitment. He also explains that because you don’t have all the answers you will make a mistake, which means you need to be able to learn from it and move on. Josh and Calvin share their thought on using risky or bleeding edge technologies in a startup. Calvin explains that when developers are looking to join a start-up they want to work in something new, exciting and a little risky. They consider the risks and the benefits, how new technologies could give a startup a leg up on the competition. Josh brings up a blog post titled “Choose Boring Technology”, he summarizes explaining that startups should pick boring, old reliable technology for the parts of the app that don’t matter. The panel moves on to discuss React Native more specifically, Calvin explains why he chose React Native over another cross-platform mobile solution. React Native provides a great experience in the mobile platform, it allows him to give the users what they want. Josh and Calvin discuss what users want from their apps or a user's hierarchy of needs. First, the app needs to be useful, if an app isn’t useful who cares if it performs well. After making sure the app is useful, you can then go back and worry about performance and other secondary needs, Calvin shares the story of how he got into React Native. He was working on some React apps to render kiosk displays when he was approached to build an internal iOS app. The app did some internal functionality for a team of home repair contractors. At the time NativeiOS seemed like overkill for what they wanted. Not to mention they would want the same thing in Android. React Native seemed the obvious choice, so he just dove right in; learning trial by fire. Josh and Calvin consider how React Native has evolved over the years. Calvin shares some of the enduring pros and cons of the framework and explains when to reach for React Native and when to reach for something else. He makes most of his comparisons to Flutter. Flutter is great for game design and custom UI, but React Native is the ideal solution for cross-platform native applications. React Native is well-tuned for reusability. Calvin believes that the React Native ecosystem will grow because it is such an approachable language. Ruby on Rails is considered due to Josh and Calvin’s background in it. Josh considers Ruby on Rails and how it comes with everything you need right out the box but React Native is quite the opposite. This makes Josh wonder what is so appealing about React Native to Calvin. Calvin explains that he hopes that someday React Native will be ready out of the box and gives ideas of how it might get there. Calvin considers the future of software development. He believes that building applications will be pushed up in the stack. That building applications will a thing that anyone can do, just like anyone can use a spreadsheet. He thinks software development will get more approachable and easy tooling that will make building applications much simpler. He considers how comfortable his kids are with technology and touch screens and this will affect future software developers. Panelists Josh Justice Guest Calvin Yu Sponsors Adventures in DevOps React Round Up G2i CacheFly Links Choose Boring Technology blog post Hierarchy of User Needs GraphQL Airtable Coda The Core Team of the Internet (with Yehuda Katz) https://twitter.com/cyu https://github.com/cyu/ https://www.rylabs.io/ https://www.facebook.com/ReactNativeRadio/ https://twitter.com/R_N_Radio Picks Josh Justice: VuePress https://atom.io/ Visual Studio Code Calvin Yu: Visual Studio Code Live Share Apache Airflow

RNR 138: Startup Mindset with Calvin Yu

Devchat.tv Master Feed

Play Episode Listen Later Oct 6, 2019 51:25

In this episode of React Native Radio, Josh Justice interviews Calvin Yu. Calvin is a consultant mostly working with Ruby on Rails but also works with React Native and mobile development. He has quite the history of working with startups, all varying in size. Calvin shares what it was like working with startup companies. Calvin explains what you have to change mentally to work in a startup. First, you have to realize that you don’t have all the answers and that it takes a commitment. He also explains that because you don’t have all the answers you will make a mistake, which means you need to be able to learn from it and move on. Josh and Calvin share their thought on using risky or bleeding edge technologies in a startup. Calvin explains that when developers are looking to join a start-up they want to work in something new, exciting and a little risky. They consider the risks and the benefits, how new technologies could give a startup a leg up on the competition. Josh brings up a blog post titled “Choose Boring Technology”, he summarizes explaining that startups should pick boring, old reliable technology for the parts of the app that don’t matter. The panel moves on to discuss React Native more specifically, Calvin explains why he chose React Native over another cross-platform mobile solution. React Native provides a great experience in the mobile platform, it allows him to give the users what they want. Josh and Calvin discuss what users want from their apps or a user's hierarchy of needs. First, the app needs to be useful, if an app isn’t useful who cares if it performs well. After making sure the app is useful, you can then go back and worry about performance and other secondary needs, Calvin shares the story of how he got into React Native. He was working on some React apps to render kiosk displays when he was approached to build an internal iOS app. The app did some internal functionality for a team of home repair contractors. At the time NativeiOS seemed like overkill for what they wanted. Not to mention they would want the same thing in Android. React Native seemed the obvious choice, so he just dove right in; learning trial by fire. Josh and Calvin consider how React Native has evolved over the years. Calvin shares some of the enduring pros and cons of the framework and explains when to reach for React Native and when to reach for something else. He makes most of his comparisons to Flutter. Flutter is great for game design and custom UI, but React Native is the ideal solution for cross-platform native applications. React Native is well-tuned for reusability. Calvin believes that the React Native ecosystem will grow because it is such an approachable language. Ruby on Rails is considered due to Josh and Calvin’s background in it. Josh considers Ruby on Rails and how it comes with everything you need right out the box but React Native is quite the opposite. This makes Josh wonder what is so appealing about React Native to Calvin. Calvin explains that he hopes that someday React Native will be ready out of the box and gives ideas of how it might get there. Calvin considers the future of software development. He believes that building applications will be pushed up in the stack. That building applications will a thing that anyone can do, just like anyone can use a spreadsheet. He thinks software development will get more approachable and easy tooling that will make building applications much simpler. He considers how comfortable his kids are with technology and touch screens and this will affect future software developers. Panelists Josh Justice Guest Calvin Yu Sponsors Adventures in DevOps React Round Up G2i CacheFly Links Choose Boring Technology blog post Hierarchy of User Needs GraphQL Airtable Coda The Core Team of the Internet (with Yehuda Katz) https://twitter.com/cyu https://github.com/cyu/ https://www.rylabs.io/ https://www.facebook.com/ReactNativeRadio/ https://twitter.com/R_N_Radio Picks Josh Justice: VuePress https://atom.io/ Visual Studio Code Calvin Yu: Visual Studio Code Live Share Apache Airflow

Airflow in Practice with Chaim Turkel

Data – Software Engineering Daily

Play Episode Listen Later Jun 25, 2019 58:10

Apache Airflow is a system for scheduling and monitoring workflows for data engineering. Airflow can be used to schedule ETL jobs, machine learning work, and script execution. Airflow also gives a developer a high level view into the graph of dependencies for their data pipelines. Chaim Turkel is a backend data architect at Tikal. He The post Airflow in Practice with Chaim Turkel appeared first on Software Engineering Daily.

practice etl airflow tikal turkel software engineering daily apache airflow

DTL015 - Apache airflow en het gebruik van open source tools

De Dataloog

Play Episode Listen Later Mar 21, 2019 47:58

Airflow: Apache Airflow is een platform om workflows te organiseren. Workflows kunnen bijvoorbeeld zijn, het periodiek hertraininen van je modellen, overzetten van data van en naar verschillende systemen of het draaien van rapporten. Het zorgt ervoor dat je eenvoudig de afhankelijkheden van een workflow in de gaten kunt houden en actie kunt onderenemen als er iets mis gaat. Airflow wordt veel gebruikt in Data Science / Big Data projecten omdat het data pijplines kan maken die bijvoorbeeld data laten stromen van een datawarehouse naar een data lake in de cloud, of algoritmes met regelmaat kan laten draaien. Airflow is deel van de Apache community en dus open source. De Dataloog spreekt met Fokko Driesprong, die een belangrijke bijdrage levert aan Airflow en onderhand committer is van het product. We spreken over cases waar hij Airflow heeft ingezet, zoals het doen van aanbevelingen in de welbekende NPO Start app, of het leiden van data naar de Cloud. We spreken over de ook over de specifieke punten die het gebruik van open source tools met zich mee brengt. We konen erachter dat niet ieder bedrijf het in zich heeft om open source te implementeren, aldus Fokko. En niet onbelangrijk, de kettingvraag hoe je voorkomt dat open source open sores wordt. Uiteraard zijn de shownotes weer op de https://www.dedataloog.nl/uitzending/dtl014-apache-ai…source-community/ pagina te vinden.

tools cloud open source apache workflows enhet uiteraard gebruik airflow apache airflow fokko

ajitofm 28: Hacking Newspaper

ajitofm

Play Episode Listen Later Jul 1, 2018 78:42

yosukepさん、sisidovskiさん、sugimoto1981さん、makogaさんと日経新聞電子版、Fastly、App Engine、PWA、パフォーマンス測定などについて話しました。経済、株価、ビジネス、政治のニュース:日経電子版 r.nikkei.com Blog — HACK The Nikkei Web SQL Database 日経電子版の“爆速化”を内製で成し遂げた精鋭チームとその手法、効果指標について - ITmedia マーケティング日経電子版　開発内製化の取り組み / nikkei web development 2015 - Speaker Deck 日経、ＦＴ買収を完了　経済メディアで世界最大に　　：日本経済新聞 Andrew Betts | Principal developer advocate at Fastly and member of W3C TAG 日経電子版サイト高速化とPWA対応 / nikkei-high-performance-pwa - Speaker Deck How to solve anything in VCL, part 3: authentication and feature flags at the edge Solving anything in VCL PythonでもPythonじゃなくても使える汎用的なMicroservice実行環境 / nikkei microservice - Speaker Deck Labeled Tab-separated Values (LTSV) RUMとA/Bテストを使ったパフォーマンスのモニタリング — HACK The Nikkei Navigation Timing 若者はみんな使っている？　謎のワード「ギガが減る」とは (1/2) - ITmedia NEWS SpeedCurve: Monitor front-end performance A faster FT.com 日経電子版を支える広告技術 — HACK The Nikkei Intersection Observer API - Web APIs | MDN MutationObserver - Web APIs | MDN API gateway pattern Amazon API Gateway（API を簡単に作成・管理） | AWS 生産性を向上させる情報共有ツール - キータチーム（Qiita:Team） Sam Newman - Backends For Frontends google/gvisor: Container Runtime Sandbox Apache Airflow (incubating) Documentation — Airflow Documentation rundeck/rundeck: Enable Self-Service Operations: Give specific users access to your existing tools, services, and scripts Microservices at Mercari ハイパーリンクを貼るだけで著作権料がかかる通称「リンク税」がEUで導入されようとしている - GIGAZINE ASCII.jp：新聞社が「無断リンク」を禁止する3つの理由｜編集者の眼 (2010) Do Not Track - Wikipedia Do Not Track and the GDPR | W3C Blog Intelligent Tracking Prevention 2.0 | WebKit 壁新聞 - Wikipedia 日経のあゆみ : 企業情報 | 日本経済新聞社テキストの検出 - Amazon Rekognition Vision API - 画像コンテンツ分析 | Google Cloud 日経テレコン - 新聞・雑誌記事のビジネスデータベース HACK The Nikkei フィードバックもお待ちしております！ https://ajito.fm/form/ または Twitter: #ajitofm までどうぞ。

wikipedia hacking api newspapers google cloud microservices pwa fastly ascii mercari webkit do not track 9e amazon rekognition apache airflow vcl itmedia intelligent tracking prevention

Use Cases

head san diego ing astronomers use cases zapier bruin cdo advanced analytics data engineers airflow principal software engineer wepay apache airflow

Play Episode Listen Later Feb 22, 2018 75:47

Episode 2 of The Airflow Podcast is here to discuss six specific use cases that we've seen for Apache Airflow. Here's the lineup: Patrick Atwater (@patwater), Water Data Projects Manager at ARGO Labs: 2:03-5:35 Maksime Pecherskiy (@mrmaksimize), CDO of San Diego: 5:35-23:06 Scott Halgrim (@shalgrim), Data Engineer at Zapier: 23:06-27:27 Bolke de Bruin (@bolke2028), Head of Advanced Analytics at ING: 27:27-39:46 Chris Riccomini (@criccomini), Principal Software Engineer at WePay: 39:46-54:20 Ben Gregory (@benbeingbin), Data Engineer (and noted craft soda enthusiast) at Astronomer: 54:20-1:14:38 Contribute to our open-source library of Airflow plugins at github.com/airflow-plugins Contact us at www.astronomer.io if you're interested in Spacecamp: A guided development program to get your team up and running on Airflow.

The Origins of Airflow

Play Episode Listen Later Feb 6, 2018 45:09

For the first episode of the Airflow Podcast, we met up with Maxime Beauchemin, creator of Airflow, to explore the motivations behind its creation and the problems it was designed to solve. We asked Maxime for his definition of Airflow, the design principles behind hook/operator use, and his vision for the project. Speaker list: Pete DeJoy - Product at Astronomer Viraj Parekh - Data Engineer at Astronomer Maxime Beauchemin - Software Engineer at Lyft, creator of Airflow Talk mentioned at the end of the podcast- Advanced Data Engineering Patterns with Apache Airflow: http://www.ustream.tv/recorded/109227704 Maxime's Blog: https://medium.com/@maximebeauchemin

speaker blog origins lyft maxime airflow apache airflow maxime beauchemin

Season One Teaser

ing advanced analytics apache airflow

Play Episode Listen Later Jan 18, 2018 3:02

A sneak peek at our upcoming podcast about Apache Airflow. Featured in this clip (in order of appearance): Pete DeJoy - Product Specialist at Astronomer Patrick Atwater - Water Data Projects Manager at ARGO Labs Maksime Pecherskiy - Chief Data Officer of the City of San Diego Bolke de Bruin - Head of Advanced Analytics at ING

Data Pipelines at Zymergen with Airflow with Erin Shellman - TWiML Talk #41

This Week in Machine Learning & Artificial Intelligence (AI) Podcast

Play Episode Listen Later Aug 4, 2017 36:08

The show you’re listening to features my interview with Erin Shellman. Erin is a statistician and data science manager with Zymergen, a company using robots and machine learning to engineer better microbes. If you’re wondering what exactly that means, I was too, and we talk about it in the interview. Our conversation focuses on Zymergen’s use of Apache Airflow, an open-source data management platform originating at Airbnb, that Erin and her team uses to create reliable, repeatable data pipelines for its machine learning applications. A quick note before we dive in: As is the case with my other field recordings, there’s a bit of unavoidable background noise in this interview. Sorry about that! The show notes for this episode can be found at https://twimlai.com/talk/41

data airbnb pipelines airflow apache airflow zymergen twiml

Sid Anand on Building Agari’s Cloud-native Data Pipelines with AWS Kinesis and Serverless

The InfoQ Podcast

Play Episode Listen Later Jun 9, 2017 25:38

Wesley Reisz talks to Sid Anand, a data architect at cybersecurity company Agari, about building cloud-native data pipelines. The focus of their discussion is around a solution Agari uses that is built from Amazon Kinesis Streams, serverless functions, and auto scaling groups. Sid Anand is an architect at Agari, and a former technical architect at eBay, Netflix, and LinkedIn. He has 15 years of data infrastructure experience at scale, is a PMC for Apache Airflow, and is also a program committee chair for QCon San Francisco and QCon London. Why listen to this podcast - Real-time data pipeline processing is very latency sensitive - Micro-batching allows much smaller amounts of data to be processed - Use the appropriate data store (or stores) to support the use of the dataIngesting data quickly into a clean database with minimal indexes can be fast - Communicate using a messaging system that supports schema evolution More on this: Quick scan our curated show notes on InfoQ http://bit.ly/2rJU9nB You can also subscribe to the InfoQ newsletter to receive weekly updates on the hottest topics from professional software development. bit.ly/24x3IVq Subscribe: www.youtube.com/infoq Like InfoQ on Facebook: bit.ly/2jmlyG8 Follow on Twitter: twitter.com/InfoQ Follow on LinkedIn: www.linkedin.com/company/infoq Want to see extented shownotes? Check the landing page on InfoQ: http://bit.ly/2rJU9nB

netflix real data micro ebay communicate anand pipelines serverless pmc cloud native kinesis infoq apache airflow agari

Drill to Detail Ep.26 'Airflow, Superset & The Rise of the Data Engineer' with Special Guest Maxime Beauchemin

Play Episode Listen Later May 15, 2017 60:32

Mark Rittman is joined by Maxime Beauchemin to talk about analytics and data integration at Airbnb, the Apache Airflow and Airbnb Superset open-source projects, and his recent Medium article on "The Rise of the Data Engineer"

airbnb medium detail drill data engineers supersets apache airflow maxime beauchemin mark rittman

Drill to Detail Ep.26 'Airflow, Superset & The Rise of the Data Engineer' with Special Guest Maxime Beauchemin