Podcasts about dagster

44 PODCASTS
95 EPISODES
1h 12m AVG DURATION
1 MONTHLY NEW EPISODE
Nov 26, 2025 LATEST

POPULARITY (chart by year, 2017 to 2024; image not reproduced)

Best podcasts about dagster

Data Engineering Podcast

20 episodes with dagster

The Data Stack Show

4 episodes with dagster

What's New In Data

2 episodes with dagster

Contributor

2 episodes with dagster

Data Driven

2 episodes with dagster

The Machine Learning Podcast

2 episodes with dagster

Data Coffee

2 episodes with dagster

Latest podcast episodes about dagster

Re-Air: Bridging Gaps: DevRel, Marketing Synergies, and the Future of Data with Pedram Navid of Dagster Labs

The Data Stack Show

Nov 26, 2025 · 53:43

This episode is a re-air of one of our most popular conversations from this year, featuring insights worth revisiting. Thank you for being part of the Data Stack community. Stay up to date with the latest episodes at datastackshow.com.
This week on The Data Stack Show, John and Matt welcome Pedram Navid, Chief Dashboard Officer at Dagster Labs. During the conversation, Pedram shares his career evolution from consulting to his current role, where he oversees data, developer relations (DevRel), and marketing. The discussion delves into the synergies between DevRel and marketing, emphasizing the importance of understanding developers' learning preferences. Pedram explains data orchestration, highlighting its role in managing and automating data workflows. He also discusses Dagster's unique asset-based approach, which enhances visibility and control over data processes, catering to users from novices to experts, and so much more.
Highlights from this week's conversation include:
Pedram's Background and Journey in Data (0:47)
Joining Dagster Labs (1:41)
Synergies Between Teams (2:56)
Developer Marketing Preferences (6:06)
Bridging Technical Gaps (9:54)
Understanding Data Orchestration (11:05)
Dagster's Unique Features (16:07)
The Future of Orchestration (18:09)
Freeing Up Team Resources (20:30)
Market Readiness of the Modern Data Stack (22:20)
Career Journey into DevRel and Marketing (26:09)
Understanding Technical Audiences (29:33)
Building Trust Through Open Source (31:36)
Understanding Vendor Lock-In (34:40)
AI and Data Orchestration (36:11)
Modern Data Stack Evolution (39:09)
The Cost of AI Services (41:58)
Differentiation Through Integration (44:13)
Language and Frameworks in Orchestration (49:45)
Future of Orchestration and Closing Thoughts (51:54)
The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data. RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

From Data Engineering to Context Engineering w/ Nick Schrock

The Joe Reis Show

Nov 20, 2025 · 44:57

Data engineering is undergoing a fundamental shift. In this episode, I sit down with Nick Schrock, founder and CTO of Dagster, to discuss why he went from being an "AI moderate" to believing 90% of code will be written by AI. Being hands-on also led to a massive pivot in Dagster's roadmap and a new focus on managing and engineering context.
We dive deep into why simply feeding data to LLMs isn't enough. Nick explains why real-time context tools (like MCPs) can become "token hogs" that lack precision and why the future belongs to "context pipelines": offline, batch-computed context that is governed, versioned, and treated like code.
We also explore Compass, Dagster's new collaborative agent that lives in Slack, bridging the gap between business stakeholders and data teams. If you're wondering how your role as a data engineer will evolve in an agentic world, this conversation maps out the territory.
Dagster: dagster.io
Nick Schrock on X: @schrockn

ai data context cto slack compass data engineering schrock mcps dagster

Beyond the Dashboard: Collaborative Analytics in Slack

The Data Exchange with Ben Lorica

Oct 30, 2025 · 45:56

Nick Schrock, CTO of Dagster, discusses the critical role of data orchestration in the AI era, framing “context pipelines” as the new data pipelines that form the foundation of any AI strategy. He introduces Compass, a new Slack-native tool for collaborative, exploratory data analysis designed to replace the 80% of ad-hoc BI dashboards.

ai analytics cto slack bi collaborative compass detailed dashboard dagster

#381 | Watch It Fly By As the Pendulum Swings

Sacred Symbols: A PlayStation Podcast

Oct 20, 2025 · 277:12

Time sure is moving, but has a sufficient amount of it passed for us to start considering the next generation of PlayStation console? According to two reliable leakers, the answer is yes: Sony is very much aiming to get PS6 out in 2027. This makes sense by historical standards -- seven years between consoles is fairly normal, if not even a little slow -- but it really doesn't feel like we need this thing. Or do we? By 2027, who knows what our ecosystem might look like. Let's discuss! Plus: Legendary Tecmo director and producer Tomonobu Itagaki has sadly passed away, Sony-owned studio Bluepoint is hiring for a mysterious third-person action project, San Diego Studio appears primed to finally bring its smash-hit MLB: The Show series to PC, Ghost of Yotei is selling at parity with Ghost of Tsushima, we could have very easily gotten a mobile The Last of Us game, and more. Then: Listener inquiries! Do we like to partake in New Game+, when applicable? With Xbox's displacement as a hardware competitor, is Sony and Nintendo's long-standing rivalry primed to be reignited? Why don't we talk more about the fighting scene? Will Dustin regale us with some of his Japanese language skills?
Please keep in mind that our timestamps are approximate, and will often be slightly off due to dynamic ad placement.
0:00:00 - Intro
0:38:55 - Shoutout Saxon
0:43:38 - Eloping in Vegas
0:54:26 - Dagster check in?
0:55:44 - 今週何個の瓶を埋めたか教えてください
0:57:31 - RIP Tomonobu Itagaki
1:07:17 - PlayStation 6 in 2027
1:44:10 - Bluepoint hiring for a 3rd person melee action game
1:53:40 - Ghost of Yotei sales similar to Tsushima, Sucker Punch can only do one game at a time
2:05:41 - MLB: The Show coming to PC?
2:12:13 - Tencent pitched a mobile Last of Us
2:24:34 - PSVR2 controller for sale
2:32:23 - Fans revive ModNation Racers
2:36:49 - Quantic Dream reveals multiplayer game
2:48:20 - Remedy's FBC Firebreak is in the red
2:57:15 - Build A Rocket Boy in trouble
3:01:13 - New PS+ games
3:08:23 - What We're Playing (Ghost of Tsushima, Ghost of Yotei, Battlefield 6, Lumines Arise)
3:53:35 - Why is New Game+ late?
3:58:58 - Why isn't Nintendo competition?
4:06:34 - Video game betting
4:14:01 - What do we want from a PlayStation handheld?
4:19:03 - Why aren't we into fighting games?
Learn more about your ad choices. Visit podcastchoices.com/adchoices

Co-creator of GraphQL and Founder of Dagster Labs - Nick Schrock

Infinite Machine Learning

Aug 20, 2025 · 51:55

Nick Schrock is the founder of Dagster Labs, a data platform that helps you build, schedule, and monitor reliable data pipelines. They've raised $49M in funding from investors such as Sequoia, Index, Amplify, Slow, and 8VC. He is also the co-creator of the popular query language GraphQL.
Nick's favorite books: The Great CEO Within (Author: Matt Mochary)
(00:01) Introduction and Welcome
(00:39) The Origins of GraphQL at Facebook
(05:24) Explaining Data Orchestration in Plain English
(09:03) What Dagster Is and Why It Matters
(12:37) Assets vs. Tasks: A New Philosophy
(16:51) Balancing Open Source and Commercial Features
(22:18) Growing the Early Open Source Community
(25:26) Signals of Community Health
(27:59) Landing the First 10 Customers
(32:25) Culture Shift: From Engineering-Heavy to Go-to-Market
(37:49) Mistakes DevTool Founders Often Make
(41:21) Selective Micromanagement and Leadership Style
(44:36) Rapid Fire Round
Where to find Nick Schrock:
LinkedIn: https://www.linkedin.com/in/schrockn/
Where to find Prateek Joshi:
Newsletter: https://prateekjoshi.substack.com
Website: https://prateekj.com
LinkedIn: https://www.linkedin.com/in/prateek-joshi-infinite
X: https://x.com/prateekvjoshi

Re-air Top 3: How Lydia's former Head of Data is building the Data department at May

Data Gen

Aug 19, 2025 · 35:20

Christelle Marfaing, former Head of Data at Lydia, is now Chief Data Officer at May, the startup behind an employee benefits app (3 million euros raised in 2022). This episode is the first in a new series whose goal is to invite Heads of Data who have already built or structured a data team and are starting over at a smaller company. Today, Christelle tells us about launching the Data department at May after leading a team of 14 people at Lydia. We cover:

head pr data acast top3 aujourd ses monte chez poc visitez genai suivez laissez produit christelle chief data officers data strategy inscrivez data mesh dagster

#126 | Self-Discipline, Going All In

Constellation: Last Stand Media's Conversational Podcast

Jun 10, 2025 · 198:48

They say less is more. Col, Matty, Lock and Dagster talk about topics. How elegant.
0:00:00 - Intro
0:20:30 - Self-Discipline
1:24:20 - Going All In

col lock self discipline dagster

246: AI, Abstractions, and the Future of Data Engineering with Pete Hunt of Dagster

The Data Stack Show

May 28, 2025 · 48:59

Highlights from this week's conversation include:
Pete's Background and Journey in Data (1:36)
Evolution of Data Practices (3:02)
Integration Challenges with Acquired Companies (5:13)
Trust and Safety as a Service (8:12)
Transition to Dagster (11:26)
Value Creation in Networking (14:42)
Observability in Data Pipelines (18:44)
The Era of Big Complexity (21:38)
Abstraction as a Tool for Complexity (24:41)
Composability and Workflow Engines (28:08)
The Need for Guardrails (33:13)
AI in Development Tools (36:24)
Internal Components Marketplace (40:14)
Reimagining Data Integration (43:03)
Importance of Abstraction in Data Tools (46:17)
Parting Advice for Listeners and Closing Thoughts (48:01)
The Data Stack Show is a weekly podcast powered by RudderStack, customer data infrastructure that enables you to deliver real-time customer event data everywhere it's needed to power smarter decisions and better customer experiences. Each week, we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data. RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

The PRQL: Breaking Down Silos: Collaborative Data Engineering in the AI Era with Pete Hunt of Dagster

The Data Stack Show

May 26, 2025 · 3:20

breaking down collaborative silos data engineering dagster rudderstack pete hunt

#123 | Moms, Worst Job Experiences

Constellation: Last Stand Media's Conversational Podcast

May 20, 2025 · 213:59

Colin decided to stop paying me by the word. Chris, Dust, Col and Dagster talk about topics.
0:00:00 - Intro
0:27:02 - Moms
1:42:22 - Worst Job Experiences

experiences moms dust col worst jobs dagster

241: Marketing Meets Data: Measuring Impact and Driving Results with Pedram Navid of Dagster Labs

The Data Stack Show

May 12, 2025 · 38:42

Highlights from this week's conversation include:
Pedram's Background and Journey in Data (1:13)
Marketing vs. Data Engineering (2:30)
Understanding Marketing Pressures (4:16)
Attribution Models and Accountability (8:13)
Balancing Marketing and Team Management (12:25)
Introduction to Dagster Components (15:00)
AI Integration with Data Engineering (19:05)
Challenges in Data Support (22:05)
Self-Service Data Access (26:07)
AI in Data Management (28:25)
Organizing Data in Technical Teams (31:25)
Challenges in Real-Time Data (33:28)
Final Thoughts and Takeaways (37:01)

The PRQL: Shifting Gears: From Code to Marketing in the Data World with Pedram Navid of Dagster Labs

The Data Stack Show

May 9, 2025 · 1:40

marketing data code labs cdp shifting gears navid pedram dagster rudderstack

#121 | Horror Games and Movies, Cartoons, Favorite Rivalries

Constellation: Last Stand Media's Conversational Podcast

May 6, 2025 · 167:38

I have a beautiful wife and two awesome kids who I love very much. That being said, they have shown absolutely no interest in listening to this podcast, week in and week out, and that makes me mad. Obviously, there's only one way to handle this. I will now proceed to write really mean things about all three of them right here in this episode description! ... I mean, I would, but they would never see it anyway, so what's the point? My only real hope for revenge is for them to start their own podcast someday, so I can completely ignore it. Until then, Gene, Cog and the Dagster will continue to go unappreciated in the Moriarty household. Hey, their loss.
0:00:00 - Intro
0:05:56 - Horror Games and Movies
0:51:29 - Cartoons
1:47:15 - Favorite Rivalries

movies cartoons rivalries moriarty horror games cog dagster

#119 | Favorite Storytelling Mediums, Multimillionaire What If!

Constellation: Last Stand Media's Conversational Podcast

Apr 22, 2025 · 218:01

It's Constellation time, welcome back! Sit down, get comfy and grab a snack. Four people, three topics - like we usually do; Wait, correction! Two topics for this episode's crew. Hoeg started us off, and we couldn't stop yappin; It's not really our fault. Hey, it does tend to happen; So we subtracted a topic, as it turns out; But we'll do it next week, so there's no need to pout! Dustin's topic was also extremely fun; (With apologies to William Boyd Watterson.) The real hero this week was the Dagster, of course. Sacrificing his topic with little remorse. Behold the Dagster, with his podcasting powers; Saving this show from becoming six hours! Dear people, what else would a great man do? ... Oh yeah, Colin was also part of this week's crew.
Please keep in mind that our timestamps are approximate, and will often be slightly off due to dynamic ad placement.
0:00:00 - Intro
0:36:01 - Favorite Storytelling Mediums
1:51:23 - Multimillionaire What If!

storytelling saving dear behold sacrificing constellations mediums multi millionaire hoeg dagster

#118 | Nintendo Switch 2, Advice for Our 15-Year-Old Selves, Our Favorite Instrumental Music

Constellation: Last Stand Media's Conversational Podcast

Apr 15, 2025 · 209:54

I get the feeling that Colin thinks I don't work hard enough on these episode descriptions. I mean, I don't, but he isn't supposed to know that. Well, I had an idea: If I pad these paragraphs by just filling this space with a lot of random gibberish, Colin will think I'm really toiling away over here. Everybody wins! So, what do you guys want to talk about? My neighbor's car alarm keeps going off every five minutes, that's super aggravating. How's the weather over by you? I enjoy burritos a lot and I do like guacamole but please never put guac in my burrito, that's gross. When was water skiing invented? I want to surprise my wife by repainting our master bathroom but I know I'll choose the wrong color thereby making her very unhappy, so I better not even try. Most cheeses are good, I can't pick a favorite. I haven't read a book in like, two years. Have you ever wondered why birds and rabbits get along so well? I definitely need a new pillow, mine is so old and lumpy ... ... Wait, this is turning out to be a lot harder than just writing an actual episode description. Damn you, Colin! I give up. Jaffe, Micah, Brad and Dagster talk about topics. Foiled again!
Please keep in mind that our timestamps are approximate, and will often be slightly off due to dynamic ad placement.
0:00:00 - Intro
0:17:38 - Nintendo Switch 2
1:11:41 - Advice for Our 15-Year-Old Selves
2:06:20 - Our Favorite Instrumental Music

advice nintendo switch jaffe foiled instrumental music dagster

#117 | Wrestling, Plans that Backfired

Constellation: Last Stand Media's Conversational Podcast

Apr 8, 2025 · 240:42

Hello and welcome to Constellation! Usually, Dagan is the one who writes these descriptions, but he forgot this week! So, you have me (Dustin), and I'm not even on this episode! He's been getting a bit cheeky with them recently, so it's probably best I'm here to clean things up anyway. For this episode, Matty wants to talk about wrestling, and Dagan asks about plans we've had that have backfired on us. You may be asking, where's the third topic? This episode ended up being quite a doozy just between those two topics, so the last topic needed to be cut due to timing. So, join Dagan, Colin, Matty, and Gene for another episode. And hey, Dagster, don't forget to write the description next week!
Please keep in mind that our timestamps are approximate, and will often be slightly off due to dynamic ad placement.
0:00:00 - Intro
0:19:10 - Wrestling
1:52:14 - Plans that Backfired

wrestling plans constellations backfired dagan dagster

#115 | Toasters, The Things We're Surprised We Still Haven't Done, Puppets

Constellation: Last Stand Media's Conversational Podcast

Mar 25, 2025 · 246:06

Welcome back to another looooooong episode of Constellation! Delivering the goods like this week in and week out requires deep concentration and a sharp attention span. On this episode, Colin, Chris, Micah and Dagster talk for countless hours about a variety of fun and thoughtful topics. The only downside to this whole thing is that all our focus goes into the conversations. I'm lucky to have enough attention span left now to finish writing this video descrip
Please keep in mind that our timestamps are approximate, and will often be slightly off due to dynamic ad placement.
0:00:00 - Intro
0:50:02 - Toasters
1:49:12 - The Things We're Surprised We Still Haven't Done
3:00:49 - Puppets

delivering surprised puppets constellations toasters dagster

Data Orchestration: DataOps and MLOps. #63

Intervista Pythonista

Mar 17, 2025 · 47:40

In this episode, we dive into the world of MLOps and data orchestration with Stefano Bosisio, Senior Software Engineer at NVIDIA. Stefano shares his knowledge of popular frameworks such as Apache Beam, Kubeflow, and Dagster, highlighting their strengths and limitations. We also discuss emerging trends in DataOps and the challenges teams face in choosing the orchestration tools best suited to their needs.

nvidia stefano orchestration senior software engineer affrontiamo dataops dagster apache beam

#349 | The Last of Us, Seriously

Sacred Symbols: A PlayStation Podcast

Mar 10, 2025 · 232:00

We're on the verge of the second season of HBO's The Last of Us, and yet new questions percolate about whether we'll ever see another game in Naughty Dog's vaunted series. In a recent interview, Neil Druckmann told fans "don't bet on there being more," which is both alluringly ominous and obviously vague. That's good news for us, though, since it gives us plenty to discuss. Do we need more? Do we want more? Should we get more? Plus: Rumors are ramping up about one-time Xbox titan Gears of War migrating to PlayStation 5 later this year, while details related to Firesprite's cancelled Twisted Metal project have leaked. Also: PlayStation 2 turns 25 years old in Japan, Until Dawn Remake studio Ballistic Moon goes belly-up, a key Bungie designer joins Guerrilla, Monster Hunter Wilds smashes Capcom sales records, and more. Finally: Listener inquiries! Is it possible that Grand Theft Auto VI fails to meet either critical or commercial expectations, or perhaps both? What are our reflections on PS5 Pro now that we've had our own for some months? Should Canadians stop buying games from American publishers and developers in retaliation for US tariffs? Will Colin ever forgive his brother for continuously referring to himself as "The Dagster"?
Please keep in mind that our timestamps are approximate, and will often be slightly off due to dynamic ad placement.
Timestamps:
0:00:00 - Intro
0:14:43 - Get well soon Bishop!
0:19:09 - Squeaking chair
0:22:12 - 5 things we achieved this past week
0:30:45 - Last show's debauchery
0:39:57 - "The Dagster"
0:42:56 - Wonder Woman
0:49:23 - Twin Breaker is now on PS5!
0:51:32 - Terminator 2D: No Fate
0:52:58 - Happy Birthday PS2
1:01:52 - Layoffs at PlayStation
1:06:40 - Until Dawn Remake dev shut down
1:15:19 - PS3 update
1:17:21 - Variety's interview with Druckmann and Mazin
1:23:43 - New TLoU controller
1:25:46 - Cancelled Twisted Metal game details leak
1:31:28 - Ex-Bungie designer joins Guerrilla
1:35:50 - Forza Horizon 5 releases April 29 on PS5
1:39:29 - Call of Duty 2025 will remain on PS4
1:44:32 - Tony Hawk Pro Skater 3+4 is real
1:49:35 - Monster Hunter Wilds sells 8 million
1:50:21 - La Quimera
1:54:56 - What Are We Playing?
2:16:51 - Gears of War coming to PS5
2:33:53 - New India Hero Project games
2:38:53 - PlayStation beta program
2:45:28 - Acclaim is back
2:57:56 - What if GTA VI disappoints?
3:04:08 - Boycotting American games
3:13:23 - PS5 Pro check in
3:21:37 - Portal's lack of bluetooth
3:26:31 - Auto popping trophies
3:30:48 - Capcom and Mega Man

Trends in Data Engineering – Adrian Brudaru

DataTalks.Club

Mar 7, 2025 · 56:59

In this podcast episode, we talked with Adrian Brudaru about the past, present and future of data engineering.
About the speaker:
Adrian Brudaru studied economics in Romania but soon got bored with how creative the industry was, and chose to go instead for the more factual side. He ended up in Berlin at the age of 25 and started a role as a business analyst. At the age of 30, he had enough of startups and decided to join a corporation, but quickly found out that it did not provide the challenge he wanted. As going back to startups was not a desirable option either, he decided to postpone his decision by taking freelance work and has never looked back since. Five years later, he co-founded a company in the data space to try new things. This company is also looking to release open source tools to help democratize data engineering.
0:00 Introduction to DataTalks.Club
1:05 Discussing trends in data engineering with Adrian
2:03 Adrian's background and journey into data engineering
5:04 Growth and updates on Adrian's company, DLT Hub
9:05 Challenges and specialization in data engineering today
13:00 Opportunities for data engineers entering the field
15:00 The "Modern Data Stack" and its evolution
17:25 Emerging trends: AI integration and Iceberg technology
27:40 DuckDB and the emergence of portable, cost-effective data stacks
32:14 The rise and impact of dbt in data engineering
34:08 Alternatives to dbt: SQLMesh and others
35:25 Workflow orchestration tools: Airflow, Dagster, Prefect, and GitHub Actions
37:20 Audience questions: Career focus in data roles and AI engineering overlaps
39:00 The role of semantics in data and AI workflows
41:11 Focusing on learning concepts over tools when entering the field
45:15 Transitioning from backend to data engineering: challenges and opportunities
47:48 Current state of the data engineering job market in Europe and beyond
49:05 Introduction to Apache Iceberg, Delta, and Hudi file formats
50:40 Suitability of these formats for batch and streaming workloads
52:29 Tools for streaming: Kafka, SQS, and related trends
58:07 Building AI agents and enabling intelligent data applications
59:09 Closing discussion on the place of tools like DBT in the ecosystem

#112 | Historical Eras, AI, Bad Music

Constellation: Last Stand Media's Conversational Podcast

Mar 4, 2025 · 228:55

Time for Constellation, isn't that grand? The most awesome podcast in all the land. If you're grinding at work or enjoying vacation, you get three fun topics for conversation. This week Hoeg, Col-man, Dagster and Matty talk things over, and things get chatty. Best historical eras, bad music, Ai ... Enjoy the discussion, you're welcome, goodbye!
Please keep in mind that our timestamps are approximate, and will often be slightly off due to dynamic ad placement.
0:00:00 - Intro
0:30:55 - Historical Eras
1:19:32 - AI
2:20:31 - Bad Music

time ai historical col eras constellations bad music hoeg dagster

#111 | Fragrances, Phone Calls, World Records

Constellation: Last Stand Media's Conversational Podcast

Feb 25, 2025 · 201:08

Dwayne 'The Rock' Johnson stopped me on the street once. You can tell he was a little nervous at first, but he finally managed to pull himself together, exclaiming that he was Constellation's biggest fan. I acted surprised and we exchanged pleasantries. It quickly became awkward though, so I thanked him and glanced at my watch with feigned concern before dismissing him with a tussle of his hair. He's a good kid, and so are you. Dwayne (we're on a first name basis now) would definitely love this episode. This week, Brad inquires about our preferences for cologne and other toiletries. From soap and shampoo to deodorant and even candles, which products and fragrances best meet our personal criteria? Next, Chris and the gang commiserate about phone calls. As old-fashioned telephone conversations continue to go the way of the dodo, how do we prefer to communicate in ways that don't involve gabbin' like a couple of elderly ladies? Last, The Dagster leads a discussion about world records. If we could hold the global title for something, which something would we choose? This one's for you 'The Rock.' Enjoy! Oh whatever, in my dream he had hair.
Please keep in mind that our timestamps are approximate, and will often be slightly off due to dynamic ad placement.
0:00:00 - Intro
0:23:54 - Fragrances
1:06:39 - Phone Calls
1:51:29 - World Records

world records phone calls constellations fragrance dagster

Lessons in Data Engineering: Scaling, AI, and Open Source with Sandy Ryza

Product by Design

Feb 7, 2025 · 46:28

In this episode of Product by Design, Kyle chats with Sandy Ryza, lead engineer on the Dagster project, author, and thought leader in data engineering. Sandy shares his journey through the world of data, from building big data tools at Cloudera to working as a data scientist, product manager, and engineer, and how those experiences led him to help create Dagster, an open-source data orchestration platform.
We discuss:
The evolution of data engineering and the growing complexity of modern data pipelines.
The role of AI and unstructured data in shaping the future of data platforms.
How organizations should think about data platforms to avoid costly rework.
Best practices for managing data complexity using software engineering principles.
The future of open-source tools in data infrastructure and the push toward interoperability.
Sandy Ryza
Sandy is a lead engineer, author, and thought leader in the domain of data engineering. Sandy co-wrote “Advanced Analytics with PySpark” and “Advanced Analytics with Spark”. He led ML and data science teams at Cloudera, Remix, Clover Health, and KeepTruckin. Sandy is currently the lead engineer on the Dagster project, an open-source data orchestration platform used in MLOps, data science, IOT and analytics. Sandy is a regular speaker at data engineering and ML conferences.
Links from the Show:
Twitter: @s_RYZ
Dagster: dagster.io
Book: Advanced Analytics with Spark – O'Reilly
Podcast Recommendation: Empire (British Empire & Ottoman Empire history)
Books Sandy is Reading: The Shortest History of India, The Sun Also Rises, Werner Herzog's Autobiography
More by Kyle:
Follow Prodity on Twitter and TikTok
Follow Kyle on Twitter and TikTok
Sign up for the Prodity Newsletter for more updates.
Kyle's writing on Medium
Prodity on Medium
Like our podcast, consider Buying Us a Coffee or supporting us on Patreon

#106 | Media Binging, Natural Disasters, Domestic Squabbles

Constellation: Last Stand Media's Conversational Podcast

Jan 21, 2025 · 198:53

Constellation rings in the new year with the triumphant return of everybody's favorite Final Boss! Colin occupies the 4th chair this week as the boys discuss a range of topics. First, Dustin wants to talk about media binging. Do we prefer to consume TV and video games gradually, aka the old-fashioned way, or do we like to marathon our media from start to finish in one sitting? Next up, as the wildfires continue to blaze through Los Angeles, Brad would like to discuss our thoughts on natural disasters. How do we deal with the potential of hazardous weather and dangerous geographic events striking the places we live? Last, Dagan brings us home with a little discourse about domestic squabbles. What bugs us about the people we share our homes with, and which of our habits incense those very same folks? Dagan thought of at least six additional annoying things about living with Helene since the show was recorded, but we'll save all that for a part two. But probably not because Dagster would prefer to stay married.
Please keep in mind that our timestamps are approximate, and will often be slightly off due to dynamic ad placement.
0:00:00 - Intro
0:44:58 - Binging
1:28:01 - Natural disasters
2:07:50 - Domestic Squabbles

tv los angeles media natural domestic natural disasters constellations binging final boss squabbles dagan dagster

224: Bridging Gaps: DevRel, Marketing Synergies, and the Future of Data with Pedram Navid of Dagster Labs

The Data Stack Show

Play Episode Listen Later Jan 15, 2025 53:24

Highlights from this week's conversation include:
Pedram's Background and Journey in Data (0:47)
Joining Dagster Labs (1:41)
Synergies Between Teams (2:56)
Developer Marketing Preferences (6:06)
Bridging Technical Gaps (9:54)
Understanding Data Orchestration (11:05)
Dagster's Unique Features (16:07)
The Future of Orchestration (18:09)
Freeing Up Team Resources (20:30)
Market Readiness of the Modern Data Stack (22:20)
Career Journey into DevRel and Marketing (26:09)
Understanding Technical Audiences (29:33)
Building Trust Through Open Source (31:36)
Understanding Vendor Lock-In (34:40)
AI and Data Orchestration (36:11)
Modern Data Stack Evolution (39:09)
The Cost of AI Services (41:58)
Differentiation Through Integration (44:13)
Language and Frameworks in Orchestration (49:45)
Future of Orchestration and Closing Thoughts (51:54)
The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

The PRQL: Developer Relations, Marketing Synergies, and the Future of Data Platforms with Pedram Navid of Dagster Labs

The Data Stack Show

Play Episode Listen Later Jan 13, 2025 1:52

marketing data developers platforms labs synergy cdp navid developer relations pedram dagster rudderstack

#103 | Tear Jerkers, Competitiveness, Daylight Savings Time

Constellation: Last Stand Media's Conversational Podcast

Play Episode Listen Later Dec 31, 2024 154:19

The holidays are a time of reflection and relaxation. It's a time to kick your feet up and chill with friends and family while enjoying a nice, toasty fire as you sip on a mug of warm cocoa and cherish some much needed peace and quiet. Unless you're us. If you're us, you want to make sure to squeeze in just one more episode of Constellation before the new year. Because, hey, we love you. This week, Lord Cognito starts things off by reminding us that we should be comfortable with our sensitive sides. In fact, Cog encourages us to discuss the things that make us cry. From movies and music to memories and missing our loved ones, there isn't a dry eye in the house as the gang talks about all the things that trigger those tears. Next, Dagster discusses his competitiveness, and he challenges his podcast pals to do the same. What's the one thing that brings out that cutthroat quality inside each of us, driving us to be the best and possessing us to claim victory at any cost? Furthermore, who are the rivals that keep us on our toes? Finally, Ben shares his feelings about Daylight Savings Time. We won't give too much away, but let's put it this way: Daylight Savings Time is lucky it isn't a living, breathing thing because Ben would have murdered it in cold blood years ago. Cog and Dagan also briefly share their feelings while mainly letting Ben vent angrily. Hey Daylight Savings Time, I would stay away from the Pittsburgh area until this whole thing blows over. Just sayin.' Timestamps: 0:00:00 - Intro 0:19:55 - Tearjerkers 0:54:18 - Competitiveness 1:37:44 - Daylight Savings Learn more about your ad choices. Visit podcastchoices.com/adchoices

pittsburgh tear constellations daylight savings time competitiveness cog dagan dagster

#102 |Things We're NOT Afraid Of, Best Friends, Online Rage Bait

Constellation: Last Stand Media's Conversational Podcast

Play Episode Listen Later Dec 24, 2024 254:48

Well would you look at that? It's time for another episode of Constellation, you lucky devil! In honor of Friday the 13th, the Dagster gets things started with some conversing about the things we're definitely not afraid of. Which common fears don't bother us in the least? Next, our good buddy Gene Park leads the gang in paying some well deserved lip service to our best friends. From our closest childhood pals to the besties who still rank high in our hearts as adults, who are the favorite past and present peas in our pods? Finally, Micah engages the crew in a thoughtful discussion about online rage bait. From the content creators clearly inviting the outrage to the audience reacting with expected levels of vitriol, the gang explores this controversial strategy of social media engagement in an effort to understand the insanity. And that's that. We didn't talk about the drones. We won't give them the satisfaction, accursed robotic overlords! Timestamps: 0:00:00 - Intro 0:26:38 - Things We're NOT Afraid Of 1:25:42 - Best Friends 2:43:52 - Online Rage Bait Learn more about your ad choices. Visit podcastchoices.com/adchoices

online afraid rage best friends bait constellations dagster

#101 | Disney Remakes, Chivalry, 2024 Reflections

Constellation: Last Stand Media's Conversational Podcast

Play Episode Listen Later Dec 17, 2024 196:43

Welcome back to Constellation, the podcast that reminds you that there's so much in life to be grateful for! This week, Lockmort gets the conversation started with a discussion about the seemingly never-ending glut of Disney remakes that we've all grown accustomed to over the past decade. As the controversial 'Snow White' live-action movie now looms on the horizon, we are once again reminded of the Mouse House's tendency to recreate their older franchises instead of inventing brand new characters and stories. Do we think it stinks? Next, the Dagster wants to speak with the gang about good ol' fashioned chivalry and what these ancient codes of conduct mean to us now in our modern world. Is helping old ladies across the street and holding doors for our fellow man still relevant today, and how has chivalry evolved since the days of old-timey gentlemen and virtuous knights? Last, Matty leads the crew in their reflections of 2024. As we prepare to close out another year, how do we look back on the last 12 months, and what are we looking forward to most in 2025? Also, we know "It's a Wonderful Life" was directed by Frank Capra not Billy Wilder, we were just testing you. Know it all. Please keep in mind that our timestamps are approximate, and will often be slightly off due to dynamic ad placement. Timestamps: 0:00:00 - Intro 0:16:20 - Disney Remakes 1:13:47 - Chivalry 2:00:33 - 2024 Reflections Learn more about your ad choices. Visit podcastchoices.com/adchoices

disney reflections snow white wonderful life constellations chivalry billy wilder frank capra mouse house disney remakes dagster

From Marine Aircraft to Data Engineering: Navigating Social Media Shifts and AI Innovation with Alex Noonan

What's New In Data

Play Episode Listen Later Nov 21, 2024 39:58 Transcription Available

Discover how Alex Noonan transitioned from the flight deck of a Marine aircraft to the intricate world of data engineering. His unique journey, enriched by a stint in finance, gives us a firsthand view of the diverse backgrounds shaping the data industry. As Alex recounts his experiences, we explore the vibrant community he found on data Twitter, a realm buzzing with shared insights and collaborative spirit. However, the landscape shifted following Elon Musk's takeover of Twitter, leading to content fragmentation and a migration towards emerging platforms like Bluesky. Join us as Alex discusses how these changes have impacted the cohesion and knowledge-sharing dynamics within the data community.
Navigate the complex world of professional networking with tips from Alex, as he breaks down the strategic use of platforms like LinkedIn, Reddit, and Hacker News for data professionals. Learn how to creatively tailor your content to fit the quirks of each platform's algorithm, and prepare to engage with varied audiences. The conversation also highlights the transformative potential of AI tools in elevating data processes, reducing mundane tasks, and fostering high-value work. Discover innovations like Dagster and its role as an orchestrator, integrating key business intelligence tools to streamline the data engineer's experience. This episode is a must-listen for anyone intrigued by the evolving interplay of technology, social media, and the power of community.
Follow Alex on: LinkedIn, Twitter, Bluesky
Dagster
What's New In Data is a data thought leadership series hosted by John Kutay who leads data and products at Striim. What's New In Data hosts industry practitioners to discuss latest trends, common patterns for real world data patterns, and analytics success stories.

social media ai discover navigating elon musk reddit marine navigate shifts blue sky aircraft noonan ai innovation data engineering hacker news dagster

#95 | Reflections, Doctors, Fantasy Worlds

Constellation: Last Stand Media's Conversational Podcast

Play Episode Listen Later Nov 5, 2024 141:37

Welcome back to another episode of Constellation for your ear holes and eyeballs! As the famous saying goes, and then there were three. This week on LSM's conversational podcast, we set sail with a trio of your favorite podcasting personalities rather than the usual foursome. Will this bit of streamlining make this nearly perfect show even better? Kicking things off, Matty talks to his comrades about personal reflections. When consuming various media ranging from video games to movies, how much introspection is involved? What plays into our feelings about the entertainment we choose to spend our time with, and how often do we allow ourselves the right to form judgements honestly while also giving ourselves permission to change our minds about the things we watch, read and play? Next up, Lord Cognito wants to weigh-in on our history with doctors. From the gifted healers that make our lives better to the mediocre medical practitioners who only add to our pain and suffering, what experiences have we had with doctors? As the nature of healthcare becomes increasingly difficult to navigate for the providers and their patients, does the prescription exist for the perfect doc? Finally, the Dagster asks the gang about trading reality for our idyllic fantasy worlds. Given the chance, which fictional places from our favorite works of fiction would we choose to inhabit, and why? Thanks so much for tuning in! Learn more about your ad choices. Visit podcastchoices.com/adchoices

doctors reflections kicking constellations fantasy worlds lsm dagster

From GraphQL to Dagster Labs: How Nick Schrock Is Reinventing Data Infrastructure

Breaking Changes

Play Episode Listen Later Sep 25, 2024 53:13

In this episode of Breaking Changes, Postman Head of Product-Observability Jean Yang sits down with Nick Schrock, the co-creator of GraphQL, to dive into the fascinating journey behind GraphQL's development. They discuss how GraphQL transitioned from an internal system at Facebook to a widely adopted technology—as well as how Nick's newest venture, Dagster Labs, is revolutionizing data orchestration with asset-oriented pipelines. This conversation dives deep into the realm of data engineering and its transformative potential for businesses. Nick and Jean also share insights into the intersection of AI and software engineering, and Nick offers his perspective on responsible AI development. For more on Nick Schrock, check out the following: LinkedIn: https://www.linkedin.com/in/schrockn/ Twitter: https://twitter.com/schrockn GraphQL Website: https://graphql.org/ Follow Jean on Twitter/X @jeanqasaur. And remember, never miss an episode by subscribing to the Breaking Changes Podcast on your favorite streaming platform or Postman's YouTube Channel—just hit that bell for notifications. #BreakingChanges #data #postman #grahpql #TechLeadership #ai #podcast

ai labs reinventing postman graphql schrock data infrastructure dagster breaking changes

AI Pipelines with Maxime Armstrong and Yuhan Luo

Software Engineering Daily

Play Episode Listen Later Sep 24, 2024 44:04

LLMs are becoming more mature and accessible, and many teams are now integrating them into common business practices such as technical support bots, online real-time help, and other knowledge-base-related tasks. However, the high cost of maintaining AI teams and operating AI pipelines is becoming apparent. Maxime Armstrong and Yuhan Luo are Software Engineers at Dagster.

ai armstrong maxime software engineers pipelines software engineering daily dagster

Building Data Tooling with Sandy Ryza | Ep. 43

Podcast Ruined by a Software Engineer

Play Episode Listen Later Jun 21, 2024 72:28

Sandy Ryza is the lead engineer at Dagster Labs, the group behind the popular developer tool Dagster, which allows data practitioners to see all their data assets across their data pipelines.
Dive into topics such as going from software engineering to data science / engineering (and back), what performance issues data applications encounter, how to maintain an open-source project with thousands of developers, and much more. Hosted by Perry Tiu.
Guest links available at: https://perrytiu.com/podcast/sandy-ryza
Interested in being on the show? contact@perrytiu.com
Sponsorship enquiries: sponsor@perrytiu.com
Follow Podcast Ruined by a Software Engineer and leave a review
• Apple Podcasts: https://apple.co/3RASg8x
• Spotify: https://spoti.fi/3RBAXEw
• Youtube: https://youtube.com/@perrytiu
More Podcast Ruined by a Software Engineer
• Website: https://perrytiu.com/podcast
• Merch: https://perrytiu.com/shop
• RSS Feed: https://perrytiu.com/podcast/rss.xml
Follow Perry Tiu
• Twitter: https://twitter.com/perry_tiu
• LinkedIn: https://linkedin.com/in/perrytiu
• Instagram: https://instagram.com/doctorpoor

data dive merch software engineers tooling ryza dagster

Episode 39: The Impact of Data Science on Data Orchestration

Value Driven Data Science

Play Episode Listen Later Jun 19, 2024 39:06

One of the big promises of data science is its ability to combine multiple disparate datasets to produce value-creating insights. But this is only possible if you can get all those disparate datasets together, in the one location, to begin with. This has led to the rise of the data engineer and the data orchestration platform.
In this episode, Sandy Ryza joins Dr Genevieve Hayes to discuss the impact of the data scientist on the creation of the next generation of data orchestration tools.
Guest Bio
Sandy Ryza is a data scientist turned data engineer who is currently the lead engineer on the Dagster project, an open-source data orchestration platform used in MLOps, data science, IOT and analytics. He is also the co-author of Advanced Analytics with Spark.
Highlights
Welcome to Value Driven Data Science (00:00)
Introducing Sandy Ryza and his journey from data scientist to data engineer (01:30)
Navigating the challenges of creating consistent data definitions within teams (05:11)
The birth and development of Dagster (11:32)
Dagster: A tool designed for data scientists (20:54)
Final thoughts and advice for data scientists (37:29)
Links
Connect with Sandy on LinkedIn
Follow Sandy on X
Dagster
Connect with Genevieve on LinkedIn
Be among the first to hear about the release of each new podcast episode by signing up HERE
The post Episode 39: The Impact of Data Science on Data Orchestration first appeared on Genevieve Hayes Consulting and is written by Dr Genevieve Hayes.

navigating impact spark iot data science orchestration advanced analytics dagster

Release Management For Data Platform Services And Logic

Data Engineering Podcast

Play Episode Listen Later May 12, 2024 20:08

Summary
Building a data platform is a substantial engineering endeavor. Once it is running, the next challenge is figuring out how to address release management for all of the different component parts. The services and systems need to be kept up to date, but so does the code that controls their behavior. In this episode your host Tobias Macey reflects on his current challenges in this area and some of the factors that contribute to the complexity of the problem.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
This episode is supported by Code Comments, an original podcast from Red Hat. As someone who listens to the Data Engineering Podcast, you know that the road from tool selection to production readiness is anything but smooth or straight. In Code Comments, host Jamie Parker, Red Hatter and experienced engineer, shares the journey of technologists from across the industry and their hard-won lessons in implementing new technologies. I listened to the recent episode "Transforming Your Database" and appreciated the valuable advice on how to approach the selection and integration of new databases in applications and the impact on team dynamics. There are 3 seasons of great episodes and new ones landing everywhere you listen to podcasts. Search for "Code Comments" in your podcast player or go to dataengineeringpodcast.com/codecomments (https://www.dataengineeringpodcast.com/codecomments) today to subscribe. My thanks to the team at Code Comments for their support.
Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
Your host is Tobias Macey and today I want to talk about my experiences managing the QA and release management process of my data platform
Interview
Introduction
As a team, our overall goal is to ensure that the production environment for our data platform is highly stable and reliable. This is the foundational element of establishing and maintaining trust with the consumers of our data. In order to support this effort, we need to ensure that only changes that have been tested and verified are promoted to production.
Our current challenge is one that plagues all data teams. We want to have an environment that mirrors our production environment that is available for testing, but it's not feasible to maintain a complete duplicate of all of the production data.
Compounding that challenge is the fact that each of the components of our data platform interacts with data in slightly different ways and needs different processes for ensuring that changes are being promoted safely.
Contact Info
LinkedIn
Website (https://www.dataengineeringpodcast.com)
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning.
Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story.
Links
Data Platforms and Leaky Abstractions Episode (https://www.dataengineeringpodcast.com/abstractions-and-technical-debt-episode-374)
Building A Data Platform From Scratch (https://www.dataengineeringpodcast.com/designing-a-lakehouse-from-scratch-episode-354)
Airbyte (https://airbyte.com/)
  Podcast Episode (https://www.dataengineeringpodcast.com/airbyte-open-source-data-integration-episode-173/)
Trino (https://trino.io/)
dbt (https://www.getdbt.com/)
Starburst Galaxy (https://www.starburst.io/platform/starburst-galaxy/)
Superset (https://superset.apache.org/)
Dagster (https://dagster.io/)
LakeFS (https://lakefs.io/)
  Podcast Episode (https://www.dataengineeringpodcast.com/lakefs-data-lake-versioning-episode-157)
Nessie (https://projectnessie.org/)
  Podcast Episode (https://www.dataengineeringpodcast.com/nessie-data-lakehouse-data-versioning-episode-416)
Iceberg (https://iceberg.apache.org/)
Snowflake (https://www.snowflake.com/en/)
LocalStack (https://www.localstack.cloud/)
DSL == Domain Specific Language (https://en.wikipedia.org/wiki/Domain-specific_language)
The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)

#126 - How Lydia's former Head of Data is building the Data department at May

Data Gen

Play Episode Listen Later Apr 29, 2024 35:20

Christelle Marfaing, former Head of Data at Lydia, is now Chief Data Officer at May, the startup behind an employee benefits app (3 million euros raised in 2022).
This episode is the first in a new series that invites Heads of Data who have already built or structured a Data team and are starting again at a smaller company. Today, Christelle talks about launching the Data department at May after leading a team of 14 people at Lydia.
We cover:

head pr data acast aujourd ses monte chez poc visitez genai suivez laissez produit christelle chief data officers data strategy inscrivez data mesh dagster

High Agency Pydantic > VC Backed Frameworks — with Jason Liu of Instructor

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

Play Episode Listen Later Apr 19, 2024 52:20

We are reuniting for the 2nd AI UX demo day in SF on Apr 28. Sign up to demo here! And don't forget tickets for the AI Engineer World's Fair — for early birds who join before keynote announcements!
About a year ago there was a lot of buzz around prompt engineering techniques to force structured output. Our friend Simon Willison tweeted a bunch of tips and tricks, but the most iconic one is Riley Goodside making it a matter of life or death:
Guardrails (friend of the pod and AI Engineer speaker), Marvin (AI Engineer speaker), and jsonformer had also come out at the time. In June 2023, Jason Liu (today's guest!) open sourced his “OpenAI Function Call and Pydantic Integration Module”, now known as Instructor, which quickly turned prompt engineering black magic into a clean, developer-friendly SDK. A few months later, model providers started to add function calling capabilities to their APIs as well as structured outputs support like “JSON Mode”, which was announced at OpenAI Dev Day (see recap here). In just a handful of months, we went from threatening to kill grandmas to first-class support from the research labs. And yet, Instructor was still downloaded 150,000 times last month. Why?
What Instructor looks like
Instructor patches your LLM provider SDKs to offer a new response_model option to which you can pass a structure defined in Pydantic. It currently supports OpenAI, Anthropic, Cohere, and a long tail of models through LiteLLM.
What Instructor is for
There are three core use cases to Instructor:
* Extracting structured data: Taking an input like an image of a receipt and extracting structured data from it, such as a list of checkout items with their prices, fees, and coupon codes.
* Extracting graphs: Identifying nodes and edges in a given input to extract complex entities and their relationships. For example, extracting relationships between characters in a story or dependencies between tasks.
* Query understanding: Defining a schema for an API call and using a language model to resolve a request into a more complex one that an embedding could not handle. For example, creating date intervals from queries like “what was the latest thing that happened this week?” to then pass on to a RAG system or similar.
Jason called all these different ways of getting data from LLMs “typed responses”: taking strings and turning them into data structures.
Structured outputs as a planning tool
The first wave of agents was all about open-ended iteration and planning, with projects like AutoGPT and BabyAGI. Models would come up with a possible list of steps, and start going down the list one by one. It's really easy for them to go down the wrong branch, or get stuck on a single step with no way to intervene.
What if these planning steps were returned to us as DAGs using structured output, and then managed as workflows? This also makes it easy to better train models on how to create these plans, as they are much more structured than a bullet point list. Once you have this structure, each piece can be modified individually by different specialized models. You can read some of Jason's experiments here:
While LLMs will keep improving (Llama3 just got released as we write this), having a consistent structure for the output will make it a lot easier to swap models in and out. Jason's overall message on how we can move from ReAct loops to more controllable Agent workflows mirrors the “Process” discussion from our Elicit episode:
Watch the talk
As a bonus, here's Jason's talk from last year's AI Engineer Summit.
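The planning idea in the show notes above can be sketched with nothing but the Python standard library. This is an illustration under stated assumptions, not Instructor's API or Jason's implementation: the `Step` schema, its field names, and the hand-written JSON standing in for model output are all hypothetical. It shows a string being turned into typed steps and then ordered as a DAG.

```python
import json
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class Step:
    """One planning step. The field names are a hypothetical schema."""
    id: str
    description: str
    depends_on: list[str] = field(default_factory=list)


def parse_plan(raw: str) -> list[Step]:
    """Turn a JSON string (standing in for model output) into typed steps."""
    steps = [Step(**item) for item in json.loads(raw)]
    known = {step.id for step in steps}
    for step in steps:
        missing = set(step.depends_on) - known
        if missing:
            raise ValueError(
                f"step {step.id!r} depends on unknown steps: {sorted(missing)}"
            )
    return steps


def execution_order(steps: list[Step]) -> list[str]:
    """Return step ids in an order that respects dependencies.

    graphlib raises CycleError if the plan is not actually a DAG.
    """
    graph = {step.id: set(step.depends_on) for step in steps}
    return list(TopologicalSorter(graph).static_order())


# Hand-written example input, not real model output.
RAW_PLAN = """
[
  {"id": "fetch", "description": "Collect the source documents"},
  {"id": "summarize", "description": "Summarize each document",
   "depends_on": ["fetch"]},
  {"id": "report", "description": "Write the final report",
   "depends_on": ["summarize"]}
]
"""

plan = parse_plan(RAW_PLAN)
order = execution_order(plan)
print(order)
```

In a real workflow the JSON would come from a model constrained to the schema, and each step id would map to a task in whatever orchestrator runs the plan.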
He'll also be a speaker at this year's AI Engineer World's Fair!Timestamps* [00:00:00] Introductions* [00:02:23] Early experiments with Generative AI at StitchFix* [00:08:11] Design philosophy behind the Instructor library* [00:11:12] JSON Mode vs Function Calling* [00:12:30] Single vs parallel function calling* [00:14:00] How many functions is too many?* [00:17:39] How to evaluate function calling* [00:20:23] What is Instructor good for?* [00:22:42] The Evolution from Looping to Workflow in AI Engineering* [00:27:03] State of the AI Engineering Stack* [00:28:26] Why Instructor isn't VC backed* [00:31:15] Advice on Pursuing Open Source Projects and Consulting* [00:36:00] The Concept of High Agency and Its Importance* [00:42:44] Prompts as Code and the Structure of AI Inputs and Outputs* [00:44:20] The Emergence of AI Engineering as a Distinct FieldShow notes* Jason on the UWaterloo mafia* Jason on Twitter, LinkedIn, website* Instructor docs* Max Woolf on the potential of Structured Output* swyx on Elo vs Cost* Jason on Anthropic Function Calling* Jason on Rejections, Advice to Young People* Jason on Bad Startup Ideas* Jason on Prompts as Code* Rysana's inversion models* Bryan Bischof's episode* Hamel HusainTranscriptAlessio [00:00:00]: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, partner and CTO at Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol AI.Swyx [00:00:16]: Hello, we're back in the remote studio with Jason Liu from Instructor. Welcome Jason.Jason [00:00:21]: Hey there. Thanks for having me.Swyx [00:00:23]: Jason, you are extremely famous, so I don't know what I'm going to do introducing you, but you're one of the Waterloo clan. There's like this small cadre of you that's just completely dominating machine learning. 
Actually, can you list like Waterloo alums that you're like, you know, are just dominating and crushing it right now?Jason [00:00:39]: So like John from like Rysana is doing his inversion models, right? I know like Clive Chen from Waterloo. When I started the data science club, he was one of the guys who were like joining in and just like hanging out in the room. And now he was at Tesla working with Karpathy, now he's at OpenAI, you know.Swyx [00:00:56]: He's in my climbing club.Jason [00:00:58]: Oh, hell yeah. I haven't seen him in like six years now.Swyx [00:01:01]: To get in the social scene in San Francisco, you have to climb. So both in career and in rocks. So you started a data science club at Waterloo, we can talk about that, but then also spent five years at Stitch Fix as an MLE. You pioneered the use of OpenAI's LLMs to increase stylist efficiency. So you must have been like a very, very early user. This was like pretty early on.Jason [00:01:20]: Yeah, I mean, this was like GPT-3, okay. So we actually were using transformers at Stitch Fix before the GPT-3 model. So we were just using transformers for recommendation systems. At that time, I was very skeptical of transformers. I was like, why do we need all this infrastructure? We can just use like matrix factorization. When GPT-2 came out, I fine tuned my own GPT-2 to write like rap lyrics and I was like, okay, this is cute. Okay, I got to go back to my real job, right? Like who cares if I can write a rap lyric? When GPT-3 came out, again, I was very much like, why are we using like a post request to review every comment a person leaves? Like we can just use classical models. So I was very against language models for like the longest time. And then when ChatGPT came out, I basically just wrote a long apology letter to everyone at the company. I was like, hey guys, you know, I was very dismissive of some of this technology. I didn't think it would scale well, and I am wrong. This is incredible. 
And I immediately just transitioned to go from computer vision recommendation systems to LLMs. But funny enough, now that we have RAG, we're kind of going back to recommendation systems.Swyx [00:02:21]: Yeah, speaking of that, I think Alessio is going to bring up the next one.Alessio [00:02:23]: Yeah, I was going to say, we had Bryan Bischof from Hex on the podcast. Did you overlap at Stitch Fix?Jason [00:02:28]: Yeah, he was like one of my main users of the recommendation frameworks that I had built out at Stitch Fix.Alessio [00:02:32]: Yeah, we talked a lot about RecSys, so it makes sense.Swyx [00:02:36]: So now I have adopted that line, RAG is RecSys. And you know, if you're trying to reinvent new concepts, you should study RecSys first, because you're going to independently reinvent a lot of concepts. So your system was called Flight. It's a recommendation framework with over 80% adoption, servicing 350 million requests every day. Wasn't there something existing at Stitch Fix? Why did you have to write one from scratch?Jason [00:02:56]: No, so I think because at Stitch Fix, a lot of the machine learning engineers and data scientists were writing production code, sort of every team's systems were very bespoke. It's like, this team only needs to do like real time recommendations with small data. So they just have like a fast API app with some like pandas code. This other team has to do a lot more data. So they have some kind of like Spark job that does some batch ETL that does a recommendation. And so what happens is each team writes their code differently. And I have to come in and refactor their code. And I was like, oh man, I'm refactoring four different code bases, four different times. Wouldn't it be better if all the code quality was my fault? Let me just write this framework, force everyone else to use it. And now one person can maintain five different systems, rather than five teams having their own bespoke system. 
And so it was really a need of just sort of standardizing everything. And then once you do that, you can do observability across the entire pipeline and make large sweeping improvements in this infrastructure, right? If we notice that something is slow, we can detect it on the operator layer. Just, hey, this team, this operation you guys are doing is hurting our latency by like 30%. If you just optimize your Python code here, we can probably make an extra million dollars. So let's jump on a call and figure this out. And then a lot of it was doing all this observability work to figure out what the heck is going on and optimize this system, not only just from a code perspective, but also sort of harassing teams and saying like, we need to add caching here. We're doing duplicated work here. Let's go clean up the systems. Yep.
Swyx [00:04:22]: Got it. One more system that I'm interested in finding out more about is your similarity search system using CLIP and GPT-3 embeddings and FAISS, where you saved over $50 million in annual revenue. So of course they all gave all that to you, right?
Jason [00:04:34]: No, no, no. I mean, it's not going up and down, but you know, I got a little bit, so I'm pretty happy about that. But there, you know, that was when we were doing fine tuning like ResNets to do image classification. And so a lot of it was given an image, if we could predict the different attributes we have in the merchandising and we can predict the text embeddings of the comments, then we can kind of build an image vector or image embedding that can capture both descriptions of the clothing and sales of the clothing. And then we would use these additional vectors to augment our recommendation system. And so the recommendation system really was just around like, what are similar items? What are complementary items? What are items that you would wear in a single outfit? And being able to say on a product page, let me show you like 15, 20 more things.
And then what we found was like, hey, when you turn that on, you make a bunch of money.Swyx [00:05:23]: Yeah. So, okay. So you didn't actually use GPT-3 embeddings. You fine tuned your own? Because I was surprised that GPT-3 worked off the shelf.Jason [00:05:30]: Because I mean, at this point we would have 3 million pieces of inventory over like a billion interactions between users and clothes. So any kind of fine tuning would definitely outperform like some off the shelf model.Swyx [00:05:41]: Cool. I'm about to move on from Stitch Fix, but you know, any other like fun stories from the Stitch Fix days that you want to cover?Jason [00:05:46]: No, I think that's basically it. I mean, the biggest one really was the fact that I think for just four years, I was so bearish on language models and just NLP in general. I'm just like, none of this really works. Like, why would I spend time focusing on this? I got to go do the thing that makes money, recommendations, bounding boxes, image classification. Yeah. Now I'm like prompting an image model. I was like, oh man, I was wrong.Swyx [00:06:06]: So my Stitch Fix question would be, you know, I think you have a bit of a drip and I don't, you know, my primary wardrobe is free startup conference t-shirts. Should more technology brothers be using Stitch Fix? What's your fashion advice?Jason [00:06:19]: Oh man, I mean, I'm not a user of Stitch Fix, right? It's like, I enjoy going out and like touching things and putting things on and trying them on. Right. I think Stitch Fix is a place where you kind of go because you want the work offloaded. I really love the clothing I buy where I have to like, when I land in Japan, I'm doing like a 45 minute walk up a giant hill to find this weird denim shop. That's the stuff that really excites me. But I think the bigger thing that's really captured is this idea that narrative matters a lot to human beings. Okay. And I think the recommendation system, that's really hard to capture. 
It's easy to use AI to sell like a $20 shirt, but it's really hard for AI to sell like a $500 shirt. But people are buying $500 shirts, you know what I mean? There's definitely something that we can't really capture just yet that we probably will figure out how to in the future.
Swyx [00:07:07]: Well, it'll probably output in JSON, which is what we're going to turn to next. Then you went on a sabbatical to South Park Commons in New York, which is unusual because it's based in SF.
Jason [00:07:17]: Yeah. So basically in 2020, really, I was enjoying working a lot as I was like building a lot of stuff. This is where we were making like the tens of millions of dollars doing stuff. And then I had a hand injury. And so I really couldn't code anymore for like a year, two years. And so I kind of took sort of half of it as medical leave, the other half I became more of like a tech lead, just like making sure the systems' lights were on. And then when I went to New York, I spent some time there and kind of just like wound down the tech work, you know, did some pottery, did some jujitsu. And after GPT came out, I was like, oh, I clearly need to figure out what is going on here because something feels very magical. I don't understand it. So I spent basically like five months just prompting and playing around with stuff. And then afterwards, it was just my startup friends going like, hey, Jason, you know, my investors want us to have an AI strategy. Can you help us out? And it just snowballed more and more until I was making this my full time job. Yeah, got it.
Swyx [00:08:11]: You know, you had YouTube University and a journaling app, you know, a bunch of other explorations. But it seems like the most productive or the best known thing that came out of your time there was Instructor. Yeah.
Jason [00:08:22]: Written on the bullet train in Japan. I think at some point, you know, tools like Guardrails and Marvin came out.
Those are kind of tools that use XML and Pydantic to get structured data out. But they really were doing things sort of in the prompt. And these are built with sort of the instruct models in mind. Like I'd already done that in the past. Right. At Stitch Fix, you know, one of the things we did was we would take a request note and turn that into a JSON object that we would use to send it to our search engine. Right. So if you said like, I want, you know, skinny jeans that were this size, that would turn into JSON that we would send to our internal search APIs. But it always felt kind of gross. A lot of it is just like you read the JSON, you like parse it, you make sure the names are strings and ages are numbers and you do all this like messy stuff. But when function calling came out, it was very much sort of a new way of doing things. Right. Function calling lets you define the schema separate from the data and the instructions. And what this meant was you can kind of have a lot more complex schemas and just map them in Pydantic. And then you can just keep those very separate. And then once you add like methods, you can add validators and all that kind of stuff. The one thing I really had with a lot of these libraries, though, was it was doing a lot of the string formatting themselves, which was fine when it was the instruct models. You just have a string. But when you have these new chat models, you have these chat messages. And I just didn't really feel like not being able to access that for the developer was sort of a good benefit that they would get. And so I just said, let me write like the most simple SDK around the OpenAI SDK, a simple wrapper on the SDK, just handle the response model a bit and kind of think of myself more like requests than an actual framework that people can use. And so the goal is like, hey, like this is something that you can use to build your own framework. But let me just do all the boring stuff that nobody really wants to do.
People want to build their own frameworks, but people don't want to build like JSON parsing.Swyx [00:10:08]: And the retrying and all that other stuff.Jason [00:10:10]: Yeah.Swyx [00:10:11]: Right. We had this a little bit of this discussion before the show, but like that design principle of going for being requests rather than being Django. Yeah. So what inspires you there? This has come from a lot of prior pain. Are there other open source projects that inspired your philosophy here? Yeah.Jason [00:10:25]: I mean, I think it would be requests, right? Like, I think it is just the obvious thing you install. If you were going to go make HTTP requests in Python, you would obviously import requests. Maybe if you want to do more async work, there's like future tools, but you don't really even think about installing it. And when you do install it, you don't think of it as like, oh, this is a requests app. Right? Like, no, this is just Python. The bigger question is, like, a lot of people ask questions like, oh, why isn't requests like in the standard library? Yeah. That's how I want my library to feel, right? It's like, oh, if you're going to use the LLM SDKs, you're obviously going to install instructor. And then I think the second question would be like, oh, like, how come instructor doesn't just go into OpenAI, go into Anthropic? Like, if that's the conversation we're having, like, that's where I feel like I've succeeded. Yeah. It's like, yeah, so standard, you may as well just have it in the base libraries.Alessio [00:11:12]: And the shape of the request stayed the same, but initially function calling was maybe equal structure outputs for a lot of people. I think now the models also support like JSON mode and some of these things and, you know, return JSON or my grandma is going to die. All of that stuff is maybe to decide how have you seen that evolution? Like maybe what's the metagame today? 
Should people just forget about function calling for structure outputs or when is structure output like JSON mode the best versus not? We'd love to get any thoughts given that you do this every day.Jason [00:11:42]: Yeah, I would almost say these are like different implementations of like the real thing we care about is the fact that now we have typed responses to language models. And because we have that type response, my IDE is a little bit happier. I get autocomplete. If I'm using the response wrong, there's a little red squiggly line. Like those are the things I care about in terms of whether or not like JSON mode is better. I usually think it's almost worse unless you want to spend less money on like the prompt tokens that the function call represents, primarily because with JSON mode, you don't actually specify the schema. So sure, like JSON load works, but really, I care a lot more than just the fact that it is JSON, right? I think function calling gives you a tool to specify the fact like, okay, this is a list of objects that I want and each object has a name or an age and I want the age to be above zero and I want to make sure it's parsed correctly. That's where kind of function calling really shines.Alessio [00:12:30]: Any thoughts on single versus parallel function calling? So I did a presentation at our AI in Action Discord channel, and obviously showcase instructor. One of the big things that we have before with single function calling is like when you're trying to extract lists, you have to make these funky like properties that are lists to then actually return all the objects. How do you see the hack being put on the developer's plate versus like more of this stuff just getting better in the model? And I know you tweeted recently about Anthropic, for example, you know, some lists are not lists or strings and there's like all of these discrepancies.Jason [00:13:04]: I almost would prefer it if it was always a single function call. 
Obviously, there is like the agents workflows that, you know, Instructor doesn't really support that well, but are things that, you know, ought to be done, right? Like you could define, I think maybe like 50 or 60 different functions in a single API call. And, you know, if it was like get the weather or turn the lights on or do something else, it makes a lot of sense to have these parallel function calls. But in terms of an extraction workflow, I definitely think it's probably more helpful to have everything be a single schema, right? Just because you can sort of specify relationships between these entities that you can't do in a parallel function calling, you can have a single chain of thought before you generate a list of results. Like there's like small like API differences, right? Where if it's for parallel function calling, if you do one, like again, really, I really care about how the SDK looks and says, okay, do I always return a list of functions or do you just want to have the actual object back out and you want to have like auto complete over that object? Interesting.Alessio [00:14:00]: What's kind of the cap for like how many function definitions you can put in where it still works well? Do you have any sense on that?Jason [00:14:07]: I mean, for the most part, I haven't really had a need to do anything that's more than six or seven different functions. I think in the documentation, they support way more. I don't even know if there's any good evals that have over like two dozen function calls. I think if you're running into issues where you have like 20 or 50 or 60 function calls, I think you're much better having those specifications saved in a vector database and then have them be retrieved, right? So if there are 30 tools, like you should basically be like ranking them and then using the top K to do selection a little bit better rather than just like shoving like 60 functions into a single. Yeah.Swyx [00:14:40]: Yeah. 
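The retrieve-then-select idea Jason describes can be sketched in a few lines. This is a minimal illustration, not how Instructor or any provider does it: the tool names, the three-dimensional vectors, and the helper `top_k_tools` are all hypothetical stand-ins, and a real system would get the vectors from an embedding model and store them in a vector database.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k_tools(query_vec, tool_vecs, k=3):
    """Rank tool descriptions against the query and keep the k best names."""
    scored = sorted(
        tool_vecs.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in scored[:k]]

# Hypothetical toy embeddings of tool descriptions (assumption, not real data).
TOOL_VECS = {
    "get_weather": [0.9, 0.1, 0.0],
    "turn_lights_on": [0.1, 0.9, 0.0],
    "send_email": [0.0, 0.1, 0.9],
    "get_forecast": [0.8, 0.2, 0.1],
}

# Hypothetical embedding of a weather-like user request.
selected = top_k_tools([1.0, 0.0, 0.0], TOOL_VECS, k=2)
print(selected)
```

Only the names in `selected` would then be passed as function definitions in the API call, instead of all of the tools at once.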
Well, I mean, so I think this is relevant now because previously I think context limits prevented you from having more than a dozen tools anyway. And now that we have million token context windows, you know, Claude recently with their new function calling release said they can handle over 250 tools, which is insane to me. That's, that's a lot. You're saying like, you know, you don't think there's many people doing that. I think anyone with a sort of agent-like platform where you have a bunch of connectors, they would run into that problem. Probably you're right that they should use a vector database and kind of RAG their tools. I know Zapier has like a few thousand, like 8,000, 9,000 connectors that, you know, obviously don't fit anywhere. So yeah, I mean, I think that would be it unless you need some kind of intelligence that chains things together, which is, I think what Alessio is coming back to, right? Like there's this trend about parallel function calling. I don't know what I think about that. Anthropic's version was, I think they use multiple tools in sequence, but they're not in parallel. I haven't explored this at all. I'm just like throwing this open to you as to like, what do you think about all these new things? Yeah.
Jason [00:15:40]: It's like, you know, do we assume that all function calls could happen in any order? In which case, like we either can assume that, or we can assume that like things need to happen in some kind of sequence as a DAG, right? But if it's a DAG, really that's just like one JSON object that is the entire DAG rather than going like, okay, the order of the functions that return doesn't matter. That's definitely just not true in practice, right? Like if I have a thing that's like turn the lights on, like unplug the power, and then like turn the toaster on or something, like the order does matter. And it's unclear how well you can describe the importance of that reasoning to a language model yet.
I mean, I'm sure you can do it with like good enough prompting, but I just haven't any use cases where the function sequence really matters. Yeah.Alessio [00:16:18]: To me, the most interesting thing is the models are better at picking than your ranking is usually. Like I'm incubating a company around system integration. For example, with one system, there are like 780 endpoints. And if you're actually trying to do vector similarity, it's not that good because the people that wrote the specs didn't have in mind making them like semantically apart. You know, they're kind of like, oh, create this, create this, create this. Versus when you give it to a model, like in Opus, you put them all, it's quite good at picking which ones you should actually run. And I'm curious to see if the model providers actually care about some of those workflows or if the agent companies are actually going to build very good rankers to kind of fill that gap.Jason [00:16:58]: Yeah. My money is on the rankers because you can do those so easily, right? You could just say, well, given the embeddings of my search query and the embeddings of the description, I can just train XGBoost and just make sure that I have very high like MRR, which is like mean reciprocal rank. And so the only objective is to make sure that the tools you use are in the top end filtered. Like that feels super straightforward and you don't have to actually figure out how to fine tune a language model to do tool selection anymore. Yeah. I definitely think that's the case because for the most part, I imagine you either have like less than three tools or more than a thousand. I don't know what kind of company said, oh, thank God we only have like 185 tools and this works perfectly, right? 
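For readers unfamiliar with the metric Jason mentions, mean reciprocal rank averages 1/rank of the correct item over a set of queries, counting a query as 0 when the correct item never appears. A minimal sketch follows; the rankings and tool names are made up for illustration, and in his setup the rankings would come from a trained ranker such as XGBoost rather than being hard-coded.

```python
def mean_reciprocal_rank(ranked_lists, relevant):
    """Average of 1/rank of the correct tool per query (0 if it is missing)."""
    if not ranked_lists:
        return 0.0
    total = 0.0
    for ranking, target in zip(ranked_lists, relevant):
        for position, tool in enumerate(ranking, start=1):
            if tool == target:
                total += 1.0 / position
                break
    return total / len(ranked_lists)

# Hypothetical ranker output for three queries (assumption, not real data).
rankings = [
    ["get_weather", "send_email"],                    # correct tool at rank 1
    ["send_email", "turn_lights_on", "get_weather"],  # correct tool at rank 2
    ["send_email", "get_weather"],                    # correct tool missing
]
correct = ["get_weather", "turn_lights_on", "book_flight"]

mrr = mean_reciprocal_rank(rankings, correct)
print(mrr)
```

A high MRR means the right tool tends to sit near the top of the ranking, which is what matters if only the top few tools are handed to the model.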
That's right.Alessio [00:17:39]: And before we maybe move on just from this, it was interesting to me, you retweeted this thing about Anthropic function calling and it was Joshua Brown's retweeting some benchmark that it's like, oh my God, Anthropic function calling so good. And then you retweeted it and then you tweeted it later and it's like, it's actually not that good. What's your flow? How do you actually test these things? Because obviously the benchmarks are lying, right? Because the benchmarks say it's good and you said it's bad and I trust you more than the benchmark. How do you think about that? And then how do you evolve it over time?Jason [00:18:09]: It's mostly just client data. I actually have been mostly busy with enough client work that I haven't been able to reproduce public benchmarks. And so I can't even share some of the results in Anthropic. I would just say like in production, we have some pretty interesting schemas where it's like iteratively building lists where we're doing like updates of lists, like we're doing in place updates. So like upserts and inserts. And in those situations we're like, oh yeah, we have a bunch of different parsing errors. Numbers are being returned to strings. We were expecting lists of objects, but we're getting strings that are like the strings of JSON, right? So we had to call JSON parse on individual elements. Overall, I'm like super happy with the Anthropic models compared to the OpenAI models. Sonnet is very cost effective. Haiku is in function calling, it's actually better, but I think they just had to sort of file down the edges a little bit where like our tests pass, but then we actually deployed a production. We got half a percent of traffic having issues where if you ask for JSON, it'll try to talk to you. Or if you use function calling, you know, we'll have like a parse error. And so I think that definitely gonna be things that are fixed in like the upcoming weeks. 
But in terms of like the reasoning capabilities, man, it's hard to beat like 70% cost reduction, especially when you're building consumer applications, right? If you're building something for consultants or private equity, like you're charging $400, it doesn't really matter if it's a dollar or $2. But for consumer apps, it makes products viable. If you can go from GPT-4 to Sonnet, you might actually be able to price it better. Yeah.
Swyx [00:19:31]: I had this chart about the ELO versus the cost of all the models. And you could put trend graphs on each of those things about like, you know, higher ELO equals higher cost, except for Haiku. Haiku kind of just broke the lines, or the ISO ELOs, if you want to call it. Cool. Before we go too far into your opinions on just the overall ecosystem, I want to make sure that we map out the surface area of Instructor. I would say that most people would be familiar with Instructor from your talks and your tweets and all that. You had the number one talk from the AI Engineer Summit.
Jason [00:20:03]: Two Liu. Jason Liu and Jerry Liu. Yeah.
Swyx [00:20:06]: Yeah. Until I actually went through your cookbook, I didn't realize the surface area. How would you categorize the use cases? You have LLM self-critique, you have knowledge graphs in here, you have PII data sanitation. How do you characterize to people what is the surface area of Instructor? Yeah.
Jason [00:20:23]: This is the part that feels crazy because really the difference is LLMs give you strings and Instructor gives you data structures. And once you get data structures, again, you can do every LeetCode problem you ever thought of. Right. And so I think there's a couple of really common applications. The first one obviously is extracting structured data. This would just be, okay, well, like I want to put in an image of a receipt. I want it to give back out a list of checkout items with a price and a fee and a coupon code or whatever. That's one application.
Another application really is around extracting graphs out. So one of the things we found out about these language models is that not only can you define nodes, it's really good at figuring out what are nodes and what are edges. And so we have a bunch of examples where, you know, not only do I extract that, you know, this happens after that, but also like, okay, these two are dependencies of another task. And you can do, you know, extracting complex entities that have relationships. Given a story, for example, you could extract relationships of families across different characters. This can all be done by defining a graph. The last really big application really is just around query understanding. The idea is that like any API call has some schema and if you can define that schema ahead of time, you can use a language model to resolve a request into a much more complex request. One that an embedding could not do. So for example, I have a really popular post called like rag is more than embeddings. And effectively, you know, if I have a question like this, what was the latest thing that happened this week? That embeds to nothing, right? But really like that query should just be like select all data where the date time is between today and today minus seven days, right? What if I said, how did my writing change between this month and last month? Again, embeddings would do nothing. But really, if you could do like a group by over the month and a summarize, then you could again like do something much more interesting. And so this really just calls out the fact that embeddings really is kind of like the lowest hanging fruit. And using something like instructor can really help produce a data structure. And then you can just use your computer science and reason about the data structure. Maybe you say, okay, well, I'm going to produce a graph where I want to group by each month and then summarize them jointly. You can do that if you know how to define this data structure. 
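The "what happened this week" example can be made concrete with a small data structure. The sketch below uses a standard-library dataclass as a stand-in for the Pydantic models Instructor works with, and the JSON string stands in for what a language model might return; `SearchQuery`, `parse_query`, the field names, and the documents are all hypothetical.

```python
import json
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class SearchQuery:
    """Structured form of a user question: rewritten text plus a date filter."""
    rewritten_query: str
    start_date: Optional[date] = None
    end_date: Optional[date] = None

    def __post_init__(self):
        # Validation step: reject a range the model got backwards.
        if self.start_date and self.end_date and self.start_date > self.end_date:
            raise ValueError("start_date must not be after end_date")

    def matches(self, doc_date: date) -> bool:
        """True when a document date falls inside the optional range."""
        if self.start_date and doc_date < self.start_date:
            return False
        if self.end_date and doc_date > self.end_date:
            return False
        return True

def parse_query(raw_json: str) -> SearchQuery:
    """Turn the model's JSON output into a typed SearchQuery."""
    data = json.loads(raw_json)
    start = data.get("start_date")
    end = data.get("end_date")
    return SearchQuery(
        rewritten_query=data["rewritten_query"],
        start_date=date.fromisoformat(start) if start else None,
        end_date=date.fromisoformat(end) if end else None,
    )

# Hypothetical model output for "what was the latest thing that happened this
# week?", assuming today is 2024-04-12 (assumption for the example).
raw = '{"rewritten_query": "latest events", "start_date": "2024-04-05", "end_date": "2024-04-12"}'
query = parse_query(raw)

docs = [("release notes", date(2024, 4, 10)), ("old post", date(2024, 3, 1))]
hits = [title for title, when in docs if query.matches(when)]
print(hits)
```

Once the request is a typed object like this, the date filter is ordinary code; the embedding search only has to handle `rewritten_query`.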
Yeah.Swyx [00:22:29]: So you kind of run up against like the LangChains of the world that used to have that. They still do have like the self querying, I think they used to call it when we had Harrison on in our episode. How do you see yourself interacting with the other LLM frameworks in the ecosystem? Yeah.Jason [00:22:42]: I mean, if they use instructor, I think that's totally cool. Again, it's like, it's just Python, right? It's like asking like, oh, how does like Django interact with requests? Well, you just might make a request.get in a Django app, right? But no one would say, I like went off of Django because I'm using requests now. They should be ideally like sort of the wrong comparison in terms of especially like the agent workflows. I think the real goal for me is to go down like the LLM compiler route, which is instead of doing like a react type reasoning loop. I think my belief is that we should be using like workflows. If we do this, then we always have a request and a complete workflow. We can fine tune a model that has a better workflow. Whereas it's hard to think about like, how do you fine tune a better react loop? Yeah. You always train it to have less looping, in which case like you wanted to get the right answer the first time, in which case it was a workflow to begin with, right?Swyx [00:23:31]: Can you define workflow? Because I used to work at a workflow company, but I'm not sure this is a good term for everybody.Jason [00:23:36]: I'm thinking workflow in terms of like the prefect Zapier workflow. Like I want to build a DAG, I want you to tell me what the nodes and edges are. And then maybe the edges are also put in with AI. But the idea is that like, I want to be able to present you the entire plan and then ask you to fix things as I execute it, rather than going like, hey, I couldn't parse the JSON, so I'm going to try again. I couldn't parse the JSON, I'm going to try again. 
And then next thing you know, you spent like $2 on OpenAI credits, right? Yeah. Whereas with the plan, you can just say, oh, the edge between node like X and Y does not run. Let me just iteratively try to fix that, fix the one that sticks, go on to the next component. And obviously you can get into a world where if you have enough examples of the nodes X and Y, maybe you can use like a vector database to find good few shot examples. You can do a lot if you sort of break down the problem into that workflow and executing that workflow, rather than looping and hoping the reasoning is good enough to generate the correct output. Yeah.
Swyx [00:24:35]: You know, I've been hammering on Devin a lot. I got access a couple of weeks ago. And obviously for simple tasks, it does well. For the complicated, like more than 10, 20 hour tasks, I can see- That's a crazy comparison.
Jason [00:24:47]: We used to talk about like three, four loops. Only once it gets to like hour tasks, it's hard.
Swyx [00:24:54]: Yeah. Less than an hour, there's nothing.
Jason [00:24:57]: That's crazy.
Swyx [00:24:58]: I mean, okay. Maybe my goalposts have shifted. I don't know. That's incredible.
Jason [00:25:02]: Yeah. No, no. I'm like sub one minute executions. Like the fact that you're talking about 10 hours is incredible.
Swyx [00:25:08]: I think it's a spectrum. I think I'm going to say this every single time I bring up Devin. Let's not reward them for taking longer to do things. Do you know what I mean? I think that's a metric that is easily abusable.
Jason [00:25:18]: Sure. Yeah. You know what I mean? But I think if you can monotonically increase the success probability over an hour, that's winning to me. Right? Like obviously if you run an hour and you've made no progress. Like I think when we were in like AutoGPT land, there was that one example where it's like, I wanted it to like buy me a bicycle overnight. I spent $7 on credit and I never found the bicycle. Yeah.
Swyx [00:25:41]: Yeah. Right.
I wonder if you'll be able to purchase a bicycle. Because it actually can do things in real world. It just needs to suspend to you for auth and stuff. The point I was trying to make was that I can see it turning plans. I think one of the agents loopholes or one of the things that is a real barrier for agents is LLMs really like to get stuck into a lane. And you know what you're talking about, what I've seen Devin do is it gets stuck in a lane and it will just kind of change plans based on the performance of the plan itself. And it's kind of cool.
Jason [00:26:05]: I feel like we've gone too much in the looping route and I think a lot of more plans and like DAGs and data structures are probably going to come back to help fill in some holes. Yeah.
Alessio [00:26:14]: What do you think of the interface to that? Do you see it's like an existing state machine kind of thing that connects to the LLMs, the traditional DAG players? Do you think we need something new for like AI DAGs?
Jason [00:26:25]: Yeah. I mean, I think that the hard part is going to be describing visually the fact that this DAG can also change over time and it should still be allowed to be fuzzy. I think in like mathematics, we have like plate diagrams and like Markov chain diagrams and like recurrent states and all that. Some of that might come into this workflow world. But to be honest, I'm not too sure. I think right now, the first steps are just how do we take this DAG idea and break it down to modular components that we can like prompt better, have few shot examples for and ultimately like fine tune against. But in terms of even the UI, it's hard to say what will likely win. I think, you know, people like Prefect and Zapier have a pretty good shot at doing a good job.
Swyx [00:27:03]: Yeah. You seem to use Prefect a lot. I actually worked at a Prefect competitor at Temporal and I'm also very familiar with Dagster.
What else would you call out as like particularly interesting in the AI engineering stack?
Jason [00:27:13]: Man, I almost use nothing. I just use Cursor and like pytest. Okay. I think that's basically it. You know, a lot of the observability companies have... The more observability companies I've tried, the more I just use Postgres.
Swyx [00:27:29]: Really? Okay. Postgres for observability?
Jason [00:27:32]: But the issue really is the fact that these observability companies aren't actually doing observability for the system. It's just doing the LLM thing. Like I still end up using like Datadog or like, you know, Sentry to do like latency. And so I just have those systems handle it. And then the like prompt in, prompt out, latency, token costs. I just put that in like a Postgres table now.
Swyx [00:27:51]: So you don't need like 20 funded startups building LLM ops? Yeah.
Jason [00:27:55]: But I'm also like an old, tired guy. You know what I mean? Like I think because of my background, it's like, yeah, like the Python stuff, I'll write myself. But you know, I will also just use Vercel happily. Yeah. Yeah. So I'm not really into that world of tooling, whereas I think, you know, I spent three good years building observability tools for recommendation systems. And I was like, oh, compared to that, Instructor is just one call. I just have to put time start, time end, and then count the prompt tokens, right? Because I'm not doing a very complex looping behavior. I'm doing mostly workflows and extraction. Yeah.
Swyx [00:28:26]: I mean, while we're on this topic, we'll just kind of get this out of the way. You famously have decided to not be a venture backed company. You want to do the consulting route. The obvious route for someone as successful as Instructor is like, oh, here's hosted Instructor with all tooling. Yeah. You just said you had a whole bunch of experience building observability tooling. You have the perfect background to do this and you're not.
Jason [00:28:43]: Yeah.
Isn't that sick? I think that's sick.Swyx [00:28:44]: I mean, I know why, because you want to go free dive.Jason [00:28:47]: Yeah. Yeah. Because I think there's two things. Right. Well, one, if I tell myself I want to build requests, requests is not a venture backed startup. Right. I mean, one could argue whether or not Postman is, but I think for the most part, it's like having worked so much, I'm more interested in looking at how systems are being applied and just having access to the most interesting data. And I think I can do that more through a consulting business where I can come in and go, oh, you want to build perfect memory. You want to build an agent. You want to build like automations over construction or like insurance and supply chain, or like you want to handle writing private equity, mergers and acquisitions reports based off of user interviews. Those things are super fun. Whereas like maintaining the library, I think is mostly just kind of like a utility that I try to keep up, especially because if it's not venture backed, I have no reason to sort of go down the route of like trying to get a thousand integrations. In my mind, I just go like, okay, 98% of the people use open AI. I'll support that. And if someone contributes another platform, that's great. I'll merge it in. Yeah.Swyx [00:29:45]: I mean, you only added Anthropic support this year. Yeah.Jason [00:29:47]: Yeah. You couldn't even get an API key until like this year, right? That's true. Okay. If I add it like last year, I was trying to like double the code base to service, you know, half a percent of all downloads.Swyx [00:29:58]: Do you think the market share will shift a lot now that Anthropic has like a very, very competitive offering?Jason [00:30:02]: I think it's still hard to get API access. 
I don't know if it's fully GA now, if it's GA, if you can get a commercial access really easily.
Alessio [00:30:12]: I got commercial after like two weeks to reach out to their sales team.
Jason [00:30:14]: Okay.
Alessio [00:30:15]: Yeah.
Swyx [00:30:16]: Two weeks. It's not too bad. There's a call list here. And then anytime you run into rate limits, just like ping one of the Anthropic staff members.
Jason [00:30:21]: Yeah. Then maybe we need to like cut that part out. So I don't need to like, you know, spread false news.
Swyx [00:30:25]: No, it's cool. It's cool.
Jason [00:30:26]: But it's a common question. Yeah. Surely just from the price perspective, it's going to make a lot of sense. Like if you are a business, you should totally consider like Sonnet, right? Like the cost savings is just going to justify it if you actually are doing things at volume. And yeah, I think the SDK is like pretty good. Back to the Instructor thing. I just don't think it's a billion dollar company. And I think if I raise money, the first question is going to be like, how are you going to get a billion dollar company? And I would just go like, man, like if I make a million dollars as a consultant, I'm super happy. I'm like more than ecstatic. I can have like a small staff of like three people. It's fun. And I think a lot of my happiest founder friends are those who like raised a tiny seed round, became profitable. They're making like 60, 70,000 MRR and they're like, we don't even need to raise the seed round. Let's just keep it like between me and my co-founder, we'll go traveling and it'll be a great time. I think it's a lot of fun.
Alessio [00:31:15]: Yeah.
like say LLMs / AI and they build some open source stuff and it's like, I should just raise money and do this. And I tell people a lot, it's like, look, you can make a lot more money doing something else than doing a startup. Like most people that do a company could make a lot more money just working somewhere else than the company itself. Do you have any advice for folks that are maybe in a similar situation? They're trying to decide, oh, should I stay in my like high paid FAANG job and just tweet this on the side and do this on GitHub? Should I go be a consultant? Like being a consultant seems like a lot of work, so you got to talk to all these people, you know. There's a lot to unpack.
Jason [00:31:54]: I think the open source thing is just like, well, I'm just doing it purely for fun and I'm doing it because I think I'm right. But part of being right is the fact that it's not a venture backed startup. Like I think I'm right because this is all you need, right? So I think a part of the philosophy is the fact that all you need is a very sharp blade to sort of do your work and you don't actually need to build like a big enterprise. So that's one thing. I think the other thing too that I've kind of been thinking around, just because I have a lot of friends at Google that want to leave right now, it's like, man, like what we lack is not money or skill, like what we lack is courage. You should like, you just have to do this hard thing and you have to do it scared anyways, right? In terms of like whether or not you do want to be a founder, I think that's just a matter of optionality. But I definitely recognize that the like expected value of being a founder is still quite low. It is, right? I know as many founder breakups as I know friends who raised a seed round this year, right? Like that is like the reality. And like, you know, even from that perspective, it's been tough where it's like, oh man, like a lot of incubators want you to have co-founders. Now you spend half the time like fundraising and then
trying to like meet co-founders and find co-founders rather than building the thing. This is a lot of time spent out doing, uh, things I'm not really good at. I do think there's a rising trend in solo founding. Yeah.
Swyx [00:33:06]: You know, I am a solo. I think that something like 30 percent of, like I forget what the exact stat is, something like 30 percent of startups that make it to like Series B or something actually are solo founder. I feel like this must-have-co-founder idea mostly comes from YC and most everyone else copies it, and then plenty of companies break up over co-founder...
Jason [00:33:27]: Yeah, and I bet it would be like, I wonder how much of it is the people who don't have that much like, and I hope this is not a diss to anybody, but it's like you sort of, you go through the incubator route because you don't have like the social equity you would need to just sort of like send an email to Sequoia and be like, hey, I'm going on this ride, you want a ticket on the rocket ship? Right? Like that's very hard to sell. My message if I was to raise money is like, you've seen my Twitter, my life is sick, I've decided to make it much worse by being a founder because this is something I have to do, so do you want to come along? Otherwise I want to fund it myself. Like if I can't say that, like I don't need the money because I can like handle payroll and like hire an intern and get an assistant, like that's all fine. But I really don't want to go back to Meta. I want to like get two years to like try to find a problem we're solving. That feels like a bad time.
Alessio [00:34:12]: Yeah, Jason is like, I wear a YSL jacket on stage at AI Engineer Summit, I don't need your accelerator money.
Jason [00:34:18]: And boots, you don't forget the boots.
But I think that is a part of it, right? I think it is just like optionality, and also just like, I'm a lot older now. I think 22 year old Jason would have been probably too scared, and now I'm like too wise. But I think it's a matter of like, oh, if you raise money, you have to have a plan of spending it, and I'm just not that creative with spending that much money. Yeah. I mean, to be clear, you just celebrated your 30th birthday. Happy birthday. Yeah, it's awesome. So next week. A lot older is relative to some of the folks. I think seeing on the career tips
Alessio [00:34:48]: I think Swyx had a great post about are you too old to get into AI. I saw one of your tweets in January '23, you applied to like Figma, Notion, Cohere, Anthropic, and all of them rejected you because you didn't have enough LLM experience. I think at that time it would be easy for a lot of people to say, oh, I kind of missed the boat, you know, I'm too late, not gonna make it. You know, any advice for people that feel like that?
Jason [00:35:14]: Like the biggest learning here is actually from a lot of folks in jiu-jitsu. They're like, oh man, like is it too late to start jiu-jitsu? Like I'll join jiu-jitsu once I get in more shape, right? It's like there's a lot of like excuses, and then you say, oh, like why should I start now? I'll be like 45 by the time I'm any good. And say, well, you'll be 45 anyways. Like time is passing. Like if you don't start now, you start tomorrow, you're just like one more day behind. If you're worried about being behind, like today is like the soonest you can start, right? And so you got to recognize that like maybe you just don't want it, and that's fine too. Like if you wanted it, you would have started. I think a lot of these people again probably think of things on a too short time horizon. But again, you know, you're gonna be old anyways, you may as well just start now, you know.
Swyx [00:35:55]: One more thing on, I guess, the, um, career advice slash sort of blogging. You always go viral for this post that you wrote on
advice to young people and the lies you tell yourself. Oh yeah, yeah. You said you were writing it for your sister.
Jason [00:36:05]: She was like bummed out about going to college and like stressing about jobs, and I was like, oh, and I really want to hear, okay, and I just kind of like text-to-sweep the whole thing. It's crazy. It's got like 50,000 views, like I'm mind... I mean, your average tweet has more, but that thing is like a 30-minute read now.
Swyx [00:36:26]: So there's lots of stuff here which I agree with. I, you know, I also occasionally indulge in the sort of life reflection phase. There's the how to be lucky, there's the how to have high agency. I feel like the agency thing is always a trend in SF or just in tech circles. How do you define having high agency?
Jason [00:36:42]: I'm almost like past the high agency phase now. Now my biggest concern is like, okay, the agency is just like the norm of the vector. What also matters is the direction, right? It's like, how pure is the shot? Yeah. I mean, I think agency is just a matter of like having courage and doing the thing that's scary, right? You know, if people want to go rock climbing, it's like, do you decide you want to go rock climbing, then you show up to the gym, you rent some shoes, and you just fall 40 times? Or do you go like, oh, like I'm actually more intelligent, let me go research the kind of shoes that I want. Okay, like there's flatter shoes and more inclined shoes, like which one should I get? Okay, let me go order the shoes on Amazon. I'll come back in three days. Like, oh, it's a little bit too tight, maybe it's too aggressive, I'm only a beginner, let me go change. No, I think the higher agency person just like goes and like falls down 20 times, right? Yeah. I think the higher agency person is more focused on like process metrics versus outcome metrics, right? Like from pottery, like one thing I learned was if you want to be good at pottery, you shouldn't count like the number of cups or bowls you make. You should just weigh the amount of clay you
use, right? Like the successful person says, oh, I went through 100 pounds of clay, right? The less agency person was like, oh, I've made six cups, and then after I made six cups, like there's not really, what are you, what do you do next? No, just pounds of clay, pounds of clay. Same with the work here, right? So you just got to write the tweets, like make the commits, contribute open source, like write the documentation. There's no real outcome, it's just a process, and if you love that process, you just get really good at the thing you're doing.
Swyx [00:38:04]: Yeah, so just to push back on this, because obviously I mostly agree, how would you design performance review systems? Because you were effectively saying we can count lines of code for developers, right?
Jason [00:38:15]: I don't think that would be the actual, like I think if you make that an outcome, like I can just expand a for loop, right? I think, okay, so for performance review, this is interesting because I've mostly thought of it from the perspective of science and not engineering. I've been running a lot of engineering stand-ups, primarily because there's not really that many machine learning folks. The process outcome is like experiments and ideas, right? Like if you think about outcome, what you might want to think about as an outcome is, oh, I want to improve the revenue or whatnot, but that's really hard. But if you're someone who is going out like, okay, like this week I want to come up with like three or four experiments that might move the needle. Okay, nothing worked. To them, they might think, oh, nothing worked, like I suck. But to me it's like, wow, you've closed off all these other possible avenues for like research. Like you're gonna get to the place that you're gonna figure out that direction really soon. There's no way you try 30 different things and none of them work. Usually like 10 of them work, five of them work really well, two of them work really, really well, and one thing was like the nail in the head. So agency lets you sort of capture the volume of
experiments, and like experience lets you figure out like, oh, that other half, it's not worth doing, right? I think experience is going like, half these prompting papers don't make any sense, just use chain of thought and just, you know, use a for loop. That's basically right. It's like usually performance for me is around like how many experiments are you running, how oftentimes are you trying.
Alessio [00:39:32]: When do you give up on an experiment? Because at Stitch Fix you kind of gave up on language models, I guess, in a way, as a tool to use, and then maybe the tools got better. You were right at the time, and then the tool improved. I think there are similar paths in my engineering career where I try one approach and at the time it doesn't work, and then the thing changes, but then I kind of soured on that approach and I don't go back to it soon.
Jason [00:39:51]: I see. Yeah, how do you think about that loop? So usually when I'm coaching folks and they say like, oh, these things don't work, I'm not going to pursue them in the future, like one of the big things is like, hey, the negative result is a result and this is something worth documenting. Like this is an academia, like if it's negative, you don't just like not publish, right? But then like what do you actually write down? Like what you should write down is like, here are the conditions, this is the inputs and the outputs we tried the experiment on. And then one thing that's really valuable is basically writing down under what conditions would I revisit these experiments. These things don't work because of what we had at the time. If someone is reading this two years from now, under what conditions will we try again? That's really hard, but again, that's like another skill you kind of learn, right? It's like you do go back and you do experiments, you figure out why it works now. I think a lot of it here is just like scaling worked. Yeah. Rap lyrics, you know, that was because I did not have high enough quality data. If we phase shift and say, okay, you don't
even need training data, oh great, then it might just work in a different domain.
Alessio [00:40:48]: Do you have anything in your list that is like, it doesn't work now but I want to try it again later? Something that people should maybe keep in mind. You know, people always like, AGI when? You know, when are you going to know the AGI is here? Maybe it's less than that, but any stuff that you tried recently that didn't work that
Jason [00:41:01]: You think will get there? I mean, I think the personal assistants and the writing, I've shown to myself it's just not good enough yet. So I hired a writer and I hired a personal assistant. So now I'm gonna basically like work with these people until I figure out like what I can actually like automate and what are like the reproducible steps. But like I think the experiment for me is like, I'm gonna go pay a person like a thousand dollars a month to help me improve my life, and then let me get them to help me figure out like what are the components and how do I actually modularize something to get it to work. Because it's not just like a lot Gmail, Calendar, and like Notion, it's a little bit more complicated than that, but we just don't know what that is yet. Those are two sort of systems that I wish GPT-4 or Opus was actually good enough to just write me an essay, but most of the essays are still pretty bad.
Swyx [00:41:44]: Yeah, I would say, you know, on the personal assistants side, Lindy is probably the one I've seen the most. Flo was a speaker at the summit. I don't know if you've checked it out, or any other sort of agents assistant startup.
Jason [00:41:54]: Not recently. I haven't tried Lindy. They were not GA last time I was considering it. Yeah. Yeah, a lot of it now, it's like, oh, like really what I want you to do is take a look at all of my meetings and like write like a really good weekly summary email for my clients to remind them that I'm like, you know, thinking of them and like working for them, right? Or it's like, I want you to notice that like my Monday
is like way too packed and like block out more time, and also like email the people to do the reschedule and then try to opt in to move them around. And then I want you to say, oh, Jason should have like a 15 minute prep break after four back-to-backs. Those are things that now I know I can prompt them in, but can it do it well? Like before, I didn't even know that's what I wanted to prompt for, was defragging a calendar and adding breaks so I can like eat lunch. Yeah, that's the AGI test. Yeah, exactly. Compassion, right? I think one thing that, yeah, we didn't touch on it before, but
Alessio [00:42:44]: I think was interesting, you had this tweet a while ago about prompts should be code, and then there were a lot of companies trying to build prompt engineering tooling, kind of trying to turn the prompt into a more structured thing. What's your thought today? Now you want to turn the thinking into DAGs. Like should prompts still be code? Any updated ideas?
Jason [00:43:04]: It's the same thing, right? I think, you know, with Instructor it is very much like the output model is defined as a code object, that code object is sent to the LLM, and in return you get a data structure. So the outputs of these models I think should also be code objects, and the inputs somewhat should be code objects. But I think the one thing that Instructor tries to do is separate instruction, data, and the types of the output. And beyond that, I really just think that most of it should be still like managed pretty closely to the developer. Like so much of it is changing that if you give control of these systems away too early, you end up ultimately wanting them back. Like many companies I know that reach out are ones that are like, oh, we're going off of the frameworks, because now that we know what the business outcomes we're trying to optimize for, these frameworks don't work. Yeah, because we do RAG, but we want to do RAG to like sell you supplements or to have you like schedule the fitness appointment. The prompts are kind of too baked into
the systems to really pull them back out and like start doing upselling or something. It's really funny, but a lot of it ends up being like, once you understand the business outcomes, you care way more about the prompt.
Swyx [00:44:07]: Actually, this is fun. In our prep for this call, we were trying to say like, what can you as an independent person say that maybe me and Alessio cannot say, or me, you know, someone at a company say? What do you think is the market share of the frameworks, the LangChain, the LlamaIndex, the everything...
Jason [00:44:20]: Oh, massive, because not everyone wants to care about the code. Yeah. Right? I think that's a different question to like, what is the business model and are they going to be like massively profitable businesses, right? Making hundreds of millions of dollars, that feels like so straightforward, right? Because not everyone is a prompt engineer. Like there's so much productivity to be captured in like back office automations, right? It's not because they care about the prompts that they care about managing these things. Yeah, but those would be sort of low code experiences, you, yeah. I think the bigger challenge is like, okay, hundred million dollars, probably pretty easy, it's just time and effort, and they have the manpower and the money to sort of solve those problems. Again, if you go the VC route, then it's like you're talking about billions, and that's really the goal. That stuff for me, it's like pretty unclear. But again, that is to say that like I sort of am building things for developers who want to use infrastructure to build their own tooling. In terms of the amount of developers there are in the world versus downstream consumers of these things, or even just think of how many companies will use like the Adobes and the IBMs, right? Because they want something that's fully managed and they want something that they know will work. And if the incremental 10% requires you to hire another team of 20 people, you might not want to do it. And I think that kind
of organization is really good for, uh, those are bigger companies.
Swyx [00:45:32]: I just want to capture your thoughts on one more thing, which is you said you wanted most of the prompts to stay close to the developer. And Hamel Husain wrote this post which I really love called F You, Show Me the Prompt. Yeah. I think he cites you in one of those, part of the blog post. And I think DSPy is kind of like the complete antithesis of that, which is, I think it's interesting, because I also hold the strong view that AI is a better prompt engineer than you are, and I don't know how to square that. Wondering if you have thoughts.
Jason [00:45:58]: I think something like DSPy can work because there are like very short-term metrics to measure success, right? It is like, did you find the PII, or like did you write the multi-hop question the correct way? But in these workflows that I've been managing, a lot of it is, are we minimizing churn and maximizing retention? Yeah, that's a very long loop. It's not really like an Optuna-like training loop, right? Like those things are much harder to capture, so we don't actually have those metrics for that, right? And obviously we can figure out like, okay, is the summary good? But like how do you measure the quality of the summary? It's like that feedback loop, it ends up being a lot longer. And then again, when something changes, it's really hard to make sure that it works across these like newer models, or again like changes to work for the current process. Like when we migrate from like Anthropic to OpenAI, like there's just a ton of changes that are like infrastructure related, not necessarily around the prompt itself. Yeah. Cool. Any other AI engineering startups that you think should not exist before we wrap up? I mean, oh my gosh, I mean, a lot of it again, it's just like every time of investors like, how does this make a billion dollars? Like it doesn't. I'm gonna go back to just like tweeting and holding my breath underwater. Yeah. Like I don't really pay attention too much to
most of this. Like most of the stuff I'm doing is around like the consumer of like LLM calls. Yep. I think people just want to move really fast and they will end up picking these vendors, but I don't really know if anything has really like blown me out of the water. Like I only trust myself, but that's also a function of just being an old man. Like I think, you know, many companies are definitely very happy with using most of these tools anyways, but I definitely think I occupy a very small space in the engineering ecosystem.
Swyx [00:47:41]: Yeah, I would say one of the challenges here, you know, you talk about dealing in the consumer of LLMs space. I think that's where AI engineering differs from ML engineering, and I think a constant disconnect or cognitive dissonance in this field, in the AI engineers that have sprung up, is that they are not as good as the ML engineers, they are not as qualified. I think that, you know, you are someone who has credibility in the MLE space and you are also a very authoritative figure in the AI space. And I think so, and you know, I think you've built the de facto leading library. I think yours, I think Instructor should be part of the standard lib, even though I try to not use it. Like I basically also end up rebuilding Instructor, right? Like that's a lot of the back and forth that we had over the past two days. I think that's the fundamental thing that we're trying to figure out. Like there's very small supply of MLEs. Not everyone's going to have that experience that you had, but the global demand for AI is going to far outstrip the existing MLEs.
Jason [00:48:36]: So what do we do? Do we force everyone to go through the standard MLE curriculum, or do we make a new one? I'
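Earlier in this conversation Jason says he logs "prompt in, prompt out, latency, token costs" to a single Postgres table rather than using a dedicated observability product. What follows is only a minimal sketch of that idea, not his actual setup: it uses Python's built-in `sqlite3` as a stand-in for Postgres so it runs anywhere, and the table name `llm_calls`, the helper `logged_call`, and the `fake_llm` function with its whitespace token counts are all hypothetical.

```python
import sqlite3
import time

# Assumed schema: one row per model call (the table name is hypothetical).
SCHEMA = """
CREATE TABLE IF NOT EXISTS llm_calls (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    created_at REAL NOT NULL,
    prompt TEXT NOT NULL,
    response TEXT NOT NULL,
    latency_s REAL NOT NULL,
    prompt_tokens INTEGER NOT NULL,
    completion_tokens INTEGER NOT NULL
)
"""

INSERT = (
    "INSERT INTO llm_calls "
    "(created_at, prompt, response, latency_s, prompt_tokens, completion_tokens) "
    "VALUES (?, ?, ?, ?, ?, ?)"
)


def fake_llm(prompt):
    """Hypothetical stand-in for a real model call.

    Returns the reply plus crude whitespace token counts; a real client
    would report its own usage numbers.
    """
    response = prompt.upper()
    return response, len(prompt.split()), len(response.split())


def logged_call(conn, llm, prompt):
    """Time one call (time start, time end) and write a row describing it."""
    start = time.perf_counter()
    response, prompt_tokens, completion_tokens = llm(prompt)
    latency_s = time.perf_counter() - start
    conn.execute(
        INSERT,
        (time.time(), prompt, response, latency_s, prompt_tokens, completion_tokens),
    )
    conn.commit()
    return response


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
logged_call(conn, fake_llm, "extract the user name")
row = conn.execute(
    "SELECT prompt, response, prompt_tokens, completion_tokens FROM llm_calls"
).fetchone()
print(row)
```

With a table like this, cost and latency questions become ordinary SQL queries, which seems to be the appeal of the approach for workflows that make one call at a time.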

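Jason describes Instructor's core idea as: the output model is defined as a code object, that object is sent to the LLM, and a data structure comes back. The sketch below illustrates only that shape, using the standard library. It does not use or reproduce Instructor's API, no model is called, and `UserDetail`, `describe`, `parse`, and the sample reply are hypothetical names invented for the example.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class UserDetail:
    """The desired output, declared as an ordinary code object."""
    name: str
    age: int


def describe(model):
    """Field names and type names that could accompany the prompt."""
    return {f.name: getattr(f.type, "__name__", str(f.type)) for f in fields(model)}


def parse(model, raw_json):
    """Check a raw JSON reply against the model and build an instance."""
    data = json.loads(raw_json)
    kwargs = {}
    for f in fields(model):
        if f.name not in data:
            raise ValueError(f"missing field: {f.name}")
        value = data[f.name]
        if not isinstance(value, f.type):
            raise TypeError(f"{f.name} should be of type {f.type.__name__}")
        kwargs[f.name] = value
    return model(**kwargs)


# Hypothetical model reply; in practice this string would come from the LLM.
reply = '{"name": "Jason", "age": 30}'
user = parse(UserDetail, reply)
print(describe(UserDetail))
print(user)
```

Keeping the instruction, the input data, and the output type as separate pieces is the separation Jason mentions; here the type lives in `UserDetail`, independent of whatever prompt text is used.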

Designing A Non-Relational Database Engine

Data Engineering Podcast

Play Episode Listen Later Apr 14, 2024 76:01

Summary
Databases come in a variety of formats for different use cases. The default association with the term "database" is relational engines, but non-relational engines are also used quite widely. In this episode Oren Eini, CEO and creator of RavenDB, explores the nuances of relational vs. non-relational engines, and the strategies for designing a non-relational database.

Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
This episode is brought to you by Datafold, a testing automation platform for data engineers that prevents data quality issues from entering every part of your data workflow, from migration to dbt deployment. Datafold has recently launched data replication testing, providing ongoing validation for source-to-target replication. Leverage Datafold's fast cross-database data diffing and Monitoring to test your replication pipelines automatically and continuously. Validate consistency between source and target at any scale, and receive alerts about any discrepancies. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold (https://www.dataengineeringpodcast.com/datafold).
Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster (https://www.dataengineeringpodcast.com/dagster) today to get started. Your first 30 days are free!
Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
Your host is Tobias Macey and today I'm interviewing Oren Eini about the work of designing and building a NoSQL database engine.

Interview
Introduction
How did you get involved in the area of data management?
Can you describe what constitutes a NoSQL database?
How have the requirements and applications of NoSQL engines changed since they first became popular ~15 years ago?
What are the factors that convince teams to use a NoSQL vs. SQL database?
NoSQL is a generalized term that encompasses a number of different data models. How does the underlying representation (e.g. document, K/V, graph) change that calculus?
How has the evolution in data formats (e.g. N-dimensional vectors, point clouds, etc.) changed the landscape for NoSQL engines?
When designing and building a database, what are the initial set of questions that need to be answered?
How many "core capabilities" can you reasonably design around before they conflict with each other?
How have you approached the evolution of RavenDB as you add new capabilities and mature the project?
What are some of the early decisions that had to be unwound to enable new capabilities?
If you were to start from scratch today, what database would you build?
What are the most interesting, innovative, or unexpected ways that you have seen RavenDB/NoSQL databases used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on RavenDB?
When is a NoSQL database/RavenDB the wrong choice?
What do you have planned for the future of RavenDB?

Contact Info
Blog (https://ayende.com/blog/)
LinkedIn (https://www.linkedin.com/in/ravendb/?originalSubdomain=il)

Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning.
Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story.

Links
RavenDB (https://ravendb.net/)
RSS (https://en.wikipedia.org/wiki/RSS)
Object Relational Mapper (ORM) (https://en.wikipedia.org/wiki/Object%E2%80%93relational_mapping)
Relational Database (https://en.wikipedia.org/wiki/Relational_database)
NoSQL (https://en.wikipedia.org/wiki/NoSQL)
CouchDB (https://couchdb.apache.org/)
Navigational Database (https://en.wikipedia.org/wiki/Navigational_database)
MongoDB (https://www.mongodb.com/)
Redis (https://redis.io/)
Neo4J (https://neo4j.com/)
Cassandra (https://cassandra.apache.org/_/index.html)
Column-Family (https://en.wikipedia.org/wiki/Column_family)
SQLite (https://www.sqlite.org/)
LevelDB (https://github.com/google/leveldb)
Firebird DB (https://firebirdsql.org/)
fsync (https://man7.org/linux/man-pages/man2/fsync.2.html)
Esent DB? (https://learn.microsoft.com/en-us/windows/win32/extensible-storage-engine/extensible-storage-engine-managed-reference)
KNN == K-Nearest Neighbors (https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm)
RocksDB (https://rocksdb.org/)
C# Language (https://en.wikipedia.org/wiki/C_Sharp_(programming_language))
ASP.NET (https://en.wikipedia.org/wiki/ASP.NET)
QUIC (https://en.wikipedia.org/wiki/QUIC)
Dynamo Paper (https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf)
Database Internals (https://amzn.to/49A5wjF) book (affiliate link)
Designing Data Intensive Applications (https://amzn.to/3JgCZFh) book (affiliate link)
The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)

Establish A Single Source Of Truth For Your Data Consumers With A Semantic Layer

Data Engineering Podcast

Play Episode Listen Later Apr 7, 2024 56:23

Summary Maintaining a single source of truth for your data is the biggest challenge in data engineering. Different roles and tasks in the business need their own ways to access and analyze the data in the organization. In order to enable this use case, while maintaining a single point of access, the semantic layer has evolved as a technological solution to the problem. In this episode Artyom Keydunov, creator of Cube, discusses the evolution and applications of the semantic layer as a component of your data platform, and how Cube provides speed and cost optimization for your data consumers. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management This episode is brought to you by Datafold – a testing automation platform for data engineers that prevents data quality issues from entering every part of your data workflow, from migration to dbt deployment. Datafold has recently launched data replication testing, providing ongoing validation for source-to-target replication. Leverage Datafold's fast cross-database data diffing and Monitoring to test your replication pipelines automatically and continuously. Validate consistency between source and target at any scale, and receive alerts about any discrepancies. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold (https://www.dataengineeringpodcast.com/datafold). Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. 
Go to dataengineeringpodcast.com/dagster (https://www.dataengineeringpodcast.com/dagster) today to get started. Your first 30 days are free! Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Your host is Tobias Macey and today I'm interviewing Artyom Keydunov about the role of the semantic layer in your data platform Interview Introduction How did you get involved in the area of data management? Can you start by outlining the technical elements of what it means to have a "semantic layer"? In the past couple of years there was a rapid hype cycle around the "metrics layer" and "headless BI", which has largely faded. Can you give your assessment of the current state of the industry around the adoption/implementation of these concepts? What are the benefits of having a discrete service that offers the business metrics/semantic mappings as opposed to implementing those concepts as part of a more general system? (e.g. dbt, BI, warehouse marts, etc.) At what point does it become necessary/beneficial for a team to adopt such a service? 
What are the challenges involved in retrofitting a semantic layer into a production data system? evolution of requirements/usage patterns technical complexities/performance and cost optimization What are the most interesting, innovative, or unexpected ways that you have seen Cube used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Cube? When is Cube/a semantic layer the wrong choice? What do you have planned for the future of Cube? Contact Info LinkedIn (https://www.linkedin.com/in/keydunov/) keydunov (https://github.com/keydunov) on GitHub Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story. 
Links Cube (https://cube.dev/) Semantic Layer (https://en.wikipedia.org/wiki/Semantic_layer) Business Objects (https://en.wikipedia.org/wiki/BusinessObjects) Tableau (https://www.tableau.com/) Looker (https://cloud.google.com/looker/?hl=en) Podcast Episode (https://www.dataengineeringpodcast.com/looker-with-daniel-mintz-episode-55/) Mode (https://mode.com/) Thoughtspot (https://www.thoughtspot.com/) LightDash (https://www.lightdash.com/) Podcast Episode (https://www.dataengineeringpodcast.com/lightdash-exploratory-business-intelligence-episode-232/) Embedded Analytics (https://en.wikipedia.org/wiki/Embedded_analytics) Dimensional Modeling (https://en.wikipedia.org/wiki/Dimensional_modeling) Clickhouse (https://clickhouse.com/) Podcast Episode (https://www.dataengineeringpodcast.com/clickhouse-data-warehouse-episode-88/) Druid (https://druid.apache.org/) BigQuery (https://cloud.google.com/bigquery?hl=en) Starburst (https://www.starburst.io/) Pinot (https://pinot.apache.org/) Snowflake (https://www.snowflake.com/en/) Podcast Episode (https://www.dataengineeringpodcast.com/snowflakedb-cloud-data-warehouse-episode-110/) Arrow Datafusion (https://arrow.apache.org/datafusion/) Metabase (https://www.metabase.com/) Podcast Episode (https://www.dataengineeringpodcast.com/metabase-with-sameer-al-sakran-episode-29) Superset (https://superset.apache.org/) Alation (https://www.alation.com/) Collibra (https://www.collibra.com/) Podcast Episode (https://www.dataengineeringpodcast.com/collibra-enterprise-data-governance-episode-188) Atlan (https://atlan.com/) Podcast Episode (https://www.dataengineeringpodcast.com/atlan-data-team-collaboration-episode-179) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)

Adding Anomaly Detection And Observability To Your dbt Projects Is Elementary

Data Engineering Podcast

Play Episode Listen Later Mar 31, 2024 50:44

Summary Working with data is a complicated process, with numerous chances for something to go wrong. Identifying and accounting for those errors is a critical piece of building trust in the organization that your data is accurate and up to date. While there are numerous products available to provide that visibility, they all have different technologies and workflows that they focus on. To bring observability to dbt projects the team at Elementary embedded themselves into the workflow. In this episode Maayan Salom explores the approach that she has taken to bring observability, enhanced testing capabilities, and anomaly detection into every step of the dbt developer experience. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. 
Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster (https://www.dataengineeringpodcast.com/dagster) today to get started. Your first 30 days are free! This episode is brought to you by Datafold – a testing automation platform for data engineers that prevents data quality issues from entering every part of your data workflow, from migration to dbt deployment. Datafold has recently launched data replication testing, providing ongoing validation for source-to-target replication. Leverage Datafold's fast cross-database data diffing and Monitoring to test your replication pipelines automatically and continuously. Validate consistency between source and target at any scale, and receive alerts about any discrepancies. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold (https://www.dataengineeringpodcast.com/datafold). Your host is Tobias Macey and today I'm interviewing Maayan Salom about how to incorporate observability into a dbt-oriented workflow and how Elementary can help Interview Introduction How did you get involved in the area of data management? Can you start by outlining what elements of observability are most relevant for dbt projects? What are some of the common ad-hoc/DIY methods that teams develop to acquire those insights? What are the challenges/shortcomings associated with those approaches? Over the past ~3 years there were numerous data observability systems/products created. What are some of the ways that the specifics of dbt workflows are not covered by those generalized tools? What are the insights that can be more easily generated by embedding into the dbt toolchain and development cycle? Can you describe what Elementary is and how it is designed to enhance the development and maintenance work in dbt projects? 
How is Elementary designed/implemented? How have the scope and goals of the project changed since you started working on it? What are the engineering challenges/frustrations that you have dealt with in the creation and evolution of Elementary? Can you talk us through the setup and workflow for teams adopting Elementary in their dbt projects? How does the incorporation of Elementary change the development habits of the teams who are using it? What are the most interesting, innovative, or unexpected ways that you have seen Elementary used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Elementary? When is Elementary the wrong choice? What do you have planned for the future of Elementary? Contact Info LinkedIn (https://www.linkedin.com/in/maayansa/?originalSubdomain=il) Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story. 
Links Elementary (https://www.elementary-data.com/) Data Observability (https://www.montecarlodata.com/blog-what-is-data-observability/) dbt (https://www.getdbt.com/) Datadog (https://www.datadoghq.com/) pre-commit (https://pre-commit.com/) dbt packages (https://docs.getdbt.com/docs/build/packages) SQLMesh (https://sqlmesh.readthedocs.io/en/latest/) Malloy (https://www.malloydata.dev/) SDF (https://www.sdf.com/) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)

Ship Smarter Not Harder With Declarative And Collaborative Data Orchestration On Dagster+

Data Engineering Podcast

Play Episode Listen Later Mar 24, 2024 55:39

Summary A core differentiator of Dagster in the ecosystem of data orchestration is its focus on software-defined assets as a means of building declarative workflows. With the launch of Dagster+ as the redesigned commercial companion to the open source project, the team is investing in that capability with a suite of new features. In this episode Pete Hunt, CEO of Dagster Labs, outlines these new capabilities, how they reduce the burden on data teams, and the increased collaboration that they enable across teams and business units. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster (https://www.dataengineeringpodcast.com/dagster) today to get started. Your first 30 days are free! Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? 
Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Your host is Tobias Macey and today I'm interviewing Pete Hunt about how the launch of Dagster+ will level up your data platform and orchestrate across language platforms Interview Introduction How did you get involved in the area of data management? Can you describe what the focus of Dagster+ is and the story behind it? What problems are you trying to solve with Dagster+? What are the notable enhancements beyond the Dagster Core project that this updated platform provides? How is it different from the current Dagster Cloud product? In the launch announcement you tease new capabilities that would be great to explore in turns: Make data a team sport, enabling data teams across the organization Deliver reliable, high quality data the organization can trust Observe and manage data platform costs Master the heterogeneous collection of technologies—both traditional and Modern Data Stack What are the business/product goals that you are focused on improving with the launch of Dagster+ What are the most interesting, innovative, or unexpected ways that you have seen Dagster used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on the design and launch of Dagster+? When is Dagster+ the wrong choice? What do you have planned for the future of Dagster/Dagster Cloud/Dagster+? Contact Info Twitter (https://twitter.com/floydophone) LinkedIn (https://linkedin.com/in/pwhunt) Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. 
The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story. Links Dagster (https://dagster.io/) Podcast Episode (https://www.dataengineeringpodcast.com/dagster-data-applications-episode-104) Dagster+ Launch Event (https://dagster.io/events/dagster-plus-launch-event) Hadoop (https://hadoop.apache.org/) MapReduce (https://en.wikipedia.org/wiki/MapReduce) Pydantic (https://docs.pydantic.dev/latest/) Software Defined Assets (https://docs.dagster.io/concepts/assets/software-defined-assets) Dagster Insights (https://docs.dagster.io/dagster-cloud/insights) Dagster Pipes (https://docs.dagster.io/guides/dagster-pipes) Conway's Law (https://en.wikipedia.org/wiki/Conway%27s_law) Data Mesh (https://www.datamesh-architecture.com/) Dagster Code Locations (https://docs.dagster.io/concepts/code-locations) Dagster Asset Checks (https://docs.dagster.io/concepts/assets/asset-checks) Dave & Buster's (https://www.daveandbusters.com/us/en/home) SQLMesh (https://sqlmesh.readthedocs.io/en/latest/) Podcast Episode (https://www.dataengineeringpodcast.com/sqlmesh-open-source-dataops-episode-380) SDF (https://www.sdf.com/) Malloy (https://www.malloydata.dev/) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)

#454: Data Pipelines with Dagster

Talk Python To Me - Python conversations for passionate developers

Play Episode Listen Later Mar 21, 2024 58:25

Do you have data that you pull from external sources or is generated and appears at your digital doorstep? I bet that data needs processed, filtered, transformed, distributed, and much more. One of the biggest tools to create these data pipelines with Python is Dagster. And we are fortunate to have Pedram Navid on the show this episode. Pedram is the Head of Data Engineering and DevRel at Dagster Labs. And we're talking data pipelines this week at Talk Python. Episode sponsors Talk Python Courses Posit Links from the show Rock Solid Python with Types Course: training.talkpython.fm Pedram on Twitter: twitter.com Pedram on LinkedIn: linkedin.com Ship data pipelines with extraordinary velocity: dagster.io dagster-open-platform: github.com The Dagster Master Plan: dagster.io data load tool (dlt): dlthub.com DataFrames for the new era: pola.rs Apache Arrow: arrow.apache.org DuckDB is a fast in-process analytical database: duckdb.org Ship trusted data products faster: www.getdbt.com Watch this episode on YouTube: youtube.com Episode transcripts: talkpython.fm --- Stay in touch with us --- Subscribe to us on YouTube: youtube.com Follow Talk Python on Mastodon: talkpython Follow Michael on Mastodon: mkennedy

2827: Data, Decisions, and Dagster: Nick Schrock's Blueprint for Engineering Excellence

The Tech Blog Writer Podcast

Play Episode Listen Later Mar 10, 2024 25:13

Nick Schrock, the innovative mind behind Dagster Labs and the renowned co-creator of GraphQL, joins me on Tech Talks Daily. Nick takes us through his illustrious journey from his foundational days at Facebook, where he spearheaded the Product Infrastructure team, to his visionary leap into solving some of the most pressing issues facing data and ML engineering today through Dagster, his open-source data orchestration platform. Nick shares insights from his experience at Facebook, elaborating on how internal tools like React and GraphQL revolutionized the company's development practices and set new benchmarks for the developer community worldwide. His transition from Facebook to founding Dagster Labs was driven by a deep-seated desire to address the complexities and inefficiencies in data infrastructure, a challenge he identified as a critical pain point for engineers across industries. Throughout the conversation, Nick delves into the core areas of data orchestration, highlighting the importance of enabling practitioners to have end-to-end ownership of data pipelines without the need for a centralized team. This approach, he argues, is pivotal in the era where data and ML engineering are becoming fundamental to decision-making processes both in human and business contexts. Much of the discussion is dedicated to exploring the future of open source in the SaaS-dominated landscape and the operational convergence of ML, AI, and data engineering. Nick emphasizes the delicate balance required in managing an open-core business model and shares personal anecdotes about the "Engineering Founder's Dilemma" — the intricate dance between leading the vision and running the company. Listeners will gain a unique perspective on the evolution of data platforms and engineering, underscored by Nick's advocacy for a robust, community-driven approach to open-source development. 
He sheds light on the challenges and rewards of building a platform like Dagster, which aims to simplify and democratize data infrastructure for companies of all sizes. Nick also advises technical founders on maintaining equilibrium between their visionary roles and their companies' operational demands. This episode is a deep dive into the mechanics of data orchestration and a masterclass in leadership, innovation, and the transformative power of open-source projects in addressing complex engineering challenges.

From Concept to Market: The PMF Journey of Dagster

Rocketship.fm

Play Episode Listen Later Feb 22, 2024 31:17

On today's episode we discuss Dagster, a data orchestration platform developed by Dagster Labs, founded by Nick Schrock in 2018. Schrock, known for co-creating GraphQL, envisioned Dagster as a solution to the challenges of managing complex data pipelines reliably and at scale. The discussion centers on exploring Dagster's history, particularly focusing on the journey to finding product-market fit. Sacca mentions his recent interview with Pedram Navid, the Head of Data Engineering and DevRel at Dagster Labs, highlighting Navid's extensive experience in the data industry. Sacca expresses anticipation for discussing Dagster's journey to finding initial product-market fit, indicating relevance for product-focused audiences. The conversation also touches on the concept of "quantified organizations," a burgeoning trend in the data space. The dialogue concludes with a teaser for related discussions and insights to follow in the podcast episode. This podcast is brought to you by: Miro: Go to Miro.com/podcast and get your first three Miro boards free forever. Hubspot: Listen to The Science of Scaling wherever you get your podcasts. Gigantic: Learn more about Gigantic's Product Leadership course, Generative AI, Product Management course, Executive Leadership course & Web3 for Marketing course, AI Product Management Course, and Customer Research and Discovery Course at Gigantic.is Rocketship is brought to you by The Podglomerate. *** Previous Guests include Seth Godin, Christian Idiodi, Ash Maurya, Dan Shapiro of Glowforge, Lolita Taub, Amy Hood of Hoodzpah, Amanda Goetz, Helen Tran, Ben Parr, Mac Conwell, Charli Marie Prangley of ConvertKit, Kandis O'Brian, Laura Roeder, Brenna Loury of Doist, Lopa van der Mersch of Rasa, Ken Norton, Randy Silver, Sanjiv Kalevar of OpenView Venture Partners, Dan Olsen, Jay Clouse, Melissa Perri, Dheerja Kaur of Robinhood, Rahul Vohra of Superhuman, Rich Mironov, Ben Foster, ChatGPT, Ron Weiner of Earth Class Mail. 
*** This show is a part of the Podglomerate network, a company that produces, distributes, and monetizes podcasts. We encourage you to visit the website and sign up for our newsletter for more information about our shows, launches, and events. For more information on how The Podglomerate treats data, please see our Privacy Policy. Since you're listening to Rocketship, we'd like to suggest you also try other Podglomerate shows surrounding entrepreneurship, business, and careers like Creative Elements and Freelance to Founder. Learn more about your ad choices. Visit megaphone.fm/adchoices

Episode 65: Scaling Data Pipelines with Nick Schrock, Founder/CTO of Dagster Labs

Open Source Underdogs

Play Episode Listen Later Feb 14, 2024 35:44

Intro Mike Schwartz: Hello and welcome to Open Source Underdogs! I’m your host Mike Schwartz, and this is episode 65 with Nick Schrock, Founder and CTO of Dagster, a platform that helps companies create data pipelines, which is critical to transform and update data in order to make it useful, for example, to generate reports, content,...

Data Sharing Across Business And Platform Boundaries

Data Engineering Podcast

Play Episode Listen Later Feb 11, 2024 59:55

Summary Sharing data is a simple concept, but complicated to implement well. There are numerous business rules and regulatory concerns that need to be applied. There are also numerous technical considerations to be made, particularly if the producer and consumer of the data aren't using the same platforms. In this episode Andrew Jefferson explains the complexities of building a robust system for data sharing, the techno-social considerations, and how the Bobsled platform that he is building aims to simplify the process. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. 
Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster (https://www.dataengineeringpodcast.com/dagster) today to get started. Your first 30 days are free! Your host is Tobias Macey and today I'm interviewing Andy Jefferson about how to solve the problem of data sharing Interview Introduction How did you get involved in the area of data management? Can you start by giving some context and scope of what we mean by "data sharing" for the purposes of this conversation? What is the current state of the ecosystem for data sharing protocols/practices/platforms? What are some of the main challenges/shortcomings that teams/organizations experience with these options? What are the technical capabilities that need to be present for an effective data sharing solution? How does that change as a function of the type of data? (e.g. tabular, image, etc.) What are the requirements around governance and auditability of data access that need to be addressed when sharing data? What are the typical boundaries along which data access requires special consideration for how the sharing is managed? Many data platform vendors have their own interfaces for data sharing. What are the shortcomings of those options, and what are the opportunities for abstracting the sharing capability from the underlying platform? What are the most interesting, innovative, or unexpected ways that you have seen data sharing/Bobsled used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on data sharing? When is Bobsled the wrong choice? What do you have planned for the future of data sharing? 
Contact Info LinkedIn (https://www.linkedin.com/in/andyjefferson/?originalSubdomain=de) Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story. Links Bobsled (https://www.bobsled.co/) OLAP == OnLine Analytical Processing (https://en.wikipedia.org/wiki/Online_analytical_processing) Cassandra (https://cassandra.apache.org/_/index.html) Podcast Episode (https://www.dataengineeringpodcast.com/cassandra-global-scale-database-episode-220) Neo4J (https://neo4j.com/) FTP == File Transfer Protocol (https://en.wikipedia.org/wiki/File_Transfer_Protocol) S3 Access Points (https://aws.amazon.com/s3/features/access-points/) Snowflake Sharing (https://docs.snowflake.com/en/guides-overview-sharing) BigQuery Sharing (https://cloud.google.com/bigquery/docs/authorized-datasets) Databricks Delta Sharing (https://www.databricks.com/product/delta-sharing) DuckDB (https://duckdb.org/) Podcast Episode (https://www.dataengineeringpodcast.com/duckdb-in-process-olap-database-episode-270/) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA 
(http://creativecommons.org/licenses/by-sa/3.0/)

Data Engineering, AI, Entrepreneurship - Unveiling 2024's Tech Horizons, A Conversation With Nick Schrock - SPaMCAST 794

Software Process and Measurement Cast

Play Episode Listen Later Feb 11, 2024 32:02

The Software Process and Measurement Cast 794 features our conversation with Nick Schrock. Nick and I discussed data engineering, AI, and entrepreneurship. As the founder of Dagster Labs, Nick is well placed to talk about three of the hottest topics of 2024. Nick's Bio: Nick is the Founder and CTO of Dagster Labs, the company behind Dagster, a popular open-source data orchestration platform. Before Dagster Labs, he was a Principal Engineer and Director at Facebook from 2009 to 2017, where he founded the Product Infrastructure team and co-created GraphQL. After cutting his teeth at Facebook, he pursued his passion for working on engineers' pain points after hearing that data infrastructure was a big issue. He founded Dagster to address this issue, highlighting how quickly open-source projects were able to make an impact at legacy companies. Dagster couldn't come at a more critical time as data and ML engineering are beginning to drive both human and business decision-making. Stop Project Chaos: The Ultimate Guide to Predictable Work Intake Feeling overwhelmed by endless tasks and unpredictable deadlines? You're not alone. But what if you could transform your project into a well-oiled machine, delivering consistent value on time? Introducing Mastering Work Intake: From Chaos to Predictable Delivery. Ditch the black hole of endless requests. Discover practical strategies for capturing, filtering, and prioritizing work effectively. Gain control of your work pipeline. Learn how to navigate different types of work at different stages, ensuring smooth flow and efficient delivery. Become a master of saying “no”. This book equips you with the tools and techniques to confidently decline unproductive work and protect your team's focus. Stop the chaos and start delivering! Grab your copy of Mastering Work Intake today and unleash your project's full potential. 
JRoss: or Amazon (US): Are you ready to commit to not letting bad work drag you down? Invest in your success and learn how to master work intake for a smoother, more rewarding project experience. Join our cohort-based workshop. The first cohort is starting on March 1st, 2024. Details at Re-read Saturday News Chapter 8 shatters myths! Once upon a time in a land far, far away, I believed a few of these myths. Ok, it was just about five years ago that I was dissuaded from the last of my misbeliefs. I believe that many of my misconceptions are founded in the collision of words between classical statistics and words Shewhart used for Process Behavior Charts (PBC). Buy a copy and get reading – . Week 1: – Week 2: – Week 3: – Week 4: – Week 5: - Week 6: - Week 7: - Week 8: XmR Charts and the Four Basic Metrics of Flow - Week 9: - Next SPaMCAST In the Software Process and Measurement Cast 795, why do teams, products, projects, and companies exist? Why do we care? (We really should!) We will also have a visit from Susan Parente who brings her Not A Scrumdamentalist column to the podcast!

director founders ai conversations discover tech entrepreneurship invest unveiling cto ditch ml horizons graphql principal engineer data engineering schrock jross dagster

Learning and Sharing in Public with Dagster Labs' Pedram Navid

Partially Redacted: Data Privacy, Security & Compliance

Play Episode Listen Later Feb 7, 2024 35:58

In this episode Sean is joined by Pedram Navid, Head of Data Engineering at Dagster Labs. They discuss the unique challenges and opportunities in the realm of data engineering, particularly the culture of learning and sharing within the field. Pedram discusses the traditionally guarded nature of data engineering, contrasting it with the more open-source approach in software engineering. He highlights the potential downsides of this secrecy, such as the difficulty in learning best practices and innovating. The discussion also touches on the balance companies must strike between contributing to communal knowledge and protecting valuable data and intellectual property. Pedram shares insights from his experiences at Dagster Labs, including the development of the Dagster Open Platform and its impact on fostering a culture of openness in data engineering. Additionally, they explore the future of collaboration in the field, considering emerging technologies and methodologies that could further encourage sharing and innovation over the next 5-10 years. Links: Dagster Open Platform Pedram Navid

head learning sharing public navid data engineering pedram dagster

Tackling Real Time Streaming Data With SQL Using RisingWave

Data Engineering Podcast

Play Episode Listen Later Feb 4, 2024 56:55

Summary Stream processing systems have long been built with a code-first design, adding SQL as a layer on top of the existing framework. RisingWave is a database engine that was created specifically for stream processing, with S3 as the storage layer. In this episode Yingjun Wu explains how it is architected to power analytical workflows on continuous data flows, and the challenges of making it responsive and scalable. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. 
Go to dataengineeringpodcast.com/dagster (https://www.dataengineeringpodcast.com/dagster) today to get started. Your first 30 days are free! Your host is Tobias Macey and today I'm interviewing Yingjun Wu about the RisingWave database and the intricacies of building a stream processing engine on S3 Interview Introduction How did you get involved in the area of data management? Can you describe what RisingWave is and the story behind it? There are numerous stream processing engines, near-real-time database engines, streaming SQL systems, etc. What is the specific niche that RisingWave addresses? What are some of the platforms/architectures that teams are replacing with RisingWave? What are some of the unique capabilities/use cases that RisingWave provides over other offerings in the current ecosystem? Can you describe how RisingWave is architected and implemented? How have the design and goals/scope changed since you first started working on it? What are the core design philosophies that you rely on to prioritize the ongoing development of the project? What are the most complex engineering challenges that you have had to address in the creation of RisingWave? Can you describe a typical workflow for teams that are building on top of RisingWave? What are the user/developer experience elements that you have prioritized most highly? What are the situations where RisingWave can/should be a system of record vs. a point-in-time view of data in transit, with a data warehouse/lakehouse as the longitudinal storage and query engine? What are the most interesting, innovative, or unexpected ways that you have seen RisingWave used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on RisingWave? When is RisingWave the wrong choice? What do you have planned for the future of RisingWave? 
Contact Info yingjunwu (https://github.com/yingjunwu) on GitHub Personal Website (https://yingjunwu.github.io/) LinkedIn (https://www.linkedin.com/in/yingjun-wu-4b584536/) Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story. Links RisingWave (https://risingwave.com/) AWS Redshift (https://aws.amazon.com/redshift/) Flink (https://flink.apache.org/) Podcast Episode (https://www.dataengineeringpodcast.com/apache-flink-with-fabian-hueske-episode-57) Clickhouse (https://clickhouse.com/) Podcast Episode (https://www.dataengineeringpodcast.com/clickhouse-data-warehouse-episode-88/) Druid (https://druid.apache.org/) Materialize (https://materialize.com/) Spark (https://spark.apache.org/) Trino (https://trino.io/) Snowflake (https://www.snowflake.com/en/) Kafka (https://kafka.apache.org/) Iceberg (https://iceberg.apache.org/) Podcast Episode (https://www.dataengineeringpodcast.com/iceberg-with-ryan-blue-episode-52/) Hudi (https://hudi.apache.org/) Podcast Episode (https://www.dataengineeringpodcast.com/hudi-streaming-data-lake-episode-209) Postgres (https://www.postgresql.org/) Debezium (https://debezium.io/) Podcast Episode (https://www.dataengineeringpodcast.com/debezium-change-data-capture-episode-114) The intro and outro music is from 
The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)

Bridging the Developer Education Gap with Tim Castillo of Dagster Labs

The State of Developer Education

Play Episode Listen Later Jan 25, 2024 36:13

In this week's episode, Jon is joined by Tim Castillo, Data Engineer and Developer Advocate for Dagster Labs. Tim is a technology enthusiast and expert with a strong foundation in computer science and data engineering. Carving out his career trajectory in the bustling tech community of the 2010s, he comes with an arsenal of experience shaped by early massive open online courses. In this episode, we get to hear about Tim's journey from college undergraduate to tech advocate, sharing insightful strategies for upskilling technologists and building adaptable, robust tech communities.

education developers bridging labs castillo carving developer advocate data engineers dagster

The role of AI and LLMs in data -- Pedram Navid // Dagster Labs

Data Driven

Play Episode Listen Later Jan 23, 2024 22:26

Pedram Navid, Head of Data Engineering and DevRel at Dagster Labs, explores data products and AI and large language models' roles in data. While AI has limitations, it offers data practitioners new ways to explore and leverage data, transforming the analysis process. Further, AI and LLMs democratize access to data analysis for non-technical people, enabling them to explore and derive insights from complex datasets without feeling overwhelmed. Today, Pedram discusses the role of AI and large language models in data.
Connect With:
Pedram Navid: Website // LinkedIn
Data Driven Podcast: Email // LinkedIn // Twitter
Diedre Downing: Website // LinkedIn // Twitter

head learning ai data presentation excel labs visualizations professional development business intelligence analytic navid data engineering devrel pedram dagster

Cutting through the noise of data products -- Pedram Navid // Dagster Labs

Data Driven

Play Episode Listen Later Jan 22, 2024 20:17

Pedram Navid, Head of Data Engineering and DevRel at Dagster Labs, explores data products and AI and large language models' roles in data. With the rising data literacy levels in organizations, leaders are increasingly inclined to access data independently. Shifting towards viewing data as a product, data teams can strategically enable self-service and empower them to find the answers they need to make business decisions. Today, Pedram discusses cutting through the noise of data products.
Connect With:
Pedram Navid: Website // LinkedIn
Data Driven Podcast: Email // LinkedIn // Twitter
Diedre Downing: Website // LinkedIn // Twitter

head learning ai data shifting products presentation excel labs visualizations professional development business intelligence analytic navid data engineering devrel cutting through the noise pedram dagster

Revolutionizing Data Engineering: Dagster's Journey and Open-Source Impact with CTO Nick Schrock | Ep 789

The Digital Executive

Play Episode Listen Later Jan 18, 2024 15:18

In this enlightening episode of The Digital Executive Podcast, host Brian Thomas welcomes Nick Schrock, the innovative mind behind Dagster Labs. Nick delves into his impressive journey from a principal engineer and director at Facebook to founding Dagster Labs, driven by his desire to solve engineering pain points. He explains how his past experiences, including co-creating GraphQL, shaped his approach to addressing complex data infrastructure challenges.
Dagster, a prominent open-source data orchestration platform, is at the heart of their discussion. Nick elaborates on how Dagster redefines data engineering by prioritizing asset-oriented workflows over task-focused ones, ensuring more robust and intuitive data management. He emphasizes the platform's unique approach to consolidating tools, improving lineage tracking, and enhancing operational context, all contributing to a more streamlined and effective data engineering process.
Furthermore, Nick highlights the significant role of open-source projects in shaping the tech industry, particularly in legacy companies. He shares insights into how these projects foster community building and empower engineers to adopt new technologies more freely.
The podcast concludes with Nick's perspective on the future of data and ML engineering. He anticipates a shift towards greater accuracy and productivity, with Dagster strategically positioned to address these evolving needs by integrating software engineering principles into data engineering.
Overall, the episode offers a deep dive into the transformative impact of Dagster in the data engineering landscape and the broader implications for the tech industry.

revolutionizing open source ml graphql data engineering brian thomas schrock dagster

171: Machine Learning Pipelines Are Still Data Pipelines with Sandy Ryza of Dagster

The Data Stack Show

Play Episode Listen Later Jan 3, 2024 55:50

Highlights from this week's conversation include:
The role of an orchestrator in the lifecycle of data (1:34)
Relevance of orchestration in data pipelines (2:45)
Changes around data ops and MLOps (3:37)
Data Cleaning (11:42)
Overview of Dagster (13:50)
Assets vs Tasks in Data Pipeline (19:15)
Building a Data Pipeline with Dagster (25:40)
Difference between Data Asset and Materialized Dataset (28:28)
Defining Lineage and Data Assets in Dagster (29:32)
The boundaries of software and organizational structures (37:25)
The benefits of a unified orchestration framework (39:56)
Orchestration in the development phase (45:29)
The emergence of analytics engineer role (51:53)
Fluidity in data pipeline and infrastructure roles (52:40)
The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

building data tasks assets machine learning relevance pipelines cdp orchestration fluidity ryza dagster rudderstack

The PRQL: Does Machine Learning Need Its Own Orchestrator? Featuring Sandy Ryza of Dagster

The Data Stack Show

Play Episode Listen Later Jan 2, 2024 3:48

In this bonus episode, Eric and Kostas preview their upcoming conversation with Sandy Ryza of Dagster.

machine learning kostas orchestrator ryza dagster

Run Your Own Anomaly Detection For Your Critical Business Metrics With Anomstack

Data Engineering Podcast

Play Episode Listen Later Dec 11, 2023 49:51

Summary If your business metrics looked weird tomorrow, would you know about it first? Anomaly detection is focused on identifying those outliers for you, so that you are the first to know when a business critical dashboard isn't right. Unfortunately, it can often be complex or expensive to incorporate anomaly detection into your data platform. Andrew Maguire got tired of solving that problem for each of the different roles he has ended up in, so he created the open source Anomstack project. In this episode he shares what it is, how it works, and how you can start using it today to get notified when the critical metrics in your business aren't quite right. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It's the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it's real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize (https://www.dataengineeringpodcast.com/materialize) today to get 2 weeks free! Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. 
Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack (https://www.dataengineeringpodcast.com/rudderstack) Data projects are notoriously complex. With multiple stakeholders to manage across varying backgrounds and toolchains even simple reports can become unwieldy to maintain. Miro is your single pane of glass where everyone can discover, track, and collaborate on your organization's data. I especially like the ability to combine your technical diagrams with data documentation and dependency mapping, allowing your data engineers and data consumers to communicate seamlessly about your projects. Find simplicity in your most complex projects with Miro. Your first three Miro boards are free when you sign up today at dataengineeringpodcast.com/miro (https://www.dataengineeringpodcast.com/miro). That's three free boards at dataengineeringpodcast.com/miro (https://www.dataengineeringpodcast.com/miro). Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. 
Your host is Tobias Macey and today I'm interviewing Andrew Maguire about his work on the Anomstack project and how you can use it to run your own anomaly detection for your metrics Interview Introduction How did you get involved in the area of data management? Can you describe what Anomstack is and the story behind it? What are your goals for this project? What other tools/products might teams be evaluating while they consider Anomstack? In the context of Anomstack, what constitutes a "metric"? What are some examples of useful metrics that a data team might want to monitor? You put in a lot of work to make Anomstack as easy as possible to get started with. How did this focus on ease of adoption influence the way that you approached the overall design of the project? What are the core capabilities and constraints that you selected to provide the focus and architecture of the project? Can you describe how Anomstack is implemented? How have the design and goals of the project changed since you first started working on it? What are the steps to getting Anomstack running and integrated as part of the operational fabric of a data platform? What are the sharp edges that are still present in the system? What are the interfaces that are available for teams to customize or enhance the capabilities of Anomstack? What are the most interesting, innovative, or unexpected ways that you have seen Anomstack used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Anomstack? When is Anomstack the wrong choice? What do you have planned for the future of Anomstack? Contact Info LinkedIn (https://www.linkedin.com/in/andrewm4894/) Twitter (https://twitter.com/@andrewm4894) GitHub (http://github.com/andrewm4894) Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. 
Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story. To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers. Links Anomstack Github repo (http://github.com/andrewm4894/anomstack) Airflow Anomaly Detection Provider Github repo (https://github.com/andrewm4894/airflow-provider-anomaly-detection) Netdata (https://www.netdata.cloud/) Metric Tree (https://www.datacouncil.ai/talks/designing-and-building-metric-trees) Semantic Layer (https://en.wikipedia.org/wiki/Semantic_layer) Prometheus (https://prometheus.io/) Anodot (https://www.anodot.com/) Chaos Genius (https://www.chaosgenius.io/) Metaplane (https://www.metaplane.dev/) Anomalo (https://www.anomalo.com/) PyOD (https://pyod.readthedocs.io/) Airflow (https://airflow.apache.org/) DuckDB (https://duckdb.org/) Anomstack Gallery (https://github.com/andrewm4894/anomstack/tree/main/gallery) Dagster (https://dagster.io/) InfluxDB (https://www.influxdata.com/) TimeGPT (https://docs.nixtla.io/docs/timegpt_quickstart) Prophet (https://facebook.github.io/prophet/) GreyKite (https://linkedin.github.io/greykite/) OpenLineage (https://openlineage.io/) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra 
(http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)

Addressing The Challenges Of Component Integration In Data Platform Architectures

Data Engineering Podcast

Play Episode Listen Later Nov 27, 2023 29:42

Summary Building a data platform that is enjoyable and accessible for all of its end users is a substantial challenge. One of the core complexities that needs to be addressed is the fractal set of integrations that need to be managed across the individual components. In this episode Tobias Macey shares his thoughts on the challenges that he is facing as he prepares to build the next set of architectural layers for his data platform to enable a larger audience to start accessing the data being managed by his team. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack (https://www.dataengineeringpodcast.com/rudderstack) You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It's the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it's real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize (https://www.dataengineeringpodcast.com/materialize) today to get 2 weeks free! 
Developing event-driven pipelines is going to be a lot easier - Meet Functions! Memphis functions enable developers and data engineers to build an organizational toolbox of functions to process, transform, and enrich ingested events “on the fly” in a serverless manner using AWS Lambda syntax, without boilerplate, orchestration, error handling, and infrastructure in almost any language, including Go, Python, JS, .NET, Java, SQL, and more. Go to dataengineeringpodcast.com/memphis (https://www.dataengineeringpodcast.com/memphis) today to get started! Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Your host is Tobias Macey and today I'll be sharing an update on my own journey of building a data platform, with a particular focus on the challenges of tool integration and maintaining a single source of truth.
Interview
Introduction
How did you get involved in the area of data management?
data sharing
weight of history
existing integrations with dbt
switching cost for e.g. SQLMesh
de facto standard of Airflow
Single source of truth
permissions management across application layers
Database engine
Storage layer in a lakehouse
Presentation/access layer (BI)
Data flows
dbt -> table level lineage
orchestration engine -> pipeline flows
task based vs. asset based
Metadata platform as the logical place for horizontal view
Contact Info
LinkedIn (https://linkedin.com/in/tmacey)
Website (https://www.dataengineeringpodcast.com)
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story. 
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers Links Monologue Episode On Data Platform Design (https://www.dataengineeringpodcast.com/data-platform-design-episode-268) Monologue Episode On Leaky Abstractions (https://www.dataengineeringpodcast.com/abstractions-and-technical-debt-episode-374) Airbyte (https://airbyte.com/) Podcast Episode (https://www.dataengineeringpodcast.com/airbyte-open-source-data-integration-episode-173/) Trino (https://trino.io/) Dagster (https://dagster.io/) dbt (https://www.getdbt.com/) Snowflake (https://www.snowflake.com/en/) BigQuery (https://cloud.google.com/bigquery) OpenMetadata (https://open-metadata.org/) OpenLineage (https://openlineage.io/) Data Platform Shadow IT Episode (https://www.dataengineeringpodcast.com/shadow-it-data-analytics-episode-121) Preset (https://preset.io/) LightDash (https://www.lightdash.com/) Podcast Episode (https://www.dataengineeringpodcast.com/lightdash-exploratory-business-intelligence-episode-232/) SQLMesh (https://sqlmesh.readthedocs.io/) Podcast Episode (https://www.dataengineeringpodcast.com/sqlmesh-open-source-dataops-episode-380) Airflow (https://airflow.apache.org/) Spark (https://spark.apache.org/) Flink (https://flink.apache.org/) Tabular (https://tabular.io/) Iceberg (https://iceberg.apache.org/) Open Policy Agent (https://www.openpolicyagent.org/) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)

The Journey from Engineer to CEO and Lessons Learned Along the Way with Pete Hunt

That Tech Pod

Play Episode Listen Later Nov 14, 2023 29:41

Today Laura and Kevin speak with Pete Hunt. We chat about what it means to be an engineer, how to tell if an engineer is good and how you get from engineer to CEO, plus Pete weighs in on the Zuck vs Musk debate with his unique perspective having worked at both Instagram and Twitter. Pete joined Elementl as head of engineering in early 2022, and took over the reins as CEO in November of that year. Pete was previously co-founder and CEO of Smyte, an anti-abuse provider that was acquired by Twitter. Prior to this Pete led Instagram's web team, built Instagram's business analytics products, and helped to open source Facebook's React.js. Their platform is Dagster, a next-generation open source orchestration platform for the development, production, and observation of data assets.

ceo elon musk engineers react along the way zuck lessons learned along dagster pete hunt smyte

Pete Hunt, CEO of Elementl/Dagster

The Craft Of Open Source

Play Episode Listen Later Oct 31, 2023 44:12

Data is the name of the game in today's world. But with the number of data sources today, how do you sift through and get the data you want? A data pipeline is the answer, and within that pipeline is a data orchestrator. Today's guest, Pete Hunt, is the CEO of Elementl, the company behind the open-source orchestration platform, Dagster. He joins Ben Rometsch to tell us all about Elementl and Dagster as well as his career journey that took him across Facebook, Instagram, Smyte, and Twitter. In this current fluctuating environment, we can say it is a feat for a company to be able to raise money. Just this year, Elementl was able to raise a $33 million Series B for Dagster. Find out how they were able to achieve this, what they are doing for data orchestration, and where they are heading in the future. Tune in to this episode to not miss out!


How to Work Effectively With Your Data Teams With Nick Schrock, Founder and CTO of Dagster Labs

Data Unlocked

Play Episode Listen Later Oct 31, 2023 28:52

In this week's episode of Data Unlocked, Jason sits down with Nick Schrock, the founder and CTO of Dagster Labs. With years of experience in the engineering space, Nick has worked for some of the biggest names in the industry, such as Microsoft, Facebook, and Care Evolution. In 2018, Nick left all of that behind to found his own company, Dagster Labs, where he is also CTO. Dagster Labs is a data orchestration platform built for productivity. Dagster, their open-source product, is primarily used by data engineering teams to make sure that data is properly transformed for marketing and business stakeholders. On Data Unlocked, we've brought in a lot of marketers to talk about their relationship with data teams and how they drive that relationship. But in this episode, Nick and Jason will be discussing that relationship from the other side of the house to answer three main questions: Are you working effectively with your data teams? What does great collaboration look like? And what steps do you need to take to make cross-functional collaboration as effective as possible? Are you ready? Let's dive in.
Key Takeaways:
Intro (00:00)
Meet Nick (00:44)
Let's go back in history (02:37)
How to have better continuity and data access (11:00)
Is your data good enough? (14:35)
The data quality problem (22:05)
Who would you have this conversation with again? (23:07)
Additional Resources: Get in contact with Nick here. Learn more about Dagster Labs here. Learn more about us here. Follow us on LinkedIn, Twitter, and Instagram. If you enjoyed this episode, please follow, rate, and leave a review on your favorite podcast platform!


Open source data orchestration

The Tech Trek

Play Episode Listen Later Oct 17, 2023 23:26

In this episode, Amir Bormand interviews Pete Hunt, the CEO of Dagster Labs. They discuss the open-source nature of Dagster, a product that helps businesses with data orchestration. They explore the product's benefits, the challenges in the data orchestration market, and why Dagster Labs decided to open-source their product. Pete shares his background in open source and the importance of data pipelines in making sense of messy data. Tune in to learn more about how Dagster is revolutionizing the data industry. Highlights: [00:01:02] Building with data in businesses. [00:04:08] Data hygiene in organizations. [00:08:09] Building multi-tenancy from day one. [00:14:14] Data pipeline unpredictability. [00:18:00] Open source mentality. [00:21:10] Open source led business models. [00:23:05] Open source pricing strategy. Guest: Pete joined Dagster Labs as Head of Engineering in early 2022 and took over the reins as CEO in November of that year. Pete was previously co-founder and CEO of Smyte, an anti-abuse provider that Twitter acquired. Before this, Pete led Instagram's web team, built Instagram's business analytics products, and helped to open-source Facebook's React.js. Connect with Pete: https://twitter.com/floydophone https://www.linkedin.com/in/pwhunt/


158: The Orchestration Layer as the Data Platform Control Plane With Nick Schrock of Dagster Labs

The Data Stack Show

Play Episode Listen Later Oct 4, 2023 62:18

Highlights from this week's conversation include:
Nick's background and journey in data (2:28)
Founding Dagster Labs (7:50)
The evolution of data engineering (12:32)
Fragmentation in data infrastructure (15:04)
The role of orchestration in data platforms (19:53)
The importance of operational tools for data pipelines (25:01)
Lessons learned from working with GraphQL (26:19)
The role of the orchestrator in data engineering (34:51)
The boundaries between data infrastructure and product engineering (37:33)
Different orchestrators in the data infrastructure landscape (42:03)
The role of MLOps in data engineering (46:04)
Data Quality and Orchestration (51:04)
Future of Data Teams and Orchestration (54:27)
Final thoughts and takeaways (58:01)
The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data. RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.


The PRQL: The Power of Data Orchestration: A Game-Changer for Data Infrastructure, Featuring Nick Schrock of Dagster Labs

The Data Stack Show

Play Episode Listen Later Oct 2, 2023 3:29

In this bonus episode, Eric and Kostas preview their upcoming conversation with Nick Schrock of Dagster Labs.


E105: Bringing Great Developer Experience to Data Teams with Dagster

Open Source Startup Podcast

Play Episode Listen Later Sep 21, 2023 45:19

Nick Schrock is Founder of Dagster Labs & Creator of Dagster - the open source orchestration platform for the development, production, and observation of data assets. Dagster Labs has raised just under $50M from investors including Sequoia, Index, and Georgian Partners. In this episode, we discuss how Dagster is bringing software engineering principles to the data space, what a great developer experience means for data engineers, how to think about launching the cloud version of your open source project & much more!


S8 Bonus: Pete Hunt, Dagster

Code Story

Play Episode Listen Later Sep 13, 2023 32:55

Pete Hunt grew up in NE Massachusetts, which he mentions was culturally New Hampshire. He wasn't into hockey, but did a lot of swimming, in particular the 200m butterfly. He has a 2-year-old daughter, and loves to play guitar in his cover band. Pete was one of the founding team members of React, and his cofounder, Nick, was one of the creators of GraphQL. Post Facebook, they wanted to figure out what was next, and wanted to build something impactful. After interviewing some folks, he realized that managing data and data pipelines was a challenge that needed to be solved. This is the creation story of Dagster.
Sponsors: Cipherstash, Treblle, CAST AI, Firefly, Turso, Memberstack
Links
Website: https://dagster.io/
LinkedIn: https://www.linkedin.com/in/pwhunt/
Support this podcast at: https://redcircle.com/code-story/donations
Advertising Inquiries: https://redcircle.com/brands
Privacy & Opt-Out: https://redcircle.com/privacy


An Overview Of The State Of Data Orchestration In An Increasingly Complex Data Ecosystem

Data Engineering Podcast

Play Episode Listen Later Sep 10, 2023 61:25

Summary Data systems are inherently complex and often require integration of multiple technologies. Orchestrators are centralized utilities that control the execution and sequencing of interdependent operations. This offers a single location for managing visibility and error handling so that data platform engineers can manage complexity. In this episode Nick Schrock, creator of Dagster, shares his perspective on the state of data orchestration technology and its application to help inform its implementation in your environment. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack (https://www.dataengineeringpodcast.com/rudderstack) This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. 
Learn more about Datafold by visiting dataengineeringpodcast.com/datafold (https://www.dataengineeringpodcast.com/datafold) You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It's the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it's real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize (https://www.dataengineeringpodcast.com/materialize) today to get 2 weeks free! Your host is Tobias Macey and today I'm welcoming back Nick Schrock to talk about the state of the ecosystem for data orchestration Interview Introduction How did you get involved in the area of data management? Can you start by defining what data orchestration is and how it differs from other types of orchestration systems? (e.g. container orchestration, generalized workflow orchestration, etc.) What are the misconceptions about the applications of/need for/cost to implement data orchestration? How do those challenges of customer education change across roles/personas? Because of the multi-faceted nature of data in an organization, how does that influence the capabilities and interfaces that are needed in an orchestration engine? You have been working on Dagster for five years now. How have the requirements/adoption/application for orchestrators changed in that time? One of the challenges for any orchestration engine is to balance the need for robust and extensible core capabilities with a rich suite of integrations to the broader data ecosystem. 
What are the factors that you have seen have the most influence in driving adoption of a given engine? What are the most interesting, innovative, or unexpected ways that you have seen data orchestration implemented and/or used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on data orchestration? When is a data orchestrator the wrong choice? What do you have planned for the future of orchestration with Dagster? Contact Info @schrockn (https://twitter.com/schrockn) on Twitter LinkedIn (https://www.linkedin.com/in/schrockn) Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story. 
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers Links Dagster (https://dagster.io/) GraphQL (https://graphql.org/) K8s == Kubernetes (https://kubernetes.io/) Airbyte (https://airbyte.com/) Podcast Episode (https://www.dataengineeringpodcast.com/airbyte-open-source-data-integration-episode-173/) Hightouch (https://hightouch.com/) Podcast Episode (https://www.dataengineeringpodcast.com/hightouch-customer-data-warehouse-episode-168/) Airflow (https://airflow.apache.org/) Prefect (https://www.prefect.io) Flyte (https://flyte.org/) Podcast Episode (https://www.dataengineeringpodcast.com/flyte-data-orchestration-machine-learning-episode-291/) dbt (https://www.getdbt.com/) Podcast Episode (https://www.dataengineeringpodcast.com/dbt-data-analytics-episode-81/) DAG == Directed Acyclic Graph (https://en.wikipedia.org/wiki/Directed_acyclic_graph) Temporal (https://temporal.io/) Software Defined Assets (https://docs.dagster.io/concepts/assets/software-defined-assets) DataForm (https://dataform.co/) Gradient Flow State Of Orchestration Report 2022 (https://gradientflow.com/2022-workflow-orchestration-survey/) MLOps Is 98% Data Engineering (https://mlops.community/mlops-is-mostly-data-engineering/) DataHub (https://datahubproject.io/) Podcast Episode (https://www.dataengineeringpodcast.com/datahub-metadata-management-episode-147/) OpenMetadata (https://open-metadata.org/) Podcast Episode (https://www.dataengineeringpodcast.com/openmetadata-universal-metadata-layer-episode-237/) Atlan (https://atlan.com/) Podcast Episode (https://www.dataengineeringpodcast.com/atlan-data-team-collaboration-episode-179/) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra 
(http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)

Eliminate The Overhead In Your Data Integration With The Open Source dlt Library

Data Engineering Podcast

Play Episode Listen Later Sep 4, 2023 42:12

Summary Cloud data warehouses and the introduction of the ELT paradigm have led to the creation of multiple options for flexible data integration, with a roughly equal distribution of commercial and open source options. The challenge is that most of those options are complex to operate and exist in their own silo. The dlt project was created to eliminate overhead and bring data integration into your full control as a library component of your overall data system. In this episode Adrian Brudaru explains how it works, the benefits that it provides over other data integration solutions, and how you can start building pipelines today. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack (https://www.dataengineeringpodcast.com/rudderstack) You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It's the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it's real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results, all in a familiar SQL interface. 
Go to dataengineeringpodcast.com/materialize (https://www.dataengineeringpodcast.com/materialize) today to get 2 weeks free! This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold (https://www.dataengineeringpodcast.com/datafold) Your host is Tobias Macey and today I'm interviewing Adrian Brudaru about dlt, an open source python library for data loading Interview Introduction How did you get involved in the area of data management? Can you describe what dlt is and the story behind it? What is the problem you want to solve with dlt? Who is the target audience? The obvious comparison is with systems like Singer/Meltano/Airbyte in the open source space, or Fivetran/Matillion/etc. in the commercial space. What are the complexities or limitations of those tools that leave an opening for dlt? Can you describe how dlt is implemented? What are the benefits of building it in Python? How have the design and goals of the project changed since you first started working on it? How does that language choice influence the performance and scaling characteristics? What problems do users solve with dlt? What are the interfaces available for extending/customizing/integrating with dlt? Can you talk through the process of adding a new source/destination? 
What is the workflow for someone building a pipeline with dlt? How does the experience scale when supporting multiple connections? Given the limited scope of extract and load, and the composable design of dlt, it seems like a purpose-built companion to dbt (down to the naming). What are the benefits of using those tools in combination? What are the most interesting, innovative, or unexpected ways that you have seen dlt used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on dlt? When is dlt the wrong choice? What do you have planned for the future of dlt? Contact Info LinkedIn (https://www.linkedin.com/in/data-team/?originalSubdomain=de) Join our community to discuss further (https://join.slack.com/t/dlthub-community/shared_invite/zt-1slox199h-HAE7EQoXmstkP_bTqal65g) Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story. 
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers Links dlt (https://dlthub.com/) Harness Success Story (https://dlthub.com/success-stories/harness/) Our guiding product principles (https://dlthub.com/product/) Ecosystem support (https://dlthub.com/docs/dlt-ecosystem) From basic to complex, dlt has many capabilities (https://dlthub.com/docs/getting-started/build-a-data-pipeline) Singer (https://www.singer.io/) Airbyte (https://airbyte.com/) Podcast Episode (https://www.dataengineeringpodcast.com/airbyte-open-source-data-integration-episode-173/) Meltano (https://meltano.com/) Podcast Episode (https://www.dataengineeringpodcast.com/meltano-data-integration-episode-141/) Matillion (https://www.matillion.com/) Podcast Episode (https://www.dataengineeringpodcast.com/matillion-cloud-data-integration-episode-286/) Fivetran (https://www.fivetran.com/) Podcast Episode (https://www.dataengineeringpodcast.com/fivetran-data-replication-episode-93/) DuckDB (https://duckdb.org/) Podcast Episode (https://www.dataengineeringpodcast.com/duckdb-in-process-olap-database-episode-270/) OpenAPI (https://www.openapis.org/) Data Mesh (https://martinfowler.com/articles/data-monolith-to-mesh.html) Podcast Episode (https://www.dataengineeringpodcast.com/data-mesh-revisited-episode-250/) SQLMesh (https://sqlmesh.com/) Podcast Episode (https://www.dataengineeringpodcast.com/sqlmesh-open-source-dataops-episode-380) Airflow (https://airflow.apache.org/) Dagster (https://dagster.io/) Podcast Episode (https://www.dataengineeringpodcast.com/dagster-data-platform-big-complexity-episode-239/) Prefect (https://www.prefect.io/) Podcast Episode (https://www.dataengineeringpodcast.com/prefect-workflow-engine-episode-86/) Alto (https://github.com/z3z1ma/alto) The intro and outro music is from The Hug 
(http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)

122: Engineering Hard Choices with Pete Hunt | The Face of Facebook's React.js | CEO & Co-Founder of Twitter-acquired Startup "Smyte"

The Happy Engineer

Play Episode Listen Later Aug 24, 2023 54:01

Imagine you work at Instagram when there are “only” 100 million users, and you need to make a technology decision that will help scale to 2 BILLION and beyond. What do you do? Listen now for this amazing answer and more! ============================ When you're ready, here are three ways I can help you build your engineering career: 1. Grab my eBook … 49 tips you can apply immediately to stand out and move up, without any fancy degrees or certifications. 2. Join us at Happy Hour … my LIVE monthly workshop where we dig deep into career growth strategies and provide 1:1 open coaching for you at the end of the session. 3. Apply for the Lifestyle Engineering Blueprint™️ … get a free Career Growth Audit™️ and work with me and my team privately in our intensive coaching program, exclusively for engineering leaders. ============================ In this episode, we meet a leader who has gone from engineer, to CEO, then back to engineer, and back to CEO again, Pete Hunt. If you want to make an impact through technology, and can't decide if Fortune 100 or Inc. 1000 is right for you, or you struggle to make hard choices in engineering your career… then you are going to love learning from Pete. In his early career, Pete led Instagram's web team, built Instagram's business analytics products, and helped to open source Facebook's React.js (you can find him speaking at conferences around the world on YouTube). After that, he co-founded and served as CEO of Smyte, an anti-abuse provider that was acquired by Twitter. During these experiences he discovered the keys to building and leading a data engineering team, and making hard choices like selecting an app's tech stack. Now, Pete is the CEO of Elementl which builds Dagster, an open-source data orchestration tool. So press play and let's chat… if 2 billion users are benefitting from Pete's hard choices, you can too. 
============================ HAPPY ENGINEER COMMUNITY LINKS: > Full Show Notes, Resources, & More > Join our Facebook Group! Get access to bonus content and live coaching as growth-minded leaders build careers together. ============================ WANT MORE AMAZING GUESTS? “I love Zach and these amazing guests on The Happy Engineer Podcast.” If that sounds like you, please consider following, rating and reviewing the show! I know it's a huge favor to ask, but when you follow, leave a 5-star rating, and add an honest review of how these episodes are helping you… it's a massive benefit for getting the attention of big name powerhouse guests on this show. On Apple Podcasts, click our show, scroll to the bottom, tap to rate with 5-stars, and select “Write a Review.” Thank you so much. ============================ Connect with your host, Zach White: LinkedIn (primary) Instagram Facebook YouTube


Data Orchestration, Dagster, and parallels to React.js with Pete Hunt – CEO of Elementl

What's New In Data

Play Episode Listen Later Aug 3, 2023 35:20

Pete Hunt is the CEO of Elementl, the company behind Dagster, a popular data orchestration framework. Pete is also well known for his leadership in developing React.js, which transformed the way modern front-end applications are built using functional programming and asset-aware re-rendering. Pete talks about how data orchestration can also be optimized to be aware of the assets it is computing. Why is this important? As data teams scale the delivery of production data products to the business, efficient compute is critical to meet business SLAs while managing cost and underlying resources. Follow Pete Hunt on LinkedIn. Follow Pete Hunt on Twitter. Learn more about Elementl. What's New In Data is a data thought leadership series hosted by John Kutay, who leads data and products at Striim. What's New In Data hosts industry practitioners to discuss the latest trends, common real-world data patterns, and analytics success stories.


Drill to Detail Ep.109 'Dagster, Orchestration and Software-Defined Assets' with Special Guest Nick Schrock

Drill to Detail

Play Episode Listen Later Aug 3, 2023 53:42

Mark Rittman is joined in this episode by returning guest and Elementl Founder Nick Schrock to talk about Dagster's role in the modern data stack ecosystem and software-defined assets, a new, declarative approach to managing data and orchestrating its maintenance.
Introducing Software-Defined Assets
Rethinking Orchestration as Reconciliation: Software-Defined Assets in Dagster
Optimizing Data Materialization Using Dagster's Policies
How I use Dagster to orchestrate the production of social science data assets


Strategies For A Successful Data Platform Migration

Data Engineering Podcast

Play Episode Listen Later Jul 31, 2023 69:52

Summary All software systems are in a constant state of evolution. This makes it impossible to select a truly future-proof technology stack for your data platform, making an eventual migration inevitable. In this episode Gleb Mezhanskiy and Rob Goretsky share their experiences leading various data platform migrations, and the hard-won lessons that they learned so that you don't have to. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack (https://www.dataengineeringpodcast.com/rudderstack) Modern data teams are using Hex to 10x their data impact. Hex combines a notebook style UI with an interactive report builder. This allows data teams to both dive deep to find insights and then share their work in an easy-to-read format to the whole org. In Hex you can use SQL, Python, R, and no-code visualization together to explore, transform, and model data. Hex also has AI built directly into the workflow to help you generate, edit, explain and document your code. The best data teams in the world such as the ones at Notion, AngelList, and Anthropic use Hex for ad hoc investigations, creating machine learning models, and building operational dashboards for the rest of their company. Hex makes it easy for data analysts and data scientists to collaborate together and produce work that has an impact. Make your data team unstoppable with Hex. Sign up today at dataengineeringpodcast.com/hex (https://www.dataengineeringpodcast.com/hex) to get a 30-day free trial for your team! 
Your host is Tobias Macey and today I'm interviewing Gleb Mezhanskiy and Rob Goretsky about when and how to think about migrating your data stack Interview Introduction How did you get involved in the area of data management? A migration can be anything from a minor task to a major undertaking. Can you start by describing what constitutes a migration for the purposes of this conversation? Is it possible to completely avoid having to invest in a migration? What are the signals that point to the need for a migration? What are some of the sources of cost that need to be accounted for when considering a migration? (both in terms of doing one, and the costs of not doing one) What are some signals that a migration is not the right solution for a perceived problem? Once the decision has been made that a migration is necessary, what are the questions that the team should be asking to determine the technologies to move to and the sequencing of execution? What are the preceding tasks that should be completed before starting the migration to ensure there is no breakage downstream of the changing component(s)? What are some of the ways that a migration effort might fail? What are the major pitfalls that teams need to be aware of as they work through a data platform migration? What are the opportunities for automation during the migration process? What are the most interesting, innovative, or unexpected ways that you have seen teams approach a platform migration? What are the most interesting, unexpected, or challenging lessons that you have learned while working on data platform migrations? What are some ways that the technologies and patterns that we use can be evolved to reduce the cost/impact/need for migrations? 
Contact Info Gleb LinkedIn (https://www.linkedin.com/in/glebmezh/) @glebmm (https://twitter.com/glebmm) on Twitter Rob LinkedIn (https://www.linkedin.com/in/robertgoretsky/) RobGoretsky (https://github.com/RobGoretsky) on GitHub Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story. 
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers Links Datafold (https://www.datafold.com/) Podcast Episode (https://www.dataengineeringpodcast.com/datafold-proactive-data-quality-episode-205/) Informatica (https://www.informatica.com/) Airflow (https://airflow.apache.org/) Snowflake (https://www.snowflake.com/en/) Podcast Episode (https://www.dataengineeringpodcast.com/snowflakedb-cloud-data-warehouse-episode-110/) Redshift (https://aws.amazon.com/redshift/) Eventbrite (https://www.eventbrite.com/) Teradata (https://www.teradata.com/) BigQuery (https://cloud.google.com/bigquery) Trino (https://trino.io/) EMR == Elastic Map-Reduce (https://aws.amazon.com/emr/) Shadow IT (https://en.wikipedia.org/wiki/Shadow_IT) Podcast Episode (https://www.dataengineeringpodcast.com/shadow-it-data-analytics-episode-121) Mode Analytics (https://mode.com/) Looker (https://cloud.google.com/looker/) Sunk Cost Fallacy (https://en.wikipedia.org/wiki/Sunk_cost) data-diff (https://github.com/datafold/data-diff) Podcast Episode (https://www.dataengineeringpodcast.com/data-diff-open-source-data-integration-validation-episode-303/) SQLGlot (https://github.com/tobymao/sqlglot) Dagster (https://dagster.io/) dbt (https://www.getdbt.com/) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)

Build Better Tests For Your dbt Projects With Datafold And data-diff

Data Engineering Podcast

Jun 11, 2023 48:21

Summary Data engineering is all about building workflows, pipelines, systems, and interfaces to provide stable and reliable data. Your data can be stable and wrong, but then it isn't reliable. Confidence in your data is achieved through constant validation and testing. Datafold has invested a lot of time into integrating with the workflow of dbt projects to add early verification that the changes you are making are correct. In this episode Gleb Mezhanskiy shares some valuable advice and insights into how you can build reliable and well-tested data assets with dbt and data-diff. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudderstack (https://www.dataengineeringpodcast.com/rudderstack) Your host is Tobias Macey and today I'm interviewing Gleb Mezhanskiy about how to test your dbt projects with Datafold Interview Introduction How did you get involved in the area of data management? Can you describe what Datafold is and what's new since we last spoke? (July 2021 and July 2022 about data-diff) What are the roadblocks to data testing/validation that you see teams run into most often? How does the tooling used contribute to/help address those roadblocks? What are some of the error conditions/failure modes that data-diff can help identify in a dbt project? What are some examples of tests that need to be implemented by the engineer? 
In your experience working with data teams, what typically constitutes the "staging area" for a dbt project? (e.g. separate warehouse, namespaced tables, snowflake data copies, lakefs, etc.) Given a dbt project that is well tested and has data-diff as part of the validation suite, what are the challenges that teams face in managing the feedback cycle of running those tests? In application development there is the idea of the "testing pyramid", consisting of unit tests, integration tests, system tests, etc. What are the parallels to that in data projects? What are the limitations of the data ecosystem that make testing a bigger challenge than it might otherwise be? Beyond test execution, what are the other aspects of data health that need to be included in the development and deployment workflow of dbt projects? (e.g. freshness, time to delivery, etc.) What are the most interesting, innovative, or unexpected ways that you have seen Datafold and/or data-diff used for testing dbt projects? What are the most interesting, unexpected, or challenging lessons that you have learned while working on dbt testing internally or with your customers? When is Datafold/data-diff the wrong choice for dbt projects? What do you have planned for the future of Datafold? Contact Info LinkedIn (https://www.linkedin.com/in/glebmezh/) Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com)) with your story. To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Links Datafold (https://www.datafold.com/) Podcast Episode (https://www.dataengineeringpodcast.com/datafold-proactive-data-quality-episode-205/) data-diff (https://github.com/datafold/data-diff) Podcast Episode (https://www.dataengineeringpodcast.com/data-diff-open-source-data-integration-validation-episode-303/) dbt (https://www.getdbt.com/) Dagster (https://dagster.io/) dbt-cloud slim CI (https://docs.getdbt.com/blog/intelligent-slim-ci) GitHub Actions (https://github.com/features/actions) Jenkins (https://www.jenkins.io/) Circle CI (https://circleci.com/) Dolt (https://github.com/dolthub/dolt) Malloy (https://github.com/malloydata/malloy) LakeFS (https://lakefs.io/) Planetscale (https://planetscale.com/) Snowflake Zero Copy Cloning (https://www.youtube.com/watch?v=uGCpwoQOQzQ) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/) Special Guest: Gleb Mezhanskiy.


What Happens When The Abstractions Leak On Your Data

Data Engineering Podcast

May 15, 2023 26:41

Summary All of the advancements in our technology are based around the principles of abstraction. These are valuable until they break down, which is an inevitable occurrence. In this episode the host Tobias Macey shares his reflections on recent experiences where the abstractions leaked and some observations on how to deal with that situation in a data platform architecture. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudderstack (https://www.dataengineeringpodcast.com/rudderstack) Your host is Tobias Macey and today I'm sharing some thoughts and observations about abstractions and impedance mismatches from my experience building a data lakehouse with an ELT workflow Interview Introduction impact of community tech debt hive metastore new work being done but not widely adopted tensions between automation and correctness data type mapping integer types complex types naming things (keys/column names from APIs to databases) disaggregated databases - pros and cons flexibility and cost control not as much tooling invested vs. Snowflake/BigQuery/Redshift data modeling dimensional modeling vs. answering today's questions What are the most interesting, unexpected, or challenging lessons that you have learned while working on your data platform? When is ELT the wrong choice? What do you have planned for the future of your data platform?
Contact Info LinkedIn (https://www.linkedin.com/in/tmacey/) Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com)) with your story. To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers Links dbt (https://www.getdbt.com/) Airbyte (https://airbyte.com/) Podcast Episode (https://www.dataengineeringpodcast.com/airbyte-open-source-data-integration-episode-173/) Dagster (https://dagster.io/) Podcast Episode (https://www.dataengineeringpodcast.com/dagster-software-defined-assets-data-orchestration-episode-309/) Trino (https://trino.io/) Podcast Episode (https://www.dataengineeringpodcast.com/presto-distributed-sql-episode-149/) ELT (https://en.wikipedia.org/wiki/Extract,_load,_transform) Data Lakehouse (https://www.forbes.com/sites/bernardmarr/2022/01/18/what-is-a-data-lakehouse-a-super-simple-explanation-for-anyone/?sh=5c0e333f6088) Snowflake (https://www.snowflake.com/en/) BigQuery (https://cloud.google.com/bigquery) Redshift (https://aws.amazon.com/redshift/) Technical Debt (https://en.wikipedia.org/wiki/Technical_debt) Hive Metastore 
(https://cwiki.apache.org/confluence/display/Hive/AdminManual+Metastore+Administration) AWS Glue (https://aws.amazon.com/glue/) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)


Pete Hunt: The No-Fun Rule

Full Stack Whatever

Feb 21, 2023 67:18

Pete is a software engineer, startup founder, and one of the creators of React. We talked through his career, from Facebook, to Instagram, to starting his company Smyte, to their acquisition by Twitter, and now his new role at Elementl. You'll find a ton of never previously shared stories and tidbits in this one.


Reflecting On The Past 6 Years Of Data Engineering

Data Engineering Podcast

Feb 6, 2023 32:21

Summary This podcast started almost exactly six years ago, and the technology landscape was much different than it is now. In that time there have been a number of generational shifts in how data engineering is done. In this episode I reflect on some of the major themes and take a brief look forward at some of the upcoming changes. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management Your host is Tobias Macey and today I'm reflecting on the major trends in data engineering over the past 6 years Interview Introduction 6 years of running the Data Engineering Podcast Around the first time that data engineering was discussed as a role Followed on from hype about "data science" Hadoop era Streaming Lambda and Kappa architectures Not really referenced anymore "Big Data" era of capture everything has shifted to focusing on data that presents value Regulatory environment increases risk, better tools introduce more capability to understand what data is useful Data catalogs Amundsen and Alation Orchestration engine Oozie, etc. -> Airflow and Luigi -> Dagster, Prefect, Lyft, etc. Orchestration is now a part of most vertical tools Cloud data warehouses Data lakes DataOps and MLOps Data quality to data observability Metadata for everything Data catalog -> data discovery -> active metadata Business intelligence Read only reports to metric/semantic layers Embedded analytics and data APIs Rise of ELT dbt Corresponding introduction of reverse ETL What are the most interesting, unexpected, or challenging lessons that you have learned while working on running the podcast? What do you have planned for the future of the podcast? Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. 
Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story. To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)


Automate Your Pipeline Creation For Streaming Data Transformations With SQLake

Data Engineering Podcast

Jan 8, 2023 44:05

Summary Managing end-to-end data flows becomes complex and unwieldy as the scale of data and its variety of applications in an organization grows. Part of this complexity is due to the transformation and orchestration of data living in disparate systems. The team at Upsolver is taking aim at this problem with the latest iteration of their platform in the form of SQLake. In this episode Ori Rafael explains how they are automating the creation and scheduling of orchestration flows and their related transformations in a unified SQL interface. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management Data and analytics leaders, 2023 is your year to sharpen your leadership skills, refine your strategies and lead with purpose. Join your peers at Gartner Data & Analytics Summit, March 20 – 22 in Orlando, FL for 3 days of expert guidance, peer networking and collaboration. Listeners can save $375 off standard rates with code GARTNERDA. Go to dataengineeringpodcast.com/gartnerda (https://www.dataengineeringpodcast.com/gartnerda) today to find out more. Truly leveraging and benefiting from streaming data is hard - the data stack is costly, difficult to use and still has limitations. Materialize breaks down those barriers with a true cloud-native streaming database - not simply a database that connects to streaming systems. With a PostgreSQL-compatible interface, you can now work with real-time data using ANSI SQL including the ability to perform multi-way complex joins, which support stream-to-stream, stream-to-table, table-to-table, and more, all in standard SQL. Go to dataengineeringpodcast.com/materialize (https://www.dataengineeringpodcast.com/materialize) today and sign up for early access to get started. If you like what you see and want to help make it better, they're hiring (https://materialize.com/careers/) across all functions! Struggling with broken pipelines? Stale dashboards? Missing data?
If this resonates with you, you're not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo (http://www.dataengineeringpodcast.com/montecarlo) to learn more. Your host is Tobias Macey and today I'm interviewing Ori Rafael about the SQLake feature for the Upsolver platform that automatically generates pipelines from your queries Interview Introduction How did you get involved in the area of data management? Can you describe what the SQLake product is and the story behind it? What is the core problem that you are trying to solve? What are some of the anti-patterns that you have seen teams adopt when designing and implementing DAGs in a tool such as Airflow? What are the benefits of merging the logic for transformation and orchestration into the same interface and dialect (SQL)? Can you describe the technical implementation of the SQLake feature? What does the workflow look like for designing and deploying pipelines in SQLake? What are the opportunities for using utilities such as dbt for managing logical complexity as the number of pipelines scales? SQL has traditionally been challenging to compose. How did that factor into your design process for how to structure the dialect extensions for job scheduling?
What are some of the complexities that you have had to address in your orchestration system to be able to manage timeliness of operations as volume and complexity of the data scales? What are some of the edge cases that you have had to provide escape hatches for? What are the most interesting, innovative, or unexpected ways that you have seen SQLake used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on SQLake? When is SQLake the wrong choice? What do you have planned for the future of SQLake? Contact Info LinkedIn (https://www.linkedin.com/in/ori-rafael-91723344/) Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com)) with your story. 
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers Links Upsolver (https://www.upsolver.com/) Podcast Episode (https://www.dataengineeringpodcast.com/upsolver-streaming-data-integration-episode-240/) SQLake (https://docs.upsolver.com/sqlake/) Airflow (https://airflow.apache.org/) Dagster (https://dagster.io/) Podcast Episode (https://www.dataengineeringpodcast.com/dagster-software-defined-assets-data-orchestration-episode-309/) Prefect (https://www.prefect.io/) Podcast Episode (https://www.dataengineeringpodcast.com/prefect-workflow-engine-episode-86/) Flyte (https://flyte.org/) Podcast Episode (https://www.dataengineeringpodcast.com/flyte-data-orchestration-machine-learning-episode-291/) GitHub Actions (https://github.com/features/actions) dbt (https://www.getdbt.com/) Podcast Episode (https://www.dataengineeringpodcast.com/dbt-data-analytics-episode-81/) PartiQL (https://partiql.org/) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)

An Exploration Of Tobias' Experience In Building A Data Lakehouse From Scratch

Data Engineering Podcast

Dec 26, 2022 71:59

Summary Five years of hosting the Data Engineering Podcast has provided Tobias Macey with a wealth of insight into the work of building and operating data systems at a variety of scales and for myriad purposes. In order to condense that acquired knowledge into a format that is useful to everyone Scott Hirleman turns the tables in this episode and asks Tobias about the tactical and strategic aspects of his experiences applying those lessons to the work of building a data platform from scratch. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode (https://www.dataengineeringpodcast.com/linode) today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show! Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan's active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. 
Go to dataengineeringpodcast.com/atlan (https://www.dataengineeringpodcast.com/atlan) today to learn more about how Atlan's active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos. Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you're not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo (http://www.dataengineeringpodcast.com/montecarlo) to learn more. Your host is Tobias Macey and today I'm being interviewed by Scott Hirleman about my work on the podcasts and my experience building a data platform Interview Introduction How did you get involved in the area of data management? Data platform building journey Why are you building, who are the users/use cases How to focus on doing what matters over cool tools How to build a good UX Anything surprising or did you discover anything you didn't expect at the start How to build so it's modular and can be improved in the future General build vs buy and vendor selection process Obviously have a good BS detector - how can others build theirs So many tools, where do you start - capability need, vendor suite offering, etc. 
Anything surprising in doing much of this at once How do you think about TCO in build versus buy Any advice Guest call out Be brave, believe you are good enough to be on the show Look at past episodes and don't pitch the same as what's been on recently And vendors, be smart, work with your customers to come up with a good pitch for them as guests... Tobias' advice and learnings from building out a data platform: Advice: when considering a tool, start from what you are actually trying to do. Yes, everyone has tools they want to use because they are cool (or some resume-driven development). Once you have a potential tool, is the capability you want to use an unloved feature or a main part of the product? If it's a feature, will they give it the care and attention it needs? Advice: lean heavily on open source. You can fix things yourself and better direct the community's work than just filing a ticket and hoping with a vendor. Learning: there are likely going to be some painful pieces missing, especially around metadata, as you build out your platform. Advice: build in a modular way and think of what is my escape hatch? Yes, you have to lock yourself in a bit but build with the possibility of a vendor or a tool going away - whether that is your choice (e.g. too expensive) or it literally disappears (anyone remember FoundationDB?). Learning: be prepared for tools to connect with each other but the connection to not be as robust as you want. Again, be prepared to have metadata challenges especially. Advice: build your foundation to be strong. This will limit pain as things evolve and change. You can't build a large building on a bad foundation - or at least it's a BAD idea... Advice: spend the time to work with your data consumers to figure out what questions they want to answer. Then abstract that to build to general challenges instead of point solutions. Learning: it's easy to put data in S3 but it can be painfully difficult to query it.
There's a missing piece as to how to store it for easy querying, not just the metadata issues. Advice: it's okay to pay a vendor to lessen pain. But becoming wholly reliant on them can put you in a bad spot. Advice: look to create paved path / easy path approaches. If someone wants to follow the preset path, it's easy for them. If they want to go their own way, more power to them, but not the data platform team's problem if it isn't working well. Learning: there will be places you didn't expect to bend - again, that metadata layer for Tobias - to get things done sooner. It's okay to not have the end platform built at launch, move forward and get something going. Advice: "one of the perennial problems in technology is the bias towards speed and action without necessarily understanding the destination." Really consider the path and if you are creating a scalable and maintainable solution instead of pushing for speed to deliver something. Advice: consider building a buffer layer between upstream sources so if there are changes, it doesn't automatically break things downstream. Tobias' data platform components: data lakehouse paradigm, Airbyte for data integration (chosen over Meltano), Trino/Starburst Galaxy for distributed querying, AWS S3 for the storage layer, AWS Glue for very basic metadata cataloguing, Dagster as the crucial orchestration layer, dbt Contact Info LinkedIn (https://www.linkedin.com/in/scotthirleman/) Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ () covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning.
Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com)) with your story. To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers Links Data Mesh Community (https://datameshlearning.com/community/) Podcast (https://www.linkedin.com/company/80887002/admin/) OSI Model (https://en.wikipedia.org/wiki/OSI_model) Schemata (https://schemata.app/) Podcast Episode (https://www.dataengineeringpodcast.com/schemata-schema-compatibility-utility-episode-324/) Atlan (https://atlan.com/) Podcast Episode (https://www.dataengineeringpodcast.com/atlan-data-team-collaboration-episode-179/) OpenMetadata (https://open-metadata.org/) Podcast Episode (https://www.dataengineeringpodcast.com/openmetadata-universal-metadata-layer-episode-237/) Chris Riccomini (https://daappod.com/data-mesh-radio/devops-for-data-mesh-chris-riccomini/) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)

Making Sense Of The Technical And Organizational Considerations Of Data Contracts

Data Engineering Podcast

Dec 19, 2022 47:00

Summary One of the reasons that data work is so challenging is because no single person or team owns the entire process. This introduces friction in the process of collecting, processing, and using data. In order to reduce the potential for broken pipelines some teams have started to adopt the idea of data contracts. In this episode Abe Gong brings his experiences with the Great Expectations project and community to discuss the technical and organizational considerations involved in implementing these constraints to your data workflows. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode (https://www.dataengineeringpodcast.com/linode) today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show! Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan's active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. 
Go to dataengineeringpodcast.com/atlan (https://www.dataengineeringpodcast.com/atlan) today to learn more about how Atlan's active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos. Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you're not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo (http://www.dataengineeringpodcast.com/montecarlo) to learn more. Your host is Tobias Macey and today I'm interviewing Abe Gong about the technical and organizational implementation of data contracts Interview Introduction How did you get involved in the area of data management? Can you describe what your conception of a data contract is? What are some of the ways that you have seen them implemented? How has your work on Great Expectations influenced your thinking on the strategic and tactical aspects of adopting/implementing data contracts in a given team/organization? What does the negotiation process look like for identifying what needs to be included in a contract? What are the interfaces/integration points where data contracts are most useful/necessary? 
What are the discussions that need to happen when deciding when/whether a contract "violation" is a blocking action vs. issuing a notification? At what level of detail/granularity are contracts most helpful? At the technical level, what does the implementation/integration/deployment of a contract look like? What are the most interesting, innovative, or unexpected ways that you have seen data contracts used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on data contracts/great expectations? When are data contracts the wrong choice? What do you have planned for the future of data contracts in great expectations? Contact Info LinkedIn (https://www.linkedin.com/in/abe-gong-8a77034/) @AbeGong (https://twitter.com/AbeGong) on Twitter Website (https://www.abegong.com/) Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ () covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com)) with your story. 
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers Links Great Expectations (https://www.abegong.com/) Podcast Episode (https://www.dataengineeringpodcast.com/great-expectations-technical-debt-data-pipeline-episode-117/) Progressive Typing (https://en.wikipedia.org/wiki/Gradual_typing) Pioneers, Settlers, Town Planners (https://blog.gardeviance.org/2015/03/on-pioneers-settlers-town-planners-and.html) Pydantic (https://pydantic-docs.helpmanual.io/) Podcast.__init__ Episode (https://www.pythonpodcast.com/pydantic-data-validation-episode-263/) Typescript (https://www.typescriptlang.org/) Duck Typing (https://en.wikipedia.org/wiki/Duck_typing) Flyte (https://flyte.org/) Podcast Episode (https://www.dataengineeringpodcast.com/flyte-data-orchestration-machine-learning-episode-291/) Dagster (https://dagster.io/) Podcast Episode (https://www.dataengineeringpodcast.com/dagster-software-defined-assets-data-orchestration-episode-309) Trino (https://trino.io/) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)

Build More Reliable Machine Learning Systems With The Dagster Orchestration Engine

The Machine Learning Podcast

Play Episode Listen Later Dec 2, 2022 45:43

Summary Building a machine learning model one time can be done in an ad-hoc manner, but if you ever want to update it and serve it in production you need a way of repeating a complex sequence of operations. Dagster is an orchestration engine that understands the data that it is manipulating so that you can move beyond coarse task-based representations of your dependencies. In this episode Sandy Ryza explains how his background in machine learning has informed his work on the Dagster project and the foundational principles that it is built on to allow for collaboration across data engineering and machine learning concerns. Interview Introduction How did you get involved in machine learning? Can you start by sharing a definition of "orchestration" in the context of machine learning projects? What is your assessment of the state of the orchestration ecosystem as it pertains to ML? modeling cycles and managing experiment iterations in the execution graph how to balance flexibility with repeatability What are the most interesting, innovative, or unexpected ways that you have seen orchestration implemented/applied for machine learning? What are the most interesting, unexpected, or challenging lessons that you have learned while working on orchestration of ML workflows? When is Dagster the wrong choice? What do you have planned for the future of ML support in Dagster? Contact Info LinkedIn (https://www.linkedin.com/in/sandyryza/) @s_ryz (https://twitter.com/s_ryz) on Twitter sryza (https://github.com/sryza) on GitHub Parting Question From your perspective, what is the biggest barrier to adoption of machine learning today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. The Data Engineering Podcast (https://www.dataengineeringpodcast.com) covers the latest on modern data management. Podcast.__init__ () covers the Python language, its community, and the innovative ways it is being used. 
Visit the site (https://www.themachinelearningpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@themachinelearningpodcast.com (mailto:hosts@themachinelearningpodcast.com)) with your story. To help other people find the show please leave a review on iTunes (https://podcasts.apple.com/us/podcast/the-machine-learning-podcast/id1626358243) and tell your friends and co-workers Links Dagster (https://dagster.io/) Data Engineering Podcast Episode (https://www.dataengineeringpodcast.com/dagster-software-defined-assets-data-orchestration-episode-309/) Cloudera (https://www.cloudera.com/) Hadoop (https://hadoop.apache.org/) Apache Spark (https://spark.apache.org/) Peter Norvig (https://en.wikipedia.org/wiki/Peter_Norvig) Josh Wills (https://www.linkedin.com/in/josh-wills-13882b/) REPL == Read Eval Print Loop (https://en.wikipedia.org/wiki/Read%E2%80%93eval%E2%80%93print_loop) RStudio (https://posit.co/) Memoization (https://en.wikipedia.org/wiki/Memoization) MLFlow (https://mlflow.org/) Kedro (https://kedro.readthedocs.io/en/stable/) Data Engineering Podcast Episode (https://www.dataengineeringpodcast.com/kedro-data-pipeline-episode-100/) Metaflow (https://metaflow.org/) Podcast.__init__ Episode (https://www.pythonpodcast.com/metaflow-machine-learning-operations-episode-274/) Kubeflow (https://www.kubeflow.org/) dbt (https://www.getdbt.com/) Data Engineering Podcast Episode (https://www.dataengineeringpodcast.com/dbt-data-analytics-episode-81/) Airbyte (https://airbyte.com/) Data Engineering Podcast Episode (https://www.dataengineeringpodcast.com/airbyte-open-source-data-integration-episode-173/) The intro and outro music is from Hitman's Lovesong feat. 
Paola Graziano (https://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Tales_Of_A_Dead_Fish/Hitmans_Lovesong/) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/)/CC BY-SA 3.0 (https://creativecommons.org/licenses/by-sa/3.0/)

Using AI To Transform Your Business Without The Headache Using Graft

The Machine Learning Podcast

Play Episode Listen Later Aug 16, 2022 67:33

Summary Machine learning is a transformative tool for the organizations that can take advantage of it. While the frameworks and platforms for building machine learning applications are becoming more powerful and broadly available, there is still a significant investment of time, money, and talent required to take full advantage of it. In order to reduce that barrier further Adam Oliner and Brian Calvert, along with their other co-founders, started Graft. In this episode Adam and Brian explain how they have built a platform designed to empower everyone in the business to take part in designing and building ML projects, while managing the end-to-end workflow required to go from data to production. Announcements Hello and welcome to the Machine Learning Podcast, the podcast about machine learning and how to bring it from idea to delivery. Predibase is a low-code ML platform without low-code limits. Built on top of our open source foundations of Ludwig and Horovod, our platform allows you to train state-of-the-art ML and deep learning models on your datasets at scale. Our platform works on text, images, tabular, audio and multi-modal data using our novel compositional model architecture. We allow users to operationalize models on top of the modern data stack, through REST and PQL – an extension of SQL that puts predictive power in the hands of data practitioners. Go to themachinelearningpodcast.com/predibase today to learn more and try it out! Building good ML models is hard, but testing them properly is even harder. At Deepchecks, they built an open-source testing framework that follows best practices, ensuring that your models behave as expected. Get started quickly using their built-in library of checks for testing and validating your model’s behavior and performance, and extend it to meet your specific needs as your model evolves. Accelerate your machine learning projects by building trust in your models and automating the testing that you used to do manually. 
Go to themachinelearningpodcast.com/deepchecks today to get started! Your host is Tobias Macey and today I’m interviewing Brian Calvert and Adam Oliner about Graft, a cloud-native platform designed to simplify the work of applying AI to business problems Interview Introduction How did you get involved in machine learning? Can you describe what Graft is and the story behind it? What is the core thesis of the problem you are targeting? How does the Graft product address that problem? Who are the personas that you are focused on working with both now in your early stages and in the future as you evolve the product? What are the capabilities that can be unlocked in different organizations by reducing the friction and up-front investment required to adopt ML/AI? What are the user-facing interfaces that you are focused on providing to make that adoption curve as shallow as possible? What are some of the unavoidable bits of complexity that need to be surfaced to the end user? Can you describe the infrastructure and platform design that you are relying on for the Graft product? What are some of the emerging "best practices" around ML/AI that you have been able to build on top of? As new techniques and practices are discovered/introduced how are you thinking about the adoption process and how/when to integrate them into the Graft product? What are some of the new engineering challenges that you have had to tackle as a result of your specific product? Machine learning can be a very data and compute intensive endeavor. How are you thinking about scalability in a multi-tenant system? Different model and data types can be widely divergent in terms of the cost (monetary, time, compute, etc.) required. How are you thinking about amortizing vs. passing through those costs to the end user? Can you describe the adoption/integration process for someone using Graft? 
Once they are onboarded and they have connected to their various data sources, what is the workflow for someone to apply ML capabilities to their problems? One of the challenges about the current state of ML capabilities and adoption is understanding what is possible and what is impractical. How have you designed Graft to help identify and expose opportunities for applying ML within the organization? What are some of the challenges of customer education and overall messaging that you are working through? What are the most interesting, innovative, or unexpected ways that you have seen Graft used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Graft? When is Graft the wrong choice? What do you have planned for the future of Graft? Contact Info Brian LinkedIn Adam LinkedIn Parting Question From your perspective, what is the biggest barrier to adoption of machine learning today? Closing Announcements Thank you for listening! Don’t forget to check out our other shows. The Data Engineering Podcast covers the latest on modern data management. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@themachinelearningpodcast.com) with your story. 
To help other people find the show please leave a review on iTunes and tell your friends and co-workers. Links: Graft, High Energy Particle Physics, LHC, Cruise, Slack, Splunk, Marvin Minsky, Patrick Henry Winston, AI Winter, Sebastian Thrun, DARPA Grand Challenge, Higgs Boson, Supersymmetry, Kinematics, Transfer Learning, Foundation Models, ML Embeddings, BERT, Airflow, Dagster, Prefect, Dask, Kubeflow, MySQL, PostgreSQL, Snowflake, Redshift, S3, Kubernetes, Multi-modal models, Multi-task models, Magic: The Gathering. The intro and outro music is from Hitman’s Lovesong feat. Paola Graziano by The Freak Fandango Orchestra / CC BY-SA 3.0 (https://creativecommons.org/licenses/by-sa/3.0/)

61 (S2E19). Reverse ETL, cloud problems, and the wastefulness of package managers

Data Coffee

Play Episode Listen Later Aug 13, 2022 63:21

The hosts of the "Data Coffee" podcast discuss the news and share their thoughts! Shownotes: 4:06 Fire at the Ozon warehouse 9:50 Python client for the Airflow API 11:16 Dagster 1.0 11:58 Reverse ETL 16:42 The 'best' browser for Windows 23:40 Streaming Excel 32:47 Developer opinions: problems with cloud providers 37:16 GitLab plans to delete projects on free... 38:50 Superset 2.0.0 44:50 Download analytics for a single package from npmjs.com 47:57 The Chinese control a house with the power of thought 51:49 A robot or a human on the other side of the screen 55:34 Japanese scientists have discovered parasitic worms... 59:31 Flipper Zero: a photocopier for radio signals or a "Tamag... Cover: Unknown author, CC0, via Wikimedia Commons Website: https://datacoffee.link, Telegram channel: https://t.me/datacoffee, Twitter profile: https://twitter.com/_DataCoffee_ Podcast chat, where you can suggest topics for future episodes and discuss episodes: https://t.me/datacoffee_chat

Re-Bundling The Data Stack With Data Orchestration And Software Defined Assets Using Dagster

Data Engineering Podcast

Play Episode Listen Later Jul 24, 2022 58:14

The current stage of evolution in the data management ecosystem has resulted in domain and use case specific orchestration capabilities being incorporated into various tools. This complicates the work involved in making end-to-end workflows visible and integrated. Dagster has invested in bringing insights about external tools' dependency graphs into one place through its "software defined assets" functionality. In this episode Nick Schrock discusses the importance of orchestration and a central location for managing data systems, the road to Dagster's 1.0 release, and the new features coming with Dagster Cloud's general availability.

Software-Defined Assets

The Data Exchange with Ben Lorica

Play Episode Listen Later Jun 23, 2022 40:49

Nick Schrock is the founder of Elementl, the startup behind Dagster, a popular open source data orchestration platform. We discussed recent trends in data engineering and infrastructure, and Dagster's introduction of software-defined assets, a new approach to managing, maintaining, and orchestrating data declaratively. Download the FREE Report: State of Workflow Orchestration → https://gradientflow.com/2022-workflow-orchestration-survey/?utm_source=gradientflow&utm_medium=DEpodcast Subscribe: Apple • Android • Spotify • Stitcher • Google • AntennaPod • RSS. Detailed show notes can be found on The Data Exchange web site.

Modern Data Stack: Technology, Methodology, or both? w/ Nick Schrock

Catalog & Cocktails

Play Episode Listen Later Dec 2, 2021 59:54

The modern data stack is often defined by the type of technologies that exist within it: cloud-based, open source, low/no code tools, ELT, and reverse ETL. But surely there's more to it… isn't there? What holds the modern data stack together and makes it the architecture of choice for so many data-driven enterprises? Join Tim, Juan and special guest Nick Schrock, founder of Elementl and creator of Dagster and GraphQL, to chat about all things MDS. This episode will feature: Is the modern data stack a methodology or a set of disparate cloud technologies? Thoughts on consolidation among MDS tools. Describe your reaction upon glancing at Matt Turck's latest data landscape diagram.

Laying The Foundation For The Era Of Big Complexity With Dagster

Data Engineering Podcast

Play Episode Listen Later Nov 20, 2021 65:25

The technology for scaling storage and processing of data has gone through massive evolution over the past decade, leaving us with the ability to work with massive datasets at the cost of massive complexity. Nick Schrock created the Dagster framework to help tame that complexity and scale the organizational capacity for working with data. In this episode he shares the journey that he and his team at Elementl have taken to understand the state of the ecosystem and how they can provide a foundational layer for a holistic data platform.

Great Expectations with Abe Gong and Kyle Eaton

Contributor

Play Episode Listen Later Oct 27, 2021 32:18

Eric Anderson (@ericmander) interviews Abe Gong (@AbeGong) and Kyle Eaton (@SuperCoKyle) about Great Expectations, the open-source framework that aims to create a shared standard for data quality. Abe is a core contributor to the project, and the CEO and co-founder of Superconductive, the team backing Great Expectations. Kyle is Growth Lead at Superconductive, and Community Manager of Great Expectations. The team at Superconductive have just launched the new Expectation Gallery to connect contributors and carve out vertical spaces in this ecosystem. Tune in to find out why Great Expectations is the leading open-source project for eliminating pipeline debt.
In this episode we discuss:
  How the Expectation Gallery enables new modes of community engagement
  Superconductive's pivot from healthcare data consulting to open-source data validation
  Collaborative conversations with other data companies
  Abe's advice to future open-source founders on segmenting value
  The vision of Great Expectations as a protocol-level open standard
Links: Great Expectations, Superconductive, Down with Pipeline debt, Cascade Data Labs, Flyte, Dagster, Databricks, pandas
People mentioned: James Campbell (@jpcampbell42)
Other episodes: Dagster with Nick Schrock

[Weekend Drop] Sunil Pai: React and the Meta of the Web

The Swyx Mixtape

Play Episode Listen Later Sep 19, 2021 80:04

A wide-ranging convo with Sunil covering the future of React, the Third Age of JavaScript, and the Meta of online discourse.
Watch on YouTube: https://www.youtube.com/watch?v=H3h1WICelqs
Follow Sunil: https://twitter.com/threepointone
Chapters:
[00:01:40] React and Temporal, Declarative vs Imperative
  My Temporal Explainer: https://twitter.com/swyx/status/1417165270641045505
  https://www.solidjs.com/
[00:12:57] State Charts and Lucylang
  https://lucylang.org/
  XState and Stately https://stately.ai/viz
[00:17:08] The Future of React
[00:25:03] React Streaming Server Rendering vs SSR/JAMstack/DSG/DPR/ISR
  ReactDOMServer.renderToNodeStream()
  Sunil's Slides: https://www.icloud.com/keynote/0MyOJkDIOVfFit76PqJFLvPVg#react-advanced
  https://react-lazy.coolcomputerclub.com/
[00:33:13] Next.js and the Open Source Commons
[00:38:46] The Third Age of JavaScript
  Third Age of JS
  Benedict Evans (not Sinofsky) on Word Processors: https://www.ben-evans.com/benedictevans/2020/12/21/google-bundling-and-kill-zones
[00:45:16] ESbuild vs SWC vs Bun
  Bun (Jarred Sumner) https://twitter.com/jarredsumner/status/1390084458724741121
[00:50:46] Let Non-X Do X: Figma vs Canva, Webflow vs Wix/Squarespace
  Canva vs Figma valuations https://twitter.com/swyx/status/1438102616156917767
[00:52:42] JavaScript Twitter and Notion's 9mb Marketing Site
  Notion 9mb JS Site Tweet
  mrmrs' Components.ai
[01:06:33] React Server Components and Shopify Hydrogen/Oxygen
  https://twitter.com/swyx/status/1410103013885108229
[01:09:18] Categorical Imperatives of Web Platforms: Cloudflare vs AWS, MongoDB vs Auth0, Gatsby vs Netlify
  https://auth0.com/blog/introducing-auth0-actions/
[01:18:34] Wrap-up
Transcript
[00:01:40] React and Temporal, Declarative vs Imperative
[00:01:40] swyx: Okay. So the first topic we want to talk about is React and Temporal, right?
[00:01:43] Sunil Pai: I feel Temporal is introducing a shift into the workflow ecosystem, which is very similar to the one that React introduced to the JavaScript framework ecosystem. [00:01:54] swyx: That's the hope. I don't know if like my explanation of Temporal has reached everybody or has reached you. There are three core opinions, right? The first is that whenever you cross system boundaries, when you call an external API or when you call internal microservices, there's a chance of failure, and that multiplies the more complex the system gets. [00:02:11] So you need a central orchestrator that holds all the retry states and logic, as well as timers, and it tracks all the events and is able to resume from failure. [00:02:21] The second opinion that you should have is you should do event sourcing: rather than try to just write your business logic and then instrument with observability logs after the fact, you should have your logs as the source of truth. And if it's not in the log, it did not happen. [00:02:34] And then the final piece is the workflows as code, which is the one that you're focusing on, which is the programming model, in the sense that like all the other competing workflow engines, like Amazon Step Functions, Apache Airflow, Dagster, like there's a bunch in this category. [00:02:48] They're all sort of JSON and YAML DSLs, and the bind that you find yourself in is that basically you're reinventing a general purpose programming language inside of these JSON and YAML DSLs because you find a need for loops, branching, variables, functions, all the basic stuff. And people find that like at the end of the day, all this tooling is available, you just have to make it run inside of a general purpose programming language. So that's what Temporal offers. [00:03:12] But it's very interesting because it kind of straddles the imperative versus declarative debate, right? [00:03:17] React, people view as declarative.
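The three opinions swyx lists here (a central orchestrator that owns retries, an event log as the source of truth, and workflows written as ordinary code) can be illustrated with a toy sketch in plain Python. This is not Temporal's API: `Orchestrator`, `run_activity`, `make_flaky`, and `workflow` are hypothetical names invented for the illustration, and the in-memory list only stands in for a durable event history.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Orchestrator:
    """Toy central orchestrator (hypothetical): owns retries and an append-only log."""
    max_attempts: int = 3
    events: List[Tuple[str, str]] = field(default_factory=list)

    def run_activity(self, name: str, fn: Callable[[], str]) -> str:
        # Every attempt is logged before and after it runs, so the log,
        # not the business logic, is the record of what happened.
        for attempt in range(1, self.max_attempts + 1):
            self.events.append(("scheduled", f"{name}#{attempt}"))
            try:
                result = fn()
            except Exception as exc:
                self.events.append(("failed", f"{name}#{attempt}: {exc}"))
                continue
            self.events.append(("completed", f"{name}#{attempt}"))
            return result
        raise RuntimeError(f"{name} failed after {self.max_attempts} attempts")


def make_flaky(failures: int) -> Callable[[], str]:
    """Build a fake remote call that fails `failures` times, then succeeds."""
    state = {"left": failures}

    def call() -> str:
        if state["left"] > 0:
            state["left"] -= 1
            raise ConnectionError("upstream unavailable")
        return "ok"

    return call


def workflow(orc: Orchestrator) -> List[str]:
    # "Workflow as code": an ordinary loop and a conditional, no JSON/YAML DSL.
    results = []
    for step in ("charge", "notify"):
        failures = 1 if step == "charge" else 0
        results.append(orc.run_activity(step, make_flaky(failures)))
    return results


orc = Orchestrator()
results = workflow(orc)
```

In this sketch the first `charge` attempt fails and is retried by the orchestrator rather than by the workflow code, and `orc.events` ends up holding the full history of scheduled, failed, and completed attempts.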
And I think it's mostly declarative, like there's imperative escape hatches, and because it's declarative, people can have a single sort of render model of their entire app for the entire tree. And I think it makes sense to them. [00:03:32] And you're saying that that's better, right? That's better than the imperative predecessor of like jQuery and randomly hooking up stuff and not having things tied up together. You sounded like you want it to [00:03:42] Sunil Pai: interrupt. So it's actually two things. One is the jQuery had an imperative API, and then they went way too hard into the declarative side with templating languages and then started reinventing stuff there. [00:03:54] So really react was like, no, you need access to an imperative language to create, you need a fully featured programming language to generate description trees like Dom trees or in this case, a workflow graphs. [00:04:10] swyx: Got it. So it's kind of like a halfway solution, maybe, maybe anyway. So the problem with us is that we're trying to say that imperative is better than declarative, for the purposes of expressing general purpose business logic, which is an interesting sell for me because in all other respects, I'm very used to arguing to declarative is better. [00:04:33] Then there's also an idea that people should build declarative layers on top of us. And I, it's just a very interesting, like back and forth between declarative and imperative that I don't know where I really stands apart from like, wherever we are is never good enough. So we need to add another layer to solve the current problems [00:04:51] Sunil Pai: there. [00:04:51] So there's a phrase for it and I forget what it's called the mechanism. It says that, uh, the system that allows you to execute stuff should not be the same system that prevents you from doing bad things. So there's a core, which is basically a fully featured API. And then you put guard rails around like the experiences. 
[00:05:12] For example, as an example, this is like adding TypeScript on top of JavaScript, let's say, unlike reason ML, let's say like, OCAML or a lot of very strongly type a language where if your code doesn't compile, you can't really run the code in TypeScript. There are times when you're like, you know what? [00:05:29] I need an escape hatch to actually like, do something like really funky here, X, Y, and Z, that that's not even well expressed in either the type system or sometimes even the language itself. You need to like hack it. And like, you might even email a couple of things. Uh, and in react, this was, I think when react came. [00:05:47] It wasn't just that it was a, oh, like there's JSX. It was very much, uh, okay. Uh, I have a lot of existing code, so I can add, React to one part of it and then hook onto the DOM, it renders and then have like this whole jQuery widget that I would like render onto the thing. Uh, so it gave you this whole incremental part to adopting the system, but then like after a point, like react consumes all of it. [00:06:11] And the fuck up with react is if you go too hard into react, doing stuff like animations is like impossible, which is why like we are at least a year or two away from a good animation API in React, or while you use, Framer or whatever Framer has become right now. Like frame of [00:06:27] swyx: motion. No. Um, [00:06:31] Sunil Pai: Yeah, but he's working. [00:06:32] I think Matt is now working on like a new, new thing. That's got a really funky name. Like, it sounds like a robot or something. All right. But it was curious to me that React's biggest deal was that, Hey, like, They talk about it being declarative, but a whole lot of things you wrote were like in regular-ass JavaScript, you would say on click and get an event and start doing things [00:06:53] swyx: beautiful. [00:06:54] It's a perfect blend. [00:06:56] Sunil Pai: Right. And you would suffer with this in. 
So there was the jQuery prototype phase, which was like directly imperative. And then they went hard in the other direction with templating languages, like Jade and Dust. And, uh, there were a number of popular ones at the time. And that's when like even Angular 1 became super popular because they're like, here's the whole kit and caboodle, the whole framework. [00:07:18] And then React came and said, oh, well just the view. But that's because they didn't want to release, like, Relay yet. And they were like, yeah, this is all you need and the whole ecosystem. But anyway, so Temporal for me is particularly interesting for that because it is now clearly making that. [00:07:35] I hate the phrase, but it's a good one. The paradigm shift of like how you start thinking about these systems and you just write some fucking code and then like you start adding on bits and guardrails for the things you want to do, which is, for the few hours I spent going through the docs and failing to get it running on my laptop, [00:07:53] that's my understanding of it. Feel free to correct me. [00:07:56] swyx: Okay. Yeah. And I think you're right, actually, I'll try this messaging on you because it's something that we're consciously designing for. In fact, I have a, one of my API proposals was a React-like API for Temporal. And so essentially what we enable you to do is bundle up each individual service or job into a component that we happen to call workflow. [00:08:21] And my struggle here is that I currently tie component to workflow because what is the component like? It's, it's something that's self-contained that is deterministic. Like it has a strict rule of execution from top to bottom, right. It just does the same thing every single time, uh, where we differ and why I struggle with this is because we put all the side effects into things that we call activities. [00:08:44] That's where all the non-deterministic stuff goes.
And that one gets retried, basically at Temporal's will, and essentially Temporal is serving as the central runtime or framework that has knowledge of all these workflows and activities, and can re-render them based on its internal rules, i.e. retries, timeouts, uh, heartbeats, all that good stuff. So I struggled with things like, which is the component and which is the hook or the effect. [00:09:08] And then there's other concepts. So, uh, we have ways to send signals into individual workflows, right? That's a very important property of the system that you can send data in while it's running and you can get data out while it's running. I'm not sure that's reflected in React at all. So maybe I'm stretching the analogy too much. [00:09:24] Sunil Pai: Solid had an answer for that: the word signal. So, like, SolidJS. This is by Ryan Carniato, of the Marko folks; signals are a first-class concept in the framework. Again, I haven't dived into it in detail in a while, but it feels like an important thing. And I always wondered why React actually didn't have it because props are something that you just like pass. [00:09:46] Right. And it's just a value, like if you like plot it on a graph, for example, it's, let's say if you had to have like a graph of binary values, it would be either zero or like one, and that would be the shape of the graph, but signals are something that can be like something that happens and yeah, just pops up and goes down, like pressing a key on the keyboard. [00:10:06] And that's actually not so easy to define in a, in a React-like system, like, uh, which is why it's kind of hard to build like audio processing graphs with like React or JSX. Um, I don't have like a good answer. I'd probably have to like hack on Temporal a little more, but the idea of like signals as a channel through which you can like send information, and having it as a first-class part of the system, is something that's not represented well in, well, in React at least. [00:10:33] Yeah.
Well, [00:10:34] swyx: isn't that an action? For reducers. [00:10:38] Sunil Pai: An event, effectively. Yes. Like, it's basically one of those actions. [00:10:42] swyx: The problem is that everything just ties right into the component tree, instead of just having the component be a sort of isolated unit that can function independently. [00:10:50] Sunil Pai: That's the other thing, which is a workflow engine isn't a directed acyclic graph. In fact, it could have cycles, and it could have a number of other things, which is the [00:11:00] swyx: beautiful thing, by the way. [00:11:02] For us, coding a subscription platform literally is: charge Stripe, sleep 30 days, charge Stripe again, and then loop infinitely until you cancel, and then you break out of the loop. [00:11:13] That's it. [00:11:13] Sunil Pai: That's awesome, by the way. So I was actually thinking that someone's going to, not implement, uh, someone's going to use Redux Saga on top of Temporal. That's what I was thinking, because then you will have generators that define, like, long-running processes that are just talking to each other. [00:11:30] I think that would be good. Cloudflare also loved Temporal, by the way. Like, we were talking about it for a while, and they're like, oh, this is fundamentally a new thing. And as you can imagine, some engineers were like, well, why isn't this running on Workers? I'm like, I don't know, why isn't it running on Workers? [00:11:43] Like, maybe we should get it there. [00:11:45] swyx: It is fairly heavy duty right now. We're trying to reduce that to a single binary, which could maybe run on Workers. I'm not sure about the memory requirements that you guys have. It could, it's just not a priority for us based on our existing users. [00:12:00] Sunil Pai: Um, I was just, I was saying what they're saying. 
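A minimal sketch of the subscription loop swyx describes at [00:11:02], written as plain imperative code. This is not Temporal's API: `SubscriptionDeps`, `runSubscription` and `simulate` are hypothetical names, and the side effects (charging, sleeping) are injected as ordinary functions so the loop can be run instantly against a simulated clock. In a durable-execution engine the charge would presumably be a retried activity and the sleep a durable timer.

```typescript
// Hypothetical side-effect interface (an assumption for illustration).
// In a real engine these would be retried activities and a durable timer.
interface SubscriptionDeps {
  charge: (amountCents: number) => void;
  sleepDays: (days: number) => void;
  isCancelled: () => boolean;
}

// The "workflow": ordinary control flow that does the same thing every time
// for the same sequence of answers from its dependencies.
function runSubscription(amountCents: number, deps: SubscriptionDeps): number {
  let periodsBilled = 0;
  while (!deps.isCancelled()) {
    deps.charge(amountCents);
    periodsBilled += 1;
    deps.sleepDays(30);
  }
  return periodsBilled;
}

// Drives the loop with a simulated clock instead of real time.
function simulate(cancelOnDay: number): { periodsBilled: number; chargeDays: number[] } {
  let today = 0;
  const chargeDays: number[] = [];
  const periodsBilled = runSubscription(999, {
    charge: () => { chargeDays.push(today); },
    sleepDays: (days) => { today += days; },
    isCancelled: () => today >= cancelOnDay,
  });
  return { periodsBilled, chargeDays };
}
```

With a cancellation on day 75, `simulate(75)` charges on days 0, 30 and 60 and then leaves the loop the next time it checks.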
[00:12:04] They want everything to run on Workers, and I'm like, dude, it's just like one small, weird isolated like condo. [00:12:10] swyx: Ironically, we're also using V8 isolates for our TypeScript runtime. And that's just to make sure that people don't do non-deterministic stuff. So we did mock out everything, which is also pretty cool, because whenever you use a library with, like, setTimeout inside of that library, that persists to us as well. [00:12:25] So we set a durable timer. Your system can go down, and we bring it back up, and you're using our timer, not the JavaScript runtime timer, which is just awesome. There's a trade-off to that, which is, things don't work when you import them like you would in a normal Node.js project. [00:12:39] Mostly because you have to inject them into the environment of the V8 isolate, you can't just randomly import stuff as freely as you would in a normal Node environment. So dependency injection becomes a topic for us. [00:12:57] State Charts and Lucylang [00:12:57] swyx: Um, yeah. We actually clashed a little bit with David Khourshid, because David is on this warpath of, like, everything in a state machine, right? Everything in the time-tested 40 year old JSON format that describes state machines. And we actually thought we were going to be competitive with him for a while, because for him, the thing about writing imperative code is that it's prone to bugs, right? Like, you cannot really see the full span of all the possible states that you're exposing, but in a state machine everything's explicit. So he was butting heads with our founder for a while. [00:13:31] But I think recently he decided that he is better off building on top of us than trying to compete with us on the reliability front. So that's kind of an interesting evolution that has happened over the past year on this topic of declarative versus imperative. [00:13:44] I'm still, like, coming to terms with it. 
Like, I'm not fully okay with it yet, but it clearly is more expressive, and that's something I am very in favor of. And I have genuinely looked at, like, the workflow solution from Google, the workflow solution from Amazon, and they literally have you write the abstract syntax tree by hand in JSON, and there's just absolutely no way that that's going to work. So I'm pretty down with the imperative approach for now. [00:14:09] Sunil Pai: Well, I figured at some point you will run XState on it, and XState should work fairly well, I think, on Temporal. I don't see why it wouldn't. I think that would actually, [00:14:19] swyx: Honestly, I'm not really sure what he's going to charge for. He's pushing the idea of state machines and making it more of a commonly accepted thing. [00:14:26] Sunil Pai: Well, his pitch isn't even state machines. It's very specifically state charts, and I love state charts. I even bought the book, by the way, the Ian Horrocks one, $700. So when I got it on Amazon, it was $180. I was like, cheap. Let's do it. I got really lucky at the time. It fluctuates like mad, by the way, that value. Well, you should expense it now, is what it is. [00:14:46] Um, but, uh, what struck me about the thing? Here's what I tried. I really liked it. And I took, of course, a couple of steps back, and I was trying to understand, well, why isn't it, like, a success? Why don't people get into it? And the truth is that this falls into the intersection of, like, computers and humans, in the sense that, sure. [00:15:07] There are things that can be correct, but there are things that can be expressible as well. Like, I don't even know what code I want to write when I'm sitting down to write it. I love to, like, discover it while I'm writing it, and really. 
All the syntax that we have created, and the abstractions we have created around programming languages, have been purely to express these things and have, let's call it, implicit state machines, even though that implies that it's bad. [00:15:32] Um, so for example, if you look at state charts, there's no real good way to compose two state charts together. You have to, like, manually start wiring them together. And, you know, in React, you say, oh, combo, if you have two components to put together, you put, like, a little, uh, function around it. [00:15:49] And now it's two components in one component. So it's important not just to have a good unit of computation, but to have it be composable with each other, so that you can gather it and then make this whole nesting-doll React DOM tree of things. And I think, until there's an actual language that has state charts as a first-class primitive, much like Lucy, I think that's what Matthew Phillips built. [00:16:15] He wrote an actual language that compiles to state charts, called Lucy Lang. That was very cool, by the way. Like, I really like it. Uh, well, and it's fairly young, so it's too early to say whether people love it or not, other than people like you and me, who look at something like, wow, this is awesome. [00:16:33] Let's all use it. No, it'll take a while to grow. But I think state charts have a bit of dissonance with the languages they're written in right now, because it's not a first-class thing. I mean, it's a JSON object with keys and. Okay. Like, we can do better, maybe. Uh, but I would not bet against David and the people he's hiring. [00:16:53] Like, he's hiring some smart people, you know, they're all pretty intelligent. So I'm curious to see how that plays out. [00:17:00] swyx: I'm just glad that we're not competing. 
Uh, so that's something that resolved itself without any intervention from me, which is very good. [00:17:08] The Future of React [00:17:08] swyx: Well, let's have this conversation since it's related: should React be more of a DSL? [00:17:14] You know, this conversation that happened over this week, so I'll pull it up. [00:17:20] Sunil Pai: Uh, wait, so what I'm seeing, is this the whole Svelte versus React thing that's been happening over the last two, three days? [00:17:25] swyx: Yes. So basically it's saying React is already so far down, almost like its own language. [00:17:30] They should just embrace it more. And instead of using linting to catch rule violations, just make a DSL, people are gonna use it. It's fine. And just, like, build things in so that it's impossible to make these errors that people commonly make. [00:17:47] Sunil Pai: So this is Mike Sherov, uh, he was talking about it. [00:17:51] He mentioned how it shouldn't be a lint rule. And since we already have custom syntax in JSX, we should introduce a couple of other things. So as you can imagine, the React team has thought about this a lot. So the big problem with this all boils down to that fucking dependency array on useEffect, by the way. That's the one that trips people up; everything else is fine. [00:18:09] Like useState, all that is, like, fine. You can get. This is [00:18:13] swyx: what it was. Yeah. People want, like, state, memos and things like, you know, just build the React primitives into the language. [00:18:22] Sunil Pai: So yeah, I think this actually isn't a bad idea, and I think that was the whole deal with hooks. What's the phrase that they use in the docs? [00:18:30] A sufficiently advanced compiler might compile these things at some point, and you're like, oh wow, great job on pushing that responsibility onto the community, React team, well done. 
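A small sketch of why the dependency array mentioned at [00:17:51] trips people up. React's documentation states that each dependency is compared with its previous value using `Object.is`; the helper below is a simplified model of that comparison, not React's source code, and `depsChanged`, `Deps` and the example values are hypothetical names chosen for illustration.

```typescript
type Deps = ReadonlyArray<unknown>;

// Simplified model: an effect re-runs when any dependency differs from
// the previous render's value under Object.is (assumed, per the docs).
function depsChanged(prev: Deps | undefined, next: Deps): boolean {
  if (prev === undefined) return true; // first render: always run
  if (prev.length !== next.length) return true;
  for (let i = 0; i < next.length; i++) {
    if (!Object.is(prev[i], next[i])) return true;
  }
  return false;
}

// Two consecutive "renders", represented only by their dependency arrays.
const userId = 42;

// Same primitive both times: the effect is skipped.
const stablePrimitive = depsChanged([userId], [userId]);

// A fresh object literal each render is never Object.is-equal to the last
// one, so the effect re-runs on every render.
const freshObject = depsChanged([{ id: userId }], [{ id: userId }]);

// An empty array never changes, so the effect never re-runs, even if it
// closes over values that did change (the classic stale-closure bug).
const emptyDeps = depsChanged([], []);
```

Both failure modes are invisible at the call site, which is why the check currently lives in a lint rule rather than in the language.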
[00:18:41] swyx: My joke is, like, it's the React team's equivalent of "assume a frictionless spherical cow" from physics. [00:18:48] Sunil Pai: Exactly. [00:18:48] That's a perfectly spherical cow. [00:18:54] swyx: It will exist. [00:18:57] Sunil Pai: And it's just the five of them, or six of them, hacking on this. And they have to make sure they don't break, like, facebook.com whenever they're working on these things. Imagine, it's taken this long for Concurrent to show up, and Concurrent is nice, by the way. And we can talk about the server rendering API. [00:19:14] Okay. Uh, so React right now is, uh, yeah, that's the one. It shouldn't just be a lint rule. But, uh, inside Facebook only, well, not everybody can see it, but there's an internal Facebook wiki page, which is a list of potential F projects. You know how the React team has Fiber, whatever the hell. [00:19:47] Right? So there's a list of these projects that, or when we do this, uh, project F, I forget what the one for, uh, animation is called. Is it called Fire? Fire was the DOM one. And so there are lists of them, and there are about 15, 20. I'm pretty sure my NDA is done. So hey, so, uh, there's a list of them. [00:20:09] And if you look at them and you start assigning values in terms of work, oh, this is about six months of work, this is about, uh, another six months of work, it strikes you that there's a roadmap for about five to 10 years at least, if not more than that. I mean, look at how long it took to get, like, this. Of course, this was way more foundational. [00:20:26] Those could probably happen a little quicker when it comes, which means the React team is, like, sorely aware of what's missing in React right now. And to an extent that they can't talk about it, because if they do, it becomes, like, a whole thing, and they don't really engage in that conversation. And I don't blame them for it. 
[00:20:44] It's very hard to have this discourse without somebody coming in and saying, well, have you considered CSS transitions? And it's like, yes, we have. We have considered CSS a lot. Uh, so, there are all these projects, like a sufficiently advanced compiler that compiles down to hooks. There's the animation API. [00:21:04] There's, well, Concurrent, et cetera. This whole data fetching thing has been going on for years. And now it's finally starting to come to light, thankfully with collaboration with the Relay team and effectively all of the core, when they built out facebook.com. And those are the time periods that Sebastian looks at and says, yeah, this is how we can execute on this, because it can be prioritized. [00:21:33] It has to be prioritized by either Facebook wanting it or making Facebook want it. So for example, the pitch was, hey, let's rewrite facebook.com, the desktop version, because it's a mishmash of, like, hundreds of React roots on one page. It should be a single React root that does this thing. [00:21:52] Now that we have gotten management to agree to a rewrite, let us now attach it to the concurrent mode thing. And that was also part of it, which is, in the older version, there was a lot of CPU fighting that used to happen between roots, which is why the whole work for the scheduler started, and took, like, two years to fix, effectively. [00:22:08] They're doing a cooperative multitasking VM in JavaScript, which, sure. When you're Facebook, I guess you've got to, like, do these things. Uh, and how does that all, [00:22:18] swyx: was that ever offloaded to the browsers, by the way? Like, I know there was an effort to split it out of React. [00:22:24] Sunil Pai: So I think, last I checked, they were talking to Chrome literally every week. [00:22:29] Uh, but I think it's also been down to, uh, well, what Chrome wants to prioritize at the time. I think it is still going ahead. Again, 
it's the sort of work that takes years, so it's going ahead nice and slowly, uh, which is why it's architected inside React, for the same reason, as, like, it's attached to a global and then read off the global. [00:22:52] I think it's also why you can't have two versions of React on the same page. There's the whole hooks thing. But also, if you have two versions of React, they'll just start fighting with each other on the scheduler, because the scheduler would yield to one, then to the other, then to the other one. [00:23:08] And there would be no, like, central thing that controls what is on the scheduling pipeline. That's from the last, again, this conversation is at least two years old, so maybe they fixed that, but that's the goal of the scheduler. There has to be one scheduler for the thread that everybody comes on to, and, like, tries to pull stuff, uh, with it. [00:23:26] I think it will become a browser API. It's just a question of, like, when. Like, yeah, I mean, the scheduler in React itself has undergone so much change over the last three years. Uh, so maybe we should be glad that it isn't in the browser yet, because it's changed so much. It's coming there. I mean, the fact that they're releasing in November is a big deal. [00:23:45] swyx: You said there's so many projects that you want to ship, and the way to ship it in Facebook is to either convince them that this feature itself is worth it, or you tie it together with something else, like the Facebook, I think it's called FB5, rewrite. [00:24:00] Sunil Pai: Oh yeah. I think it's good for them. Like, it worked, because facebook.com is now more performant. Like, it actually works well and they don't have CPU fighting. The fact that Facebook itself is becoming slightly irrelevant in the world is a whole other conversation. [00:24:17] swyx: Well, you know, it's still used by billions, so, uh, it improves the experience for them. 
[00:24:23] Sunil Pai: I'm only being snarky. [00:24:25] swyx: Uh, but, you know, hopefully there's other properties, like Instagram and WhatsApp, where hopefully it will apply. And then obviously there's the VR efforts as well. Absolutely. Yeah. [00:24:39] Sunil Pai: And that is the future. In fact, uh, server components also happened because they suddenly realized what they could do. The deal with server components and server-side streaming rendering was never about an SSR story, or even SEO. [00:24:54] Facebook doesn't give a fuck about SEO, right? It was about, finally they figured out how to use concurrent mode to have a better UX altogether. [00:25:03] React Streaming Server Rendering vs SSR/JAMstack/DSG/DPR/ISR [00:25:03] Sunil Pai: So, okay. I should probably just keep server components aside for right now. [00:25:06] And I'll just talk about the new streaming rendering API. Okay. [00:25:09] Okay. So I know there's, like, about three styles of rendering. [00:25:14] I say legacy, but legacy is such a dirty word. I don't mean it in the form that it's old. It's, in fact, [00:25:20] swyx: traditional, like, sorry. [00:25:24] Sunil Pai: Uh, heritage. Facebook would say heritage, it's a heritage style of rendering, um, which is the, hey, you use something like Rails or Spring or, it could be Node as well. And you spit out a bunch of HTML, and then you progressively enhance it with sprinkles of JavaScript. Pick your metaphor, there are, like, three or four metaphors that you could use. [00:25:44] Uh, web components actually fall square into this, where it just comes to life only in the browser and then, like, makes stuff interactive. Uh, then there's the fully client-side rendered one. So this is Create React App or, well, a number of smaller players. Then there's server-side rendered. [00:26:04] And server-side rendered is actually, like, it's not just Next.js. 
It's also your Gatsby. I feel like pretty much every, uh, React framework now has some kind of server-side rendering story. Okay. So the next slide goes into what types of server-side rendering things happen. [00:26:20] swyx: There are a lot of subdivisions within here, right? [00:26:22] Like, uh, Gatsby is up here trying to reinvent, like, DSR, DPR or something like that, which is, like, deferred, [00:26:29] Sunil Pai: static, [00:26:32] swyx: DSG, deferred static generation. That's the one. My former employer, Netlify, also has DPR, and these are all, like, variations of this stuff with, [00:26:41] Sunil Pai: like, it's a question of where you put the cache, is what it is. [00:26:46] It's a TLA, a three-letter acronym, to decide where you put the caching in. [00:26:49] Yeah, so there's the whole JAMstack, and that's, like, the whole Netlify story, but also Cloudflare Pages, or even GitHub Pages. [00:26:56] There's no real runtime server rendering. You just generate a bunch of static assets and you chuck it up, and it just works. Then there's fully dynamic, which would be Next.js without any caching. Right? Like, every request gets server-side rendered, then, like, a bundle loads on top of it and, um, suddenly makes it alive. Like, it hydrates it. [00:27:16] And then after that it's effectively a fully client-rendered application. Then there's, okay. So I just said ISR, but like you said, there are, like, three or four after this as well. There's this whole DSG. Yeah. Oh wait. So the new streaming API is actually fundamentally new, because, I don't know if people even know this, but React already has a streaming rendering API. [00:27:37] It's called renderToNodeStream. I think that's the API for it. And the reason that exists is only for a performance thing on the server, where otherwise synchronous renders would block other requests. And for a server that was, uh, heavily trafficked, 
[00:27:57] it would become, like, really slow. So at least with the streaming API, yeah, that's the one, renderToNodeStream, at least with this one it wouldn't clash, and you could interleave the requests that were happening. But it didn't solve anything else. Like, nothing. You couldn't actually do anything asynchronous on it, which kind of fucking sucks, because it looks like it's an asynchronous API, but you can't do anything asynchronous through it. [00:28:18] It's the only thing that, okay, so render to readable stream is cool, because, even if you go to the very last slide, last bit, once. You know what, this is, the very first link, open it up. Like, it says react, lazy.cool computer club. So this is the demo that they have that exists with this new API. [00:28:36] This is what they link to. So if you refresh it a couple of times, I'll show you something that happens here. So you see the little spinner that shows up there, and then the content loads. Yep. So, um, you know what, maybe I can share my screen, because I want to show, like, a couple of things. Uh, [00:28:53] swyx: yeah. I'll fill in some context. Like, I knew that the renderToNodeStream API was not good enough, basically because everyone who is doing SSR was doing, like, a double-pass render just to get the data in. Um, and I know this was a very big sticking point for Airbnb, so much that they were almost, like, forking React or something like that to, [00:29:11] Sunil Pai: they invented a caching API. [00:29:13] They did, like, a whole bunch of things. Okay. So if you have a look here, you'll see that there's a little bit of spinner and then the content comes in. But now what I'm going to do is I'm going to show you the actual HTML. So let's just go to Prettier and just prettify this, so that we can see the content, and I'll show you something that's very, like, fundament. [00:29:32] That's the playground. Playground, paste, big HTML. All right. 
So, you're looking at this HTML it's rendering. By the way, these are special comments that mark suspense boundaries. It's very cool. If you come down here, you'll see a div, which is the spinner. So this is the spinner that you see when you refresh the page. [00:29:52] So this is. And then the rest of, like, the bits that are below that close, and the HTML closes, but content is still streaming in at that point. So, like, these are the actual divs that are coming in with the content. And then a script tag gets injected that says, hey, this thing that just came in, shove it into where the spinner was. [00:30:13] This template [00:30:14] swyx: tag is so small. I would have imagined it was much bigger. [00:30:18] Sunil Pai: It's not. So by the way, at this point, React has not loaded. This is happening without React. This is just a little DOM, much like Svelte, uh, just a little operation that does it. So you get this content. And, uh, so that's the first feature, which is that Suspense. [00:30:35] It not only works out of the box, but fallbacks and replacing of fallbacks with actual content also happen. Um, I want to pull this outside of this main window to show you something. Um, so you can see the content load in, but keep an eye on the loading spinner. Okay. Just to prove a point. So the content loads in, oh man. [00:30:56] Oh, is it cached, just that way? Uh, the content loads in, but the spinner is still going on. That's because there's an artificial delay for the React bundle to show up. That's the point of this demo, which is to show that it can do async. Now you can imagine that it's not just one part of the page. 
[00:31:13] There could be multiple suspense boundaries here, some with something heavy, some with something asynchronous, and they're potentially streaming in, effectively in parallel, after the HTML tag closes, and they load nicely. The other cool feature, which is a feature every framework should steal, is if you do a second refresh, at this point the React bundle, the JavaScript bundle, is cached. [00:31:42] So it loads before the server has finished the streaming. So at that point React says, fuck you, I don't care about the streaming bit anymore, I'm taking over. It's now a client-side app, just automatically, out of the box, because now that would be faster. So it basically races the client and the server side. [00:31:58] So Suspense working out of the box itself is, like, a big deal first. So people will start using it with React.lazy, but then with data fetching and a bunch of, like, styling solutions, which they're also working on. Um, but this is the new server rendering API. The reason I was talking about this, I keep losing context about these things. [00:32:19] I should stop sharing, I guess. Um, the absolute best feature of this, of course, is the reason why it's something that comes out of Facebook, which is, it works with existing applications and you can incrementally add it. So the first thing you will do is you'll take your renderToString, that one line somewhere in your code base which says renderToString. [00:32:39] And you'll replace it with render to readable stream. I mean, [00:32:43] swyx: either way, 99% of users have never used renderToString. Right. That's what Next.js is for. [00:32:51] Sunil Pai: Well, that's the, oh my God. That's part of a whole other conversation, right? [00:32:54] swyx: This is renderToString as a service. [00:32:59] Sunil Pai: The moment you update your version of Next.js, it'll just work. Yes. 
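The fallback-then-swap behaviour walked through in the demo above ([00:29:52] to [00:31:58]) can be modelled without any DOM at all. The sketch below is only a toy model of the behaviour as described in the conversation, not React's wire format or implementation: `PageState`, `renderShell`, `applyStreamedChunk` and `clientTakeover` are hypothetical names, and each suspense boundary is reduced to a string slot.

```typescript
interface PageState {
  slots: Record<string, string>; // boundary id -> HTML currently shown there
  clientTookOver: boolean;       // true once the client bundle renders the page itself
}

// The initial HTML: every boundary shows its fallback (e.g. a spinner).
function renderShell(fallbacks: Record<string, string>): PageState {
  return { slots: { ...fallbacks }, clientTookOver: false };
}

// A later chunk arrives on the same response and replaces one fallback.
// If the client has already taken over, the late chunk is ignored.
function applyStreamedChunk(page: PageState, boundaryId: string, html: string): PageState {
  if (page.clientTookOver) return page;
  if (!(boundaryId in page.slots)) {
    throw new Error("unknown boundary: " + boundaryId);
  }
  return { slots: { ...page.slots, [boundaryId]: html }, clientTookOver: false };
}

// The cached client bundle wins the race and renders every boundary itself.
function clientTakeover(page: PageState, renderOnClient: (boundaryId: string) => string): PageState {
  const slots: Record<string, string> = {};
  for (const id of Object.keys(page.slots)) {
    slots[id] = renderOnClient(id);
  }
  return { slots, clientTookOver: true };
}
```

On a cold load the server chunk fills the slot; on a warm load, where the bundle is cached, `clientTakeover` runs first and the streamed chunk that arrives afterwards changes nothing.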
[00:33:04] swyx: Which is good, which is good. Right? Because, uh, people won't even know, and they will just benefit. But it's a little bit bad. Okay. [00:33:13] Next.js and the Open Source Commons [00:33:13] swyx: And this is a little bit of my criticism, which is that you're blessing a meta framework at the expense of all the others, right? Which admittedly have not been as successful, but, uh, basically React and Chrome picked a winner, and it was Next.js. [00:33:27] Sunil Pai: I've been thinking about this so much. Oh, look, let's get into the meta conversation now. So, standard disclaimers. I think Guillermo is a mensch. [00:33:35] I think the people who work there are incredible. There are some people I'm close to. I'm so happy for them. I know people on the Chrome team who work with these folks. I love them as well. Nicole, for me, is, uh, is a hero. Uh, and of course the React team are all my buddies, I love them. Okay. That being said, the React team is six people, and they don't have the time to build the meta framework. And Guillermo, uh, the one thing he's incredible at is he's great at building relationships. [00:34:03] He's just amazing at that. Like, he, uh, in a very genuine way, there's nothing ulterior about it. Next.js is open source and runs on any Node runtime, and it's designed to do so. There's nothing about it that's special on Vercel. Because of that, the React team feel like, okay, fine. [00:34:20] We can have a primitive, and meta frameworks will solve it. And let's just make sure it works with Next.js, because so many other people use it. They just reach out to them and say, hey, this new API is showing up. Uh, this is not just with Next.js. It's a similar thing with, like, React Testing Library. [00:34:34] When the new act API showed up, right, I made the PRs to React Testing Library. I was like, what you should do is have every function in React Testing Library be wrapped in act. 
So nobody really has to, like, use the API by hand. It's now the default, and it's a very good testing framework. The Chrome team, [00:34:53] and I'm not saying this like it's a bad thing, I think they did the right thing. The Chrome team realized that if they provide performance enhancements to Next.js directly, they can have so much impact on the internet, because so much of the React world is running on Next.js. So fixing how the images are loaded in Next.js certainly makes the internet faster. [00:35:15] Yeah. And maybe that's what we should do also, like, for accessibility: just ship axe, uh, all the axe rules, in development mode, either in, like, React DOM directly, or at least in Next.js. Oh yeah, Svelte ships the axe rules. Yeah. [00:35:33] swyx: Oh, they're enabled by default. And, uh, your app won't compile. Uh, actually, I think it would warn. It won't fail, but it'll warn. [00:35:40] Sunil Pai: Okay. So the Svelte folks should be making way more noise about that. That is such an incredible draw for accessibility. [00:35:48] swyx: The thing is, like, uh, if you think that your problems are solved by axe, then you're taking a very sort of paint-by-numbers approach to accessibility. [00:35:57] Right. Which is actually kind of against the spirit of, uh, what people really want, which is, um, real audits with, like, tab through everything. Like, the stuff that machines could catch is so little. [00:36:08] Sunil Pai: I agree. The whole point of axe is to make sure that all the low-hanging fruit is done by default. [00:36:15] It's like TypeScript, I guess, which is, TypeScript doesn't solve all your bugs, but the stupid "undefined is not a function" ones, it does. Yeah, exactly. Make sure that your images have alt tags, just by default. Like, we can have stronger conversations about tab order once you make sure all your images have alt tags. [00:36:35] swyx: Uh, okay. Anyway, so, so yeah. 
So first of all, yeah, I agree with you on this Chrome thing. And, uh, I think this is open source winning, right? Like, uh, there's a commons. Vercel built the most successful React framework. They invested really hard in it. They had the right abstraction level, you know, not too much, not too little, just the right one. [00:36:55] Uh, and now everyone is finding them as, like, the Schelling point, which is a word I'm coming to use a lot, uh, because, you know, that is the most impact that you reach. Uh, so no hate on any of them. It just happens that a venture-backed startup benefits from all of this. [00:37:11] Sunil Pai: Can you imagine how hard it makes my job? [00:37:13] We don't run Node on Cloudflare Workers, which means Next.js doesn't run on it. It's annoying. [00:37:19] swyx: Oh, is there any attempt to make it run? [00:37:22] Sunil Pai: There are a couple of ways where we can get it to work, but it's a lot of polyfill, and, uh, we'll get that. Like, I expect it to be fixed within the next three to six months, but out of the box, it doesn't run on it. [00:37:35] And for me, in my head, it's not even about Cloudflare Workers. I'm like, oh shit, that's what makes Bezos even richer, because if you want to use Next, you're using AWS Lambda. And that just means more folks are using AWS. I'm just like, okay, I guess. Sure. I know you worked there as well, but it's just very annoying to me, where I'm like, shit. [00:37:56] What's even more interesting is that Node is now moving to implementing web standard APIs inside of it. So they already have the streams implementation. They will have fetch. Fetch will be a Node API. Like, it will be implemented based on standards, which means the request and response objects. 
And once that happens, if people build frameworks on that, then you can say that it will run on Cloudflare Workers, because the Cloudflare Workers API is also, like, a standards-based thing. [00:38:21] So it's an interesting shift in what's happening in the runtime world. Also, conveniently, the person who implemented the web streams implementation in Node just started at Cloudflare, like, last month. Like, James. Oh, James [00:38:38] swyx: now. Okay. Yeah. I recognize [00:38:39] Sunil Pai: a great guy, by the way. Uh, I just love these people who have, like, clarity of thought when they talk. James as well. [00:38:46] The Third Age of JavaScript [00:38:46] swyx: We're kind of moving into the other topic of, like, JavaScript in 2021. Right. So first of all, I have a meta question: how do you keep informed of all this stuff? Like, I had no idea before you told me about this Node stuff. How do you know? [00:38:57] Sunil Pai: I have an internet information junkie problem. [00:39:00] I replaced the weed smoking habit with a Twitter habit. This is what it is. [00:39:05] swyx: You're not on, like, some magic mailing list that tells you all this stuff? Okay. [00:39:09] Sunil Pai: Like, reading the tea leaves is what it is. Like, I keep trying to find out what's going on. The problem [00:39:14] swyx: is, I feel like I'm relatively plugged in, but you're, like, way more plugged in than me. [00:39:22] And then this development with Node adopting web standard APIs, um, is this a response to Deno? [00:39:28] Sunil Pai: I don't know if it's a response to Deno, because I know Mikeal Rogers wrote about this. Like, that we made a mistake by trying to polyfill Node APIs in browser code with, like, modules and stuff. [00:39:41] Right? 
Like that's what the whole Browserify thing was. During those days, when we started actually using the same module system and the word isomorphic came up, what ended up happening was Node APIs were polyfilled in web land, but what should have happened is we should have gone the other way. And this bundle size problem would have been a way smaller problem right now, just because of that. [00:40:07] So I know that the folks at Node have been thinking about it for a while. Maybe Deno finally pushed them to do it, but I don't think it's that reductive. I don't think it's just Deno. It's very much a "this is the right time to do it and we actually can do it now." [00:40:22] So let's flesh it out and do it the right way. And it's hard to do it in Node, right? It's not that you can just implement this thing. Like, what does making an HTTP server mean now? Because the request and response objects are slightly different in shape. So you have to make sure that you don't break existing code. [00:40:39] So it's not as simple as saying, oh, we're just implementing the APIs. That being said, having fetch inside Node proper is going to be great, I think. Excellent. [00:40:47] swyx: Yeah. Yeah, no more node-fetch. You know, my other thoughts on this: I've been doing this talk called the Third Age of JavaScript. Right. [00:40:55] Which is a blog post that I wrote last year that, honestly, I feel quite a bit of imposter syndrome around, because all I did was name a thing that was already happening. You already saw it. I think basically when COVID hit, a lot of people were like, I have a lot of time on my hands, I'm going to make new projects or something. [00:41:14] And then, yeah, so I just named it and I just called out a few trends. 
So the trends I'm talking about are the rise of ES modules first, you know, in development and in production, and concurrently the death of IE11, which I'm also tracking. [00:41:30] Sunil Pai: Yes, those have both come to fruition. [00:41:34] swyx: Which, by the way, I think the US government will have to drop IE11 sometime in the next six months or so, because the usage levels have plummeted. [00:41:43] 3.6% of all visits to the US government website in November 2020 were IE11, and now that has dropped to 1.6, [00:41:51] Sunil Pai: Accelerating. The drop is actually accelerating. [00:41:53] swyx: I don't know if it's accelerating or anything, but it's under the 2% mark that the US government sets for itself. [00:41:59] They have an opportunity to essentially say, once it's stable, you know, there's no chance that it'll ever go back up again. They could just deprecate IE11 for all government websites, and then that will be the signal for all enterprises. And that's it. Yeah. So, and then the second. [00:42:15] Oh, I was going to move on to the second bit. But what were you saying? [00:42:19] Sunil Pai: Oh, just saying that this happened while I was working at JP Morgan over the last year. They did the same shift, and they're like, yeah, we are now a Chrome company. Literally none of our clients are asking for this, and, you know, it was just in the rules somewhere, "oh, we need to target IE11." Some people looked at it and said, okay, fine. [00:42:35] What happened is people were spending money on something that wasn't giving them the returns. And that's when a bank is like, yeah, we don't need [00:42:41] swyx: to do this anymore. Like, you can deprecate free support, right, and just charge for IE11 support, and stop spreading it out among all the other users who are bearing the cost of development and maintenance. 
[00:42:54] The other one was collapsing layers, which is the death of the Unix philosophy. We used to have "one tool does one thing," but now we want to combine everything. So Deno and Rome both have ambitions of putting linter, formatter, test runner, all of that into a single binary, because the idea of what we want out of a default runtime has changed from a very minimalist thing. And I always make the comparison to what word processors used to be like. [00:43:18] So, are you aware of Benedict Evans? He has a blog post, which is amazing, about what the job of a platform should be. And he talks about how in the 1980s word processors used to only let you type words. And if you wanted a horizontal layout, if you wanted word counts, if you wanted footnotes, these were all plugins that you bought and installed separately. [00:43:38] Right. Okay. But as we evolve, as we use all these things, we realize that these are all part of the same tool that we want out of a word processor. So then they absorb all these features. Instead of plugins, they're just part of the platform now. They're now the new table stakes. [00:43:53] So I make that analogy to what the runtimes are already doing, right? Node used to be this much more minimal thing, but now we are expecting more and more out of our default setup with all these tools. It's also very wasteful, because when each of these tools doesn't know about the others, they're all parsing their own ASTs, running their own code. And then, yeah, that's the whole [00:44:12] Sunil Pai: proposition, but yeah. [00:44:14] swyx: Any tool that collapses layers will meet this. Like ESBuild: a standard webpack setup would do, like, five or six AST runs. ESBuild collapsed it to two or three. That's a source of its speed as well. [00:44:27] Sunil Pai: One of my favorite facts about ESBuild is that it is faster to minify the code than to not minify the code when you run. Yes. 
And the reason for that is that when it tries to do the full AST, keep comment nodes, everything else, it has to do a lot more bookkeeping. But the moment it just ditches all those things (because ESBuild doesn't do full minification like something like Terser, but it does do smaller symbol substitution, it removes all whitespace, [00:44:59] and it does some dead code elimination), it's a lot more work to keep the bookkeeping for everything and all the whitespace nodes than to not do it. So ESBuild is actually faster when you have minification turned on. Love it. [00:45:14] swyx: It's amazing. It's amazing. [00:45:16] [00:45:16] ESbuild vs SWC vs Zig [00:45:16] swyx: Do you have opinions on ESBuild versus SWC? [00:45:18] Sunil Pai: Okay. So I like ESBuild, because I was very strongly looking for something a lot more opinionated. I've noticed that the reason codebases rot usually boils down to the cute decisions that you make. In the very beginning of the project, you can do anything. I mean, whichever dumbass came up with the idea of babel-plugin-macros has ruined a lot of lives. [00:45:41] It was me. I came up with it. But then you're, like, tied. So ESBuild is very like its creator, Evan Wallace, which is that it's one of a kind. He's not really interested so much in having community PRs or having suggestions on how it should be built. He has a very strong vision of what it should be like, which is why there are no AST-level plugins and all that jazz. [00:46:08] And because of that, like I said, because he's collapsed the layers and collapsed the size of the development team to just himself, he has such a clear vision of what it should be. So it is good. It would be great for, I want to say, 95% of projects that fall under the things he has designed it for. [00:46:28] Okay. 
And that's a lot of applications. That's a shit ton of applications. That's like everything. But you're hosed if you need anything unique. I'll give you one very good use case that ESBuild will never support. Do you know Relay has this idea of persisted queries? Okay. So for whoever's hearing who doesn't know it, right? [00:46:52] You can write a query inside JavaScript. And when it compiles, it takes out the query and replaces it with just an identifier, like a little eight-character identifier. And it hosts that query on the server side instead. And it says, oh, that eight-character query, you can just hit it as a REST endpoint now. [00:47:11] So you can write the code internally in JavaScript, where it belongs, but it doesn't add to your bundle or whatever it is. ESBuild will never support this, which means if you want to do Relay optimizations on your React codebase, you won't be able to do it at all. You have to, like, add it on yourself, which you could do. [00:47:29] I guess you can still use Babel with it. SWC is meant to be a platform, which is why Next.js will use it, because Next.js is the meta-framework, not just for React, but also for some programming opinions: extracting getServerSideProps, getStaticProps, whatever this thing becomes after Server Components come into play. But there are a number of things that people will always want to do. [00:47:54] The Emotion macro is now fairly popular, so they will want to use it. So I assume they will implement it in Rust. Do you know what Bun is, by the way? Are you following Jarred Sumner? [00:48:10] swyx: Sumner? No. Wait, so [00:48:13] Sunil Pai: He is reimplementing ESBuild, but in a language called Zig. It's another systems programming language. 
[00:48:20] And his claim is that it's about three times faster than ESBuild right now, which is already some 200 times faster than babel-loader with webpack. But it's a language, you said? No. So the language is called Zig, Z-I-G, but the thing he's building is called Bun, B-U-N. He hasn't shared it in public yet. [00:48:41] I think he's actually planning on sharing it, like, next week. I think it's that imminent. He's been sharing numbers right now. Yeah. That's the guy, Jarred. I should've followed him a while ago. Great feed, excellent content. And he's thinking that he's going to implement, [00:48:57] he might actually implement an AST-level plugin or macro API, or possibly just implement the Emotion one. I think he was just, yeah. See, oh, that's literally the tweet right under the main one right there, where he's like, hey, what if we actually just did this in? Uh, oh, [00:49:14] swyx: he's right. He's right with you. [00:49:17] Yeah. Like he's [00:49:17] Sunil Pai: just talking about it, right there. So SWC versus ESBuild, I don't think, is the conversation. I think ESBuild will have a rise. A bunch of people will use it. The best feature of ESBuild is that, because there aren't any cute decisions, you will be able to move away from it to whatever succeeds. [00:49:39] There's nothing custom. [00:49:40] swyx: I believe that was Evan's original idea: that ESBuild was a proof of existence that there's a better way. And he stuck to it for way longer than I thought he would. [00:49:51] Sunil Pai: People are using it in production and everything. Everything about the design is that it's replaceable. [00:49:56] That it's just a, [00:49:59] swyx: that's wonderful. Isn't that amazing when people design their stuff? W. You know, it [00:50:04] Sunil Pai: isn't kind of pressure that he would have had the best. 
Thank goodness it was the successful CTO of Figma, with money in the bank, who is implementing this and didn't have anyone to impress. You know what I mean? [00:50:16] It was like, yeah, let's put in a macro API, and what else do you want? Like, whatever. No, he doesn't [00:50:21] swyx: go. Yeah. But he just needs to police himself and no one else. Right. If you don't like it, [00:50:26] Sunil Pai: this is during his downtime from Figma that he's working on this. [00:50:30] swyx: My secret theory is that he's doing this as a Figma ad. [00:50:33] Like, you know, if the CTO of Figma does this for fun, imagine what it's like to work inside of Figma. I've heard it's pretty great, [00:50:42] Sunil Pai: pretty great working inside of Figma too. Well, the code is, like, really cool. [00:50:46] Let Non-X Do X: Figma vs Canva, Webflow vs Wix/Squarespace [00:50:46] Sunil Pai: Didn't you actually point out that Canva was, like, six times bigger than Figma? [00:50:51] Now [00:50:52] swyx: you wanna talk about that? [00:50:53] Sunil Pai: Oh God. I didn't realize until you pointed it out. [00:50:58] swyx: Incredible. Imagine all the geniuses working at Figma looking at Canva and going, like, yo, I have a thousand times your features and you're six times my size as a business. [00:51:10] Sunil Pai: But I hope every one of those engineers understands the value of sales and reaching out to your actual customers, because [00:51:17] swyx: I don't think it's just sales. [00:51:18] It's more like, this is a category of software called "let non-X do X," right? Let non-designers do design. Whereas Figma is clearly for designers doing design. And there's always going to be, like, three orders of magnitude more non-experts who just want to do basic shit. [00:51:37] Sunil Pai: Oh man. 
I hope Webflow has a multi-billion dollar buyout at some point. [00:51:42] swyx: I mean, yeah, there's clearly something there. The problem with Webflow is that they're too close to code, right? You have to learn CSS, the box model. [00:51:56] Sunil Pai: Yeah. I mean, they do say they're no-code, but really they're a visual, [00:52:00] swyx: if you don't know CSS when using Webflow, you're screwed. [00:52:03] Like [00:52:04] Sunil Pai: that's right. They have the best grid editor on the market too. I have to say that. I [00:52:10] swyx: mean, the UI is just amazing, right? But, you know, there's a reason why the Wixes and the Squarespaces are actually worth more than Webflow, and it's not just because they were around earlier. [00:52:22] They're just easier to use for non-technical people. [00:52:26] Sunil Pai: That's a good, you're talking about, why did we even start talking about this? What did you want to talk about? We were talking [00:52:33] swyx: about the Third Age of JavaScript. So I think we kind of dealt with those topics. [00:52:39] Was there anything else that you want to talk about [00:52:40] in JavaScript land? [00:52:42] JavaScript Twitter and Notion's 9mb Marketing Site [00:52:42] Sunil Pai: I don't know if you have noticed, but I've kind of actually stopped engaging in the JavaScript discourse on Twitter specifically, which actually hurts me a little bit, because that's where all my JavaScript friends are. And it's kind of like, I've seen it all. I've seen JavaScript Twitter now for the last 10, 11 years, I would think. [00:53:02] And I used to participate very heavily. 
And back to the thing that we were just discussing, about the conversations that happened about SPA versus MPA and about the whole Notion blow-up about how they made their thing into, like, 800 KB. Yep. The easiest kind of discourse to have is to have one absolutist opinion, which I saw a number of people in those threads and the surrounding threads have, which is, well, this is bad, or this is good. [00:53:35] And that's all I got to say about it. Now give me, like, 40 likes on this reply. Whereas there's a real opportunity here to understand how, and yeah, that's the one, that's the tweet, by the way. Clearly it got attention. No, [00:53:51] swyx: by the way, I phrased it very neutrally. I actually was pretty careful. [00:53:54] Because I knew that it was going to attract some buzz. I had no idea it was going to be this much, but [00:54:00] Sunil Pai: no, no, no. But I'm so interested in talking about, so this is what I was talking to you about, which is, it's not just about a website at one point in time. It's about the system that generates these kinds of artifacts. So for example, what did they say? [00:54:24] They're 847 KB right now. They're not 847 KB today. They were 847 KB when you tweeted this on the 11th. They are not 847 KB now. They might be 852, or they might be 841. Are you about to check? [00:54:43] swyx: No, no, no, no, I'm not. It doesn't matter. The exact number doesn't matter. I'm going to give you another example, which also came up, which is Netflix. [00:54:49] Remember they ripped out React? And he said they have React back [00:54:54] Sunil Pai: on Netflix? Are you serious? Wait, Netflix, they have both jQuery and React on that page right now. 
But for me, it's interesting that, and I think the most insightful tweet in this was the one that pointed out that nobody noticed this until they told it to us. [00:55:16] Nobody saw it. Nobody bothered. Yeah. That's the one. Nobody bothered about it. It was still making the money. They were happy about it. And they wanted to share that. And we need more of them. We need more people to be sharing the process, because if we react very badly to these things, then fewer people will want to actually share the numbers. [00:55:34] And you won't learn from the industry. But I don't know whether it's a good thing or a bad thing. It does mean that you can make a multi-billion dollar company with a marketing site that's nine MB of JavaScript. And I think people who have very strong opinions about how much JS should be on a page should take a step back and wonder, how do you make it? [00:55:55] So, from the very beginning of running your company, how do you make it so that it doesn't go up beyond that? Also, what opportunities are you abandoning by focusing on making sure your marketing page has, like, 100 KB of JavaScript instead of nine MB? [00:56:17] swyx: Shipping velocity, right? [00:56:19] Sunil Pai: You are spending effort on it somewhere. Just so we're clear, because somebody will look at it and say, fuck you, are you suggesting that we all put in, that's not what I'm saying. I'm just saying that the resources, that word, but resources at these companies are limited, and they're prioritized and sequenced, and you should ask yourself in what order you want to do it and who you're trying to please. Are you trying to please your customers and your users, or the peanut gallery on Twitter? 
[00:56:48] And I think that's why I don't engage so much anymore, because it's so hard to communicate nuance, and somebody will come in with a, well, fuck you, you work for Facebook, or used to work for Facebook. What would you know? I'm like, you got me. That kind of ends the conversation there, right? [00:57:04] Like I'm studying contributed to babies being burned alive or whatever it is. Like, this is what it is. [00:57:12] swyx: It's a nuanced debate, because Notion clearly did some stupid stuff here, right? They could have spent a day. So do you know why it was 9.9 megabytes? [00:57:25] Sunil Pai: If I understand, it was the whole Notion app that was being used, the [00:57:27] swyx: whole app. [00:57:28] Yeah. They were shipping the whole app. There was actually someone from Notion answering me. It's here. Yeah. This guy works at Notion: "Before, the marketing site was another route in our, at the time, 9.1 MB main app. We loaded the whole app just to show the sign-up button." [00:57:44] So what, [00:57:45] Sunil Pai: For what it's worth, Facebook's sign-up page does start prefetching actual Facebook code so that once you log in it loads instantaneously. So there's a reason to do it. It's just that it shouldn't be nine MB, of course. That's [00:58:00] swyx: yeah, they could have, like, taken a day every six months or something, like a perf day, you know, and done that. [00:58:06] So that's why I'm hesitant giving them a pass for, like, okay, so what, you're a multi-billion dollar company? This is embarrassing. This is just unprofessional. Um, so yes,
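
The persisted-query idea Sunil describes earlier in this transcript (pull the query out of the JavaScript at build time, leave a short identifier in the bundle, keep the full query on the server) can be sketched in a few lines. This is only an illustration of the concept, not how the Relay compiler actually works: the gql`...` marker, the SHA-256 hash, the eight-character length, and the function names are all assumptions made for the example.

```python
import hashlib
import re

# Assumption: queries are embedded in the JavaScript source as gql`...`.
QUERY_PATTERN = re.compile(r"gql`(.*?)`", re.DOTALL)


def query_id(query: str) -> str:
    """Derive a short, stable identifier from the query text."""
    return hashlib.sha256(query.encode("utf-8")).hexdigest()[:8]


def persist_queries(source: str):
    """Swap each embedded query for its ID.

    Returns the rewritten source plus a manifest mapping ID -> query,
    which is what the server side would keep.
    """
    manifest = {}

    def swap(match):
        # Collapse whitespace so formatting changes do not change the ID.
        text = " ".join(match.group(1).split())
        qid = query_id(text)
        manifest[qid] = text
        return f'"{qid}"'

    return QUERY_PATTERN.sub(swap, source), manifest


source = "const q = gql`\n  query { user { name } }\n`;"
rewritten, manifest = persist_queries(source)
print(rewritten)
print(manifest)
```

The bundle ends up carrying only the short ID, while the manifest is the part a server would need in order to resolve that ID back to the full query.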

covid-19 god ceo amazon netflix game canada learning australia google business technology future australian drop wrap utah rome nfts whatsapp airbnb vr seo finish incredible honestly jeff bezos dom cto react flat babel shopify ux api redux ironically chrome jp morgan canva ui picasso aws ml server java github notion apis components stripe cms hydrogen vm sis javascript html mb slides temporal apache sba book of mormon imperative cpu css u s kb ast oh god sql wix cloudflare prs node gatsby js mpa kv figma v8 cdn dst lambda sunil dsp aps surat angular aig concurrent mongodb json unix zig typescript isr graphql dsl framers webflow jquery cdm deno ssr tla uda esb dpr ies schelling netlify auth0 jamstack stouffer third age svelte dsg declarative swc turpentine stately jsx benedict evans ecmascript matthew phillips adam morris asts ocaml cloudfare otrs eular gsx dagster james allworth react server components xstate david khourshid settimeout glen martin mikeal rogers sinofsky sunil pai fb5 nanex

#246 Love your crashes, use Rich to beautify tracebacks

Python Bytes

Play Episode Listen Later Aug 11, 2021 46:19

Watch the live stream: Watch on YouTube
About the show
Sponsored by us: Check out the courses over at Talk Python, and Brian's book too!
Special guest: David Smit

Brian #1: mktestdocs
Vincent D. Warmerdam
Tutorial with videos
Utilities to check for valid Python code within markdown files and markdown-formatted docstrings.
Example:

    import pathlib
    import pytest
    from mktestdocs import check_md_file

    @pytest.mark.parametrize('fpath', pathlib.Path("docs").glob("**/*.md"), ids=str)
    def test_files_good(fpath):
        check_md_file(fpath=fpath)

This will take any codeblock that starts with ```python and run it, checking for any errors that might happen.
Putting assert statements in the code block will actually check things.
Other examples in README.md for markdown-formatted docstrings from functions and classes.
Suggested usage is for code in mkdocs documentation. I'm planning on trying it with blog posts.

Michael #2: Redis powered queues (QR3)
via Scot Hacker
QR queues store serialized Python objects (using cPickle by default), but that can be changed by setting the serializer on a per-queue basis.
There are a few constraints on what can be pickled, and thus put into queues.
Create a queue:

    bqueue = Queue('brand_new_queue_name', host='localhost', port=9000)

Add items to the queue:

    >>> bqueue.push('Pete')
    >>> bqueue.push('John')
    >>> bqueue.push('Paul')
    >>> bqueue.push('George')

Getting items out:

    >>> bqueue.pop()
    'Pete'

Also supports deque (double-ended queue), capped collections/queues, and priority queues.

David #3: 25 Pandas Functions You Didn't Know Existed
Bex T
So often, I come across a pandas method or function that makes me go “AH!” because it saves me so much time and simplifies my code.
Example: Transform
Don't normally like these articles, but this one had several “AH” moments, among them: Styler options, convert_dtypes, mask, nsmallest / nlargest, clip, at_time.

Brian #4: FastAPI and Rich Tracebacks in Development
Hayden Kotelman
Rich has, among other cool features, beautiful tracebacks and logging.
FastAPI makes it easy to create web APIs.
This post shows how to integrate the two for APIs that are easy to debug.
It's really only a few simple steps:
Create a dataclass for the logger config.
Create a function that will either install Rich as the handler (while not in production) or use the production log configuration.
Call logging.basicConfig() with the new settings.
And possibly override the logger for Uvicorn.
Article contains all code necessary, including examples of the resulting logging and tracebacks.

Michael #5: Dev in Residence
I am the new CPython Developer in Residence
Report on first week
Łukasz Langa: “When the PSF first announced the Developer in Residence position, I was immediately incredibly hopeful for Python. I think it's a role with transformational potential for the project. In short, I believe the mission of the Developer in Residence (DIR) is to accelerate the developer experience of everybody else.”
The DIR can help by:
providing a steady review stream, which helps deal with the PR backlog;
triaging issues on the tracker, which deals with the issue backlog;
being present in official communication channels to unblock people with questions;
keeping CI and the test suite in a usable state, which further helps contributors focus on their changes at hand;
keeping tabs on where the most work is needed and what parts of the project are most important.

David #6: Dagster
Dagster is a data orchestrator for machine learning, analytics, and ETL.
Great for local development that can be deployed on Kubernetes, etc.
Dagit provides a rich UI to monitor the execution, view detailed logs, etc.
Can deploy to Airflow, Dask, etc.
Quick demo?
References:
https://www.dataengineeringpodcast.com/dagster-data-applications-episode-104/
https://softwareengineeringdaily.com/2019/11/15/dagster-with-nick-schrock/

Extras
Michael: Get a vaccine, please.
Python 3.10 type info... er, make that 3.9, thanks John Hagen. Here is a quick example. All of these are functionally equivalent to PyCharm/mypy:

    # Python 3.5-3.8
    from typing import List, Optional
    def fun(l: Optional[List[str]]) -> None: ...

    # Python 3.9+
    from typing import Optional
    def fun(l: Optional[list[str]]) -> None: ...

    # Python 3.10+
    def fun(l: list[str] | None) -> None: ...

Note how with 3.10 we no longer need any imports to represent this type.
David: Great SQL resource
Joke: Pray
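
The steps listed under Brian's item #4 (a dataclass for the logger config, a function that picks the Rich handler outside production, then logging.basicConfig()) might look roughly like the sketch below. It is not the code from the article: the names LoggerConfig, get_logger_config, and setup_logging are made up for illustration, the Uvicorn override is left out, and it falls back to a plain StreamHandler if Rich is not installed. RichHandler(rich_tracebacks=True) is Rich's own logging handler.

```python
import logging
from dataclasses import dataclass


@dataclass
class LoggerConfig:
    """Settings passed to logging.basicConfig (field names are illustrative)."""
    handlers: list
    format: str
    date_format: str = "%Y-%m-%d %H:%M:%S"
    level: int = logging.INFO


def get_logger_config(production: bool) -> LoggerConfig:
    """Use Rich for development, a plain stream handler otherwise."""
    if not production:
        try:
            from rich.logging import RichHandler  # third-party, optional here
        except ImportError:
            pass  # Rich not installed: fall through to the plain config
        else:
            return LoggerConfig(
                handlers=[RichHandler(rich_tracebacks=True)],
                format="%(message)s",  # Rich renders time and level itself
                level=logging.DEBUG,
            )
    return LoggerConfig(
        handlers=[logging.StreamHandler()],
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    )


def setup_logging(production: bool) -> LoggerConfig:
    """Apply the chosen config to the root logger and return it."""
    config = get_logger_config(production)
    logging.basicConfig(
        level=config.level,
        format=config.format,
        datefmt=config.date_format,
        handlers=config.handlers,
        force=True,  # replace handlers configured earlier (Python 3.8+)
    )
    return config


setup_logging(production=False)
logging.getLogger(__name__).info("logging configured")
```

In a FastAPI app, a call like setup_logging(production=...) would typically run once at startup, with the flag read from whatever settings the app already uses.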

science education pr news rich brain development pray putting web software developers joke programming residence references api qr open source python ui data science tutorials crashes dev optional extras queue utilities cloud computing ide kubernetes software developers web development das k etl redis readme langa airflow psf warmerdam pycharm dagster talk python python3 john hagen vincent d warmerdam

4. ETL tools (guest episode)

Data Coffee

Play Episode Listen Later Jun 26, 2021 42:02

Episode topic: “ETL tools”. The guest on the `Data Coffee` podcast is Dina Safina (Facebook, Telegram), lead developer of the games data warehouse at mail.ru and co-founder of the Russian-speaking Airflow community. Shownotes: 02:05 Two paths in IT: either coffee or alcohol 04:09 What ETL is 08:20 Why do I need ETL if I'm a programmer 09:25 How to choose an ETL tool 11:40 Airflow and other tools 18:57 Airflow internals 27:49 Airflow-as-a-Service 33:57 Other open-source solutions 36:06 Dagster, the Airflow killer. Cover image: https://airflow.apache.org Telegram channel: https://t.me/datacoffee, Twitter profile: https://twitter.com/_DataCoffee_ Podcast chat, where you can suggest topics for future episodes and discuss episodes: https://t.me/datacoffee_chat

telegram etl airflow dagster

Data Orchestration and Config as Code with Nick Schrock, Creator of Dagster

The Sequel Show

Play Episode Listen Later Jun 4, 2021 58:43

In this live recording, I'm joined by Nick Schrock, the creator of Dagster.

creator data code orchestration config dagster

Dagster with Nick Schrock

Contributor

Play Episode Listen Later May 19, 2021 34:58

Eric Anderson (@ericmander) interviews Nick Schrock (@schrockn) about Dagster, the open-source data orchestrator for machine learning, analytics, and ETL. Nick is the founder and CEO of Elementl, and is well-known for creating the Project Infrastructure group at Facebook, which spawned GraphQL and React. On today’s episode of Contributor, Nick explains how he set out to fix an inefficiency he identified amongst the complexity of the data infrastructure domain. In this episode we discuss: Dagster’s place in the industry shift towards thinking of data as a software engineering discipline Why Nick believes it’s time for the term “data cleaning” to be retired The empowerment of Dagster’s instantaneous spin-up process and local development experience How a partner integrated Dagster into workflow for ops workers on the warehouse floor One user’s testimony that, “what dbt did for our SQL, Dagster did for our Python” Links: Dagster Elementl GraphQL React dbt Snowflake Apache Airflow People mentioned: Lee Byron (@leeb) Dan Schafer (@dlschafer) Abe Gong (@AbeGong)

ceo react contributors sql graphql etl eric anderson schrock dagster lee byron

#37 DoK Community: Running Data Replication Pipelines on Kubernetes with Argo // Stephen Bailey

Data on Kubernetes Community

Play Episode Listen Later Mar 25, 2021 62:21

Abstract of the talk… Hundreds of data teams have migrated to the ELT pattern in recent years, leveraging SaaS tools like Stitch or FiveTran to reliably load data into their infrastructure. These SaaS offerings are outstanding and can accelerate your time to production significantly. However, many teams prefer to roll their own tools. One solution in these cases is to deploy singer.io taps and targets — Python scripts that can perform data replication between arbitrary sources and destinations. The Singer specification is the foundation for the popular Stitch SaaS, and it is also leveraged by a number of independent consultants and data projects. Singer pipelines are highly modular. You can pipe any tap to any target to build a data pipeline that fits your needs, making them a good fit for containerized workflows. This article walks through the workflow at a high level and provides some example code to get up and running with some shared templates. I also drill into reasons for choosing the Argo approach over other orchestration tools like Airflow or Dagster, and the implications from a team perspective. Bio… Stephen Bailey is Director of Growth Analytics at Immuta, where he strives to implement privacy best practices while delivering business value from data. He loves to teach and learn, on just about any subject. He holds a PhD in educational cognitive neuroscience from Vanderbilt and enjoys reading philosophy
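
To make the "pipe any tap to any target" idea in the abstract concrete, here is a toy sketch of the message flow. It follows the general shape of the Singer format (one JSON message per line, with SCHEMA, RECORD, and STATE types), but it does not use the Singer libraries, and the `users` stream, the function names, and the in-memory target are all invented for illustration.

```python
import json
from typing import Dict, Iterable, Iterator, List


def tap_users(rows: Iterable[dict]) -> Iterator[str]:
    """Toy tap: emit Singer-style JSON messages, one per line."""
    yield json.dumps({
        "type": "SCHEMA",
        "stream": "users",
        "schema": {"properties": {"id": {"type": "integer"},
                                  "name": {"type": "string"}}},
        "key_properties": ["id"],
    })
    for row in rows:
        yield json.dumps({"type": "RECORD", "stream": "users", "record": row})
    # STATE lets a later run resume; the value here is a placeholder.
    yield json.dumps({"type": "STATE", "value": {"users": "done"}})


def target_memory(lines: Iterable[str]) -> Dict[str, List[dict]]:
    """Toy target: collect RECORD messages per stream, skip everything else."""
    loaded: Dict[str, List[dict]] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        message = json.loads(line)
        if message["type"] == "RECORD":
            loaded.setdefault(message["stream"], []).append(message["record"])
    return loaded


rows = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]
print(target_memory(tap_users(rows)))
```

Because the two sides only share the line-oriented message format, the same shape works when the tap and target are separate processes connected by a shell pipe, which is what makes them easy to run as containers.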

director community running phd data singer saas hundreds vanderbilt python stitch abstract pipelines argo kubernetes replication elt airflow stephen bailey immuta dagster

Latest podcast episodes about dagster

Re-Air: Bridging Gaps: DevRel, Marketing Synergies, and the Future of Data with Pedram Navid of Dagster Labs

From Data Engineering to Context Engineering w/ Nick Schrock

Beyond the Dashboard: Collaborative Analytics in Slack

#381 | Watch It Fly By As the Pendulum Swings

Co-creator of GraphQL and Founder of Dagster Labs - Nick Schrock

Rerun Top 3: How Lydia's former Head of Data is building the data department at May

#126 | Self-Discipline, Going All In

246: AI, Abstractions, and the Future of Data Engineering with Pete Hunt of Dagster

The PRQL: Breaking Down Silos: Collaborative Data Engineering in the AI Era with Pete Hunt of Dagster

#123 | Moms, Worst Job Experiences

241: Marketing Meets Data: Measuring Impact and Driving Results with Pedram Navid of Dagster Labs

The PRQL: Shifting Gears: From Code to Marketing in the Data World with Pedram Navid of Dagster Labs

#121 | Horror Games and Movies, Cartoons, Favorite Rivalries

#119 | Favorite Storytelling Mediums, Multimillionaire What If!

#118 | Nintendo Switch 2, Advice for Our 15-Year-Old Selves, Our Favorite Instrumental Music

#117 | Wrestling, Plans that Backfired

#115 | Toasters, The Things We're Surprised We Still Haven't Done, Puppets

Data Orchestration: DataOps and MLOps. #63

#349 | The Last of Us, Seriously

Trends in Data Engineering – Adrian Brudaru

#112 | Historical Eras, AI, Bad Music

#111 | Fragrances, Phone Calls, World Records

Lessons in Data Engineering: Scaling, AI, and Open Source with Sandy Ryza

#106 | Media Binging, Natural Disasters, Domestic Squabbles

224: Bridging Gaps: DevRel, Marketing Synergies, and the Future of Data with Pedram Navid of Dagster Labs

The PRQL: Developer Relations, Marketing Synergies, and the Future of Data Platforms with Pedram Navid of Dagster Labs

#103 | Tear Jerkers, Competitiveness, Daylight Savings Time

#102 | Things We're NOT Afraid Of, Best Friends, Online Rage Bait

#101 | Disney Remakes, Chivalry, 2024 Reflections

From Marine Aircraft to Data Engineering: Navigating Social Media Shifts and AI Innovation with Alex Noonan

#95 | Reflections, Doctors, Fantasy Worlds

From GraphQL to Dagster Labs: How Nick Schrock Is Reinventing Data Infrastructure

AI Pipelines with Maxime Armstrong and Yuhan Luo

Building Data Tooling with Sandy Ryza | Ep. 43

Episode 39: The Impact of Data Science on Data Orchestration

Release Management For Data Platform Services And Logic

#126 - How Lydia's former Head of Data is building the Data department at May

High Agency Pydantic > VC Backed Frameworks — with Jason Liu of Instructor

Designing A Non-Relational Database Engine

Establish A Single Source Of Truth For Your Data Consumers With A Semantic Layer

Adding Anomaly Detection And Observability To Your dbt Projects Is Elementary

Ship Smarter Not Harder With Declarative And Collaborative Data Orchestration On Dagster+

#454: Data Pipelines with Dagster

2827: Data, Decisions, and Dagster: Nick Schrock's Blueprint for Engineering Excellence

From Concept to Market: The PMF Journey of Dagster

Episode 65: Scaling Data Pipelines with Nick Schrock, Founder/CTO of Dagster Labs

Data Sharing Across Business And Platform Boundaries

Data Engineering, AI, Entrepreneurship - Unveiling 2024's Tech Horizons, A Conversation With Nick Schrock - SPaMCAST 794

Learning and Sharing in Public with Dagster Labs' Pedram Navid

Tackling Real Time Streaming Data With SQL Using RisingWave

Bridging the Developer Education Gap with Tim Castillo of Dagster Labs

The role of AI and LLMs in data -- Pedram Navid // Dagster Labs

Cutting through the noise of data products -- Pedram Navid // Dagster Labs

Revolutionizing Data Engineering: Dagster's Journey and Open-Source Impact with CTO Nick Schrock | Ep 789

171: Machine Learning Pipelines Are Still Data Pipelines with Sandy Ryza of Dagster

The PRQL: Does Machine Learning Need Its Own Orchestrator? Featuring Sandy Ryza of Dagster

Run Your Own Anomaly Detection For Your Critical Business Metrics With Anomstack

Addressing The Challenges Of Component Integration In Data Platform Architectures

The Journey from Engineer to CEO and Lessons Learned Along the Way with Pete Hunt

Pete Hunt, CEO of Elementl/Dagster

How to Work Effectively With Your Data Teams With Nick Schrock, Founder and CTO of Dagster Labs

Open source data orchestration

158: The Orchestration Layer as the Data Platform Control Plane With Nick Schrock of Dagster Labs

The PRQL: The Power of Data Orchestration: A Game-Changer for Data Infrastructure, Featuring Nick Schrock of Dagster Labs

E105: Bringing Great Developer Experience to Data Teams with Dagster

S8 Bonus: Pete Hunt, Dagster

An Overview Of The State Of Data Orchestration In An Increasingly Complex Data Ecosystem