In this episode of The Data Engineering Show, the bros welcome Hannes Mühleisen, CEO of DuckDB Labs and co-creator of DuckDB. They delve into the journey of DuckDB, an analytical database that processes billions of queries every month. Learn why DuckDB prioritizes broad compatibility over specialized optimizations, how its extension model works, and what solutions are emerging for database technology in the age of AI.
In this podcast episode, we talked with Adrian Brudaru about the past, present and future of data engineering.
About the speaker: Adrian Brudaru studied economics in Romania but soon got bored with how creative the industry was, and chose to go instead for the more factual side. He ended up in Berlin at the age of 25 and started a role as a business analyst. At the age of 30, he had enough of startups and decided to join a corporation, but quickly found out that it did not provide the challenge he wanted. As going back to startups was not a desirable option either, he decided to postpone his decision by taking freelance work and has never looked back since. Five years later, he co-founded a company in the data space to try new things. This company is also looking to release open source tools to help democratize data engineering.
0:00 Introduction to DataTalks.Club
1:05 Discussing trends in data engineering with Adrian
2:03 Adrian's background and journey into data engineering
5:04 Growth and updates on Adrian's company, DLT Hub
9:05 Challenges and specialization in data engineering today
13:00 Opportunities for data engineers entering the field
15:00 The "Modern Data Stack" and its evolution
17:25 Emerging trends: AI integration and Iceberg technology
27:40 DuckDB and the emergence of portable, cost-effective data stacks
32:14 The rise and impact of dbt in data engineering
34:08 Alternatives to dbt: SQLMesh and others
35:25 Workflow orchestration tools: Airflow, Dagster, Prefect, and GitHub Actions
37:20 Audience questions: Career focus in data roles and AI engineering overlaps
39:00 The role of semantics in data and AI workflows
41:11 Focusing on learning concepts over tools when entering the field
45:15 Transitioning from backend to data engineering: challenges and opportunities
47:48 Current state of the data engineering job market in Europe and beyond
49:05 Introduction to Apache Iceberg, Delta, and Hudi file formats
50:40 Suitability of these formats for batch and streaming workloads
52:29 Tools for streaming: Kafka, SQS, and related trends
58:07 Building AI agents and enabling intelligent data applications
59:09 Closing discussion on the place of tools like dbt in the ecosystem
A major milestone for leveraging LLMs in R just landed with the new ellmer package, along with a terrific showcase of retrieval-augmented generation combining ellmer and DuckDB. Plus an inspiring roundup of the recent Closeread contest winners.
Episode Links
This week's curator: Sam Parmar - @parmsam@fosstodon.org (Mastodon) & @parmsam_ (X/Twitter)
Announcing ellmer: A package for interacting with Large Language Models in R
Rapid RAG Prototyping: Building a Retrieval Augmented Generation Prototype with ellmer and DuckDB
Winners of the Closeread Prize – Data-Driven Scrollytelling with Quarto
Entire issue available at rweekly.org/2025-W10
Supplement Resources
Coder Radio episode 608 - R with Eric Nantz https://coder.show/608
nhyris - The minimal framework for transform R shiny application into standalone
Supporting the show
Use the contact page at https://serve.podhome.fm/custompage/r-weekly-highlights/contact to send us your feedback
R-Weekly Highlights on the Podcastindex.org - You can send a boost into the show directly in the Podcast Index. First, top-up with Alby, and then head over to the R-Weekly Highlights podcast entry on the index.
A new way to think about value: https://value4value.info
Get in touch with us on social media
Eric Nantz: @rpodcast@podcastindex.social (Mastodon), @rpodcast.bsky.social (BlueSky) and @theRcast (X/Twitter)
Mike Thomas: @mike_thomas@fosstodon.org (Mastodon), @mike-thomas.bsky.social (BlueSky), and @mike_ketchbrook (X/Twitter)
Music credits powered by OCRemix
Watermelon Flava - Breath of Fire III - Joshua Morse, posu yan - https://ocremix.org/remix/OCR01411
Stomp the Summer Sky - Secret of Mana - Ziwtra - https://ocremix.org/remix/OCR00859
In episode 96 of the Entre Dev y Ops podcast we talk about the twenty-fifth anniversary of FOSDEM.
Blog Entre Dev y Ops - https://www.entredevyops.es
Telegram Entre Dev y Ops - https://t.me/entredevyops
Twitter Entre Dev y Ops - https://twitter.com/entredevyops
LinkedIn Entre Dev y Ops - https://www.linkedin.com/company/entredevyops/
Patreon Entre Dev y Ops - https://www.patreon.com/edyo
Amazon Entre Dev y Ops - https://amzn.to/2HrlmRw
Links discussed:
Fosdem 2025 - https://fosdem.org/2025/
Fosdem Treasure Hunt - https://fosdem.org/2025/news/2025-01-30-treasure-hunt/
Curl - https://curl.se/
Luanti (formerly Minetest) - https://www.luanti.org/
0 A.D. - https://play0ad.com/
The Battle for Wesnoth - https://www.wesnoth.org
Talk on JavaScript optimization - https://fosdem.org/2025/schedule/event/fosdem-2025-4391-how-to-lose-weight-optimising-memory-usage-in-javascript-and-beyond/
Talk on DuckDB and graph queries - https://fosdem.org/2025/schedule/event/fosdem-2025-4135-empowering-data-analytics-high-performance-graph-queries-in-duckdb-with-duckpgq/
Talk on a second brain - https://fosdem.org/2025/schedule/event/fosdem-2025-6542-building-your-local-llm-second-brain/
Talk on the Hugging Face ecosystem - https://fosdem.org/2025/schedule/event/fosdem-2025-6341-hugging-face-ecosystem-for-local-ai-ml/
DuckDB - https://duckdb.org
DuckDB Con in Amsterdam - https://duckdb.org/events/2025/01/31/duckcon6/
Leslie Lamport talk - https://fosdem.org/2025/schedule/event/fosdem-2025-4941-was-leslie-lamport-right-/
Paper on consistency - https://www.scs.stanford.edu/17au-cs244b/labs/projects/clow_jiang.pdf
immich - https://immich.app/
FuriLabs - https://furilabs.com/
TinyGo - https://tinygo.org
Gopher Badge - https://gopherbadge.com/
FastHTML - https://fastht.ml/
FastHTML context for LLMs - https://docs.fastht.ml/llms-ctx.txt
Xwiki - https://www.xwiki.org
THE PEN of contention - https://www.amazon.com/Tactical-Multi-Tool-Utility-Screwdriver-Touchscreen/dp/B0BGQXVCFD
In this episode, we explore DuckDB, an open-source analytical database known for its speed and simplicity. Discover how DuckDB stands out in various applications and compare it to other tools like SQLite, Athena, Pandas, and Polars. We also demonstrate integrating DuckDB with AWS Lambda and Step Functions for serverless analytics.
AWS Bites is brought to you by fourTheorem. If you are looking for a partner to architect, develop and modernise on AWS, give fourTheorem a call. Check out fourtheorem.com
In this episode, we mentioned the following resources:
Our `duck-query-lambda`, a Lambda runtime for DuckDB queries: https://github.com/fourTheorem/duck-query-lambda
DuckDB's official website: https://duckdb.org/
LibSQL: https://github.com/tursodatabase/libsql
Do you have any AWS questions you would like us to address? Leave a comment here or connect with us on X/Twitter, BlueSky or LinkedIn:
- https://twitter.com/eoins | https://bsky.app/profile/eoin.sh | https://www.linkedin.com/in/eoins/
- https://twitter.com/loige | https://bsky.app/profile/loige.co | https://www.linkedin.com/in/lucianomammino/
BigDataHebdo welcomes Mehdi, Developer Advocate at MotherDuck, to explore the world of DuckDB and MotherDuck. On the agenda: DuckDB's academic origins, its evolution into a high-performance analytical SQL engine, and its MotherDuck extension, which lets you use it as an online data warehouse. Show notes at http://bigdatahebdo.com/podcast/episode-211-motherduck/
The GeekNarrator memberships can be joined here: https://www.youtube.com/channel/UC_mGuY4g0mggeUGM6V1osdA/join Membership will get you access to member-only videos, exclusive notes and a monthly 1:1 with me. Here you can see all the member-only videos: https://www.youtube.com/playlist?list=UUMO_mGuY4g0mggeUGM6V1osdA
About this episode:
Hey folks - In this episode we have Jelte with us, who is the main contributor to the pg_duckdb project, a Postgres extension that adds the #duckdb power to our beloved #postgresql. We will try to understand how it works, why it is needed, and what the future of pg_duckdb looks like. If you love #Postgres or #Duckdb, or just enjoy understanding #database internals, then this episode will give you pretty solid insights into Postgres query processing, DuckDB analytics, the Postgres extension ecosystem and so on.
Basics: pg_duckdb is a Postgres extension that embeds DuckDB's columnar-vectorized analytics engine and features into Postgres. We recommend using pg_duckdb to build high performance analytics and data-intensive applications.
Chapters:
00:00 Introduction to PG-DuckDB
03:40 Understanding the Integration of DuckDB with Postgres
06:23 Architecture of PG-DuckDB: Query Processing Explained
10:02 Configuring DuckDB for Analytics Queries
15:37 Managing Workloads: Transactional vs. Analytical
21:02 Observability and Debugging in DuckDB
25:58 Data Deletion and GDPR Compliance
30:46 Schema Management and Migration Challenges
33:14 Managing Schema Changes in Databases
35:21 Upgrading Database Extensions
36:33 Enhancing Data Reading Methods
38:33 Future Features and Improvements
45:54 Use Cases for PGDuckDB
50:03 Challenges in Building the Extension
55:25 Getting Involved with PGDuckDB
Important links:
The duckdb discord server, which has a pg_duckdb channel inside it: https://discord.duckdb.org/
repo: https://github.com/duckdb/pg_duckdb
good-first-issue issues: https://github.com/duckdb/pg_duckdb/issues?q=sort%3Aupdated-desc+is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22
Like building real stuff?
Try out CodeCrafters and build amazing real world systems like Redis, Kafka, Sqlite. Use the link below to sign up and get 40% off a paid subscription. https://app.codecrafters.io/join?via=geeknarrator
Link to other playlists. LIKE, SHARE and SUBSCRIBE
If you like this episode, please hit the like button and share it with your network. Also please subscribe if you haven't yet.
Database internals series: https://youtu.be/yV_Zp0Mi3xs
Popular playlists:
Realtime streaming systems: https://www.youtube.com/playlist?list=PLL7QpTxsA4se-mAKKoVOs3VcaP71X_LA-
Software Engineering: https://www.youtube.com/playlist?list=PLL7QpTxsA4sf6By03bot5BhKoMgxDUU17
Distributed systems and databases: https://www.youtube.com/playlist?list=PLL7QpTxsA4sfLDUnjBJXJGFhhz94jDd_d
Modern databases: https://www.youtube.com/playlist?list=PLL7QpTxsA4scSeZAsCUXijtnfW5ARlrsN
Stay Curious! Keep Learning! #sql #postgres #databasesystems
Jordan Tigani is the cofounder and CEO of MotherDuck, a data warehouse platform based on the open source database DuckDB. They've raised $100M in funding from amazing investors like Andreessen Horowitz, Felicis, Madrona, and Altimeter. He was previously the CPO at SingleStore and spent 11 years at Google before that. He has a degree in electrical engineering from Harvard.
Jordan's favorite book: The Master and Margarita (Author: Mikhail Bulgakov)
(00:01) Introduction
(00:08) Founding of MotherDuck
(01:12) The Philosophy of Shipping Products at MotherDuck
(05:02) Founding Story and Identifying the Market Opportunity
(10:57) Building the First Version and Overcoming Early Challenges
(12:23) Validating Customer Needs and Asking the Right Questions
(18:24) Deciding What Features to Prioritize and Exclude
(21:30) Positioning a New Product in a Mature Market
(27:36) Overcoming Challenges in Scaling MotherDuck
(32:29) Measuring Success of New Features in Enterprise Products
(36:20) Structuring the Organization for Effective Execution
(41:09) Preparing MotherDuck for the AI Native Era
(43:28) Rapid Fire Round
Where to find Jordan Tigani:
LinkedIn: https://www.linkedin.com/in/jordantigani/
Where to find Prateek Joshi:
Newsletter: https://prateekjoshi.substack.com
Website: https://prateekj.com
LinkedIn: https://www.linkedin.com/in/prateek-joshi-91047b19
Twitter: https://twitter.com/prateekvjoshi
Michael and Nikolay are joined by Joe Sciarrino and Jelte Fennema-Nio to discuss pg_duckdb: what it is, how it started, what early users are using it for, and what they're working on next.
Here are some links to things they mentioned:
Joe Sciarrino https://postgres.fm/people/joe-sciarrino
Jelte Fennema-Nio https://postgres.fm/people/jelte-fennema-nio
pg_duckdb https://github.com/duckdb/pg_duckdb
Hydra https://www.hydra.so
MotherDuck https://motherduck.com
The problems and benefits of an elephant with a beak (lightning talk by Jelte) https://www.youtube.com/watch?v=ogvbKE4fw9A&list=PLF36ND7b_WU4QL6bA28NrzBOevqUYiPYq&t=1073s
pg_duckdb announcement post (by Jordan and Brett from MotherDuck) https://motherduck.com/blog/pg_duckdb-postgresql-extension-for-duckdb-motherduck
pg_duckdb 0.2 release https://github.com/duckdb/pg_duckdb/releases/tag/v0.2.0
What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!
Postgres FM is produced by:
Michael Christofides, founder of pgMustard
Nikolay Samokhvalov, founder of Postgres.ai
With special thanks to:
Jessie Draws for the elephant artwork
Talk Python To Me - Python conversations for passionate developers
Join me for an insightful conversation with Alex Monahan, who works on documentation, tutorials, and training at DuckDB Labs. We explore why DuckDB is gaining momentum among Python and data enthusiasts, from its in-process database design to its blazingly fast, columnar architecture. We also dive into indexing strategies, concurrency considerations, and the fascinating way MotherDuck (the cloud companion to DuckDB) handles large-scale data seamlessly. Don't miss this chance to learn how a single pip install could totally transform your Python data workflow! Episode sponsors Sentry Error Monitoring, Code TALKPYTHON Data Citizens Podcast Talk Python Courses Links from the show Alex on Mastodon: @__Alex__ DuckDB: duckdb.org MotherDuck: motherduck.com SQLite: sqlite.org Moka-Py: github.com PostgreSQL: www.postgresql.org MySQL: www.mysql.com Redis: redis.io Apache Parquet: parquet.apache.org Apache Arrow: arrow.apache.org Pandas: pandas.pydata.org Polars: pola.rs Pyodide: pyodide.org DB-API (PEP 249): peps.python.org/pep-0249 Flask: flask.palletsprojects.com Gunicorn: gunicorn.org MinIO: min.io Amazon S3: aws.amazon.com/s3 Azure Blob Storage: azure.microsoft.com/products/storage Google Cloud Storage: cloud.google.com/storage DigitalOcean: www.digitalocean.com Linode: www.linode.com Hetzner: www.hetzner.com BigQuery: cloud.google.com/bigquery DBT (Data Build Tool): docs.getdbt.com Mode: mode.com Hex: hex.tech Python: www.python.org Node.js: nodejs.org Rust: www.rust-lang.org Go: go.dev .NET: dotnet.microsoft.com Watch this episode on YouTube: youtube.com Episode transcripts: talkpython.fm --- Stay in touch with us --- Subscribe to Talk Python on YouTube: youtube.com Talk Python on Bluesky: @talkpython.fm at bsky.app Talk Python on Mastodon: talkpython Michael on Bluesky: @mkennedy.codes at bsky.app Michael on Mastodon: mkennedy
Hannes Muhleisen is the creator of DuckDB and CEO of DuckDB Labs. We finally got a chance to meet in person at the Forward Data Conference in Paris. We hit it off immediately, and at times, I felt like I was talking with my long lost brother. Hannes is a very cool guy! While at the conference, we recorded a chat about all things DuckDB, the challenges of data lakehouses and open table formats, local-first tech, and much more.
Isaac Brodsky discusses the integration of H3, an open-source hierarchical hexagonal grid system, with DuckDB, an analytical SQL database, to enhance geospatial data analysis. This combination enables efficient querying and manipulation of diverse datasets in real time.
Elizabeth Christensen of Crunchy Data walks us through how to use open source tooling to avoid paying the Esri tax. It was a great tour of the options and also a nice vibe check of the industry. A headline here is that she echoes former guest Stephanie May in endorsing DuckDB. She also wanted to pass on that a great way to find out more is to attend PostGIS Day! More here.
On the topic of resources, Elizabeth has been very helpful and provided the following set of links:
# PostgreSQL, open source relational database https://www.postgresql.org/
# PostGIS, open source GIS data store https://postgis.net/
# PostGIS Day 2024 https://www.crunchydata.com/community/events/postgis-day-2024
# Crunchy Data, Postgres and PostGIS services provider https://www.crunchydata.com/
# Open Source Geospatial Foundation https://www.osgeo.org/
# QGIS download, open source mapping https://www.qgis.org/
# Simple map SQL queries as QGIS layers https://www.crunchydata.com/blog/connecting-qgis-to-postgres-and-postgis
# pg_tileserv - Tile server for PostGIS https://github.com/CrunchyData/pg_tileserv
# pg_featureserv - API JSON server for PostGIS https://github.com/CrunchyData/pg_featureserv/
# OpenLayers project https://openlayers.org/
# OpenLayers + PgRouting + pg_tileserv + pg_featureserv sample code https://github.com/CrunchyData/pg_featureserv/tree/master/demo
# PostGIS day videos https://www.youtube.com/@CrunchyDataPostgres
# Crunchy Data's Postgres Playground https://www.crunchydata.com/developers/tutorials
# Really cool open source GIS people to follow
Paul Ramsey @ Crunchy Data / cleverelephant
Regina Obe @ Paragon
Ryan Lambert @ RustProofLabs
Cliff Patterson @ Luna Geospatial
Matt Forrest @ Whereabots
# Elizabeth's crunchy blogs https://www.crunchydata.com/blog/author/elizabeth-christensen
# Elizabeth's LinkedIn https://www.linkedin.com/in/elizabeth-garrett-christensen/
# Elizabeth's Twitter https://twitter.com/sqlliz
THE GEOSPATIAL INDEX
The Geospatial Index is a comprehensive listing of all publicly traded geospatial businesses worldwide. Why? The industry is growing at ~5% annually (after inflation and after adjusting for base rates). This rate varies significantly, however, by sub index. Starting from $480,000, this growth rate compounds to roughly $5,000,000 over a working life. This channel, Bluesky account, newsletter, watchlist and podcast express the view that you are serious about geospatial if you take the view of an investor, venture capitalist or entrepreneur. You are expected to do your own research. This is not a replacement for that. This is not investment advice. Consider it entertainment.
NOT THE OPINION OF MY EMPLOYER
NOT YOUR FIDUCIARY
NOT INVESTMENT ADVICE
Bluesky: https://bsky.app/profile/geospatialindex.bsky.social
LinkedIn: https://uk.linkedin.com/in/geospatialindex
Watchlist: https://www.tradingview.com/watchlists/123254792/
Newsletter: https://www.geospatial.money/
Podcast: https://open.spotify.com/show/5gpQUsaWxEBpYCnypEdHFC
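As a rough check of the compounding claim in The Geospatial Index blurb (about $480,000 growing at roughly 5% a year to about $5,000,000), here is a minimal Python sketch that solves for the number of years required. It assumes a constant annual rate and annual compounding, neither of which the blurb states, and the variable names are illustrative only.

```python
import math

start_value = 480_000      # starting amount from the blurb, in dollars
target_value = 5_000_000   # claimed end amount, in dollars
annual_growth = 0.05       # ~5% real growth per year (assumed constant)

# Solve start * (1 + g)^n = target for n.
years_needed = math.log(target_value / start_value) / math.log(1 + annual_growth)

# Value after a whole number of years, for comparison with the target.
value_after_48_years = start_value * (1 + annual_growth) ** 48

print(f"Years needed: {years_needed:.1f}")
print(f"Value after 48 years: ${value_after_48_years:,.0f}")
```

Under these assumptions the horizon comes out at about 48 years, so the figure only holds for a long working life with uninterrupted growth.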
We are on the other side of "big data" hype, but what is the future of analytics and how does AI fit in? Till and Adithya from MotherDuck join us to discuss why DuckDB is taking the analytics and AI world by storm. We dive into what makes DuckDB, a free, in-process SQL OLAP database management system, unique, including its ability to execute lightning-fast analytics queries against a variety of data sources, even on your laptop! Along the way we dig into the intersections with AI, such as text-to-SQL, vector search, and AI-driven SQL query correction.
A founding engineer on Google BigQuery and now at the helm of MotherDuck, Jordan Tigani challenges the decade-long dominance of Big Data and introduces a compelling alternative that could change how companies handle data. Jordan discusses why Big Data technologies are overkill for most companies, how MotherDuck and DuckDB offer fast analytical queries, and lessons learned as a technical founder building his first startup.
Watch the episode with Tomasz Tunguz: https://youtu.be/gU6dGmZzmvI
Website - https://motherduck.com
Twitter - https://x.com/motherduck
Jordan Tigani
LinkedIn - https://www.linkedin.com/in/jordantigani
Twitter - https://x.com/jrdntgn
FIRSTMARK
Website - https://firstmark.com
Twitter - https://twitter.com/FirstMarkCap
Matt Turck (Managing Director)
LinkedIn - https://www.linkedin.com/in/turck/
Twitter - https://twitter.com/mattturck
(00:00) Intro
(00:56) What is the Small Data?
(06:56) Marketing strategy of MotherDuck
(08:39) Processing Small Data with Big Data stack
(15:30) DuckDB
(17:21) Creation of DuckDB
(18:48) Founding story of MotherDuck
(24:08) MotherDuck's community
(25:25) MotherDuck of today ($100M raised)
(33:15) Why MotherDuck and DuckDB are so fast?
(39:08) The limitations and the future of MotherDuck's platform
(39:49) Small Models
(42:37) Small Data and the Modern Data Stack
(46:47) Making things simpler with a shift from Big Data to Small Data
(50:04) Jordan Tigani's entrepreneurial journey
(58:31) Outro
Bringing tidy principles to a fundamental visualization for gene expressions, being on your best "behavior" for organizing your tests, and how data.table stacks up to DuckDB and polars for reshaping your data layouts.
Episode Links
This week's curator: Jon Carroll - @jonocarroll@fosstodon.org (Mastodon) & @carroll_jono (X/Twitter)
Exploring the tidyHeatmap R package
Don't Expect That "Function Works Correctly", Do This Instead
Comparing data.table reshape to duckdb and polars
Entire issue available at rweekly.org/2024-W43
Supplement Resources
tidyHeatmap: Draw heatmap simply using a tidy data frame https://stemangiola.github.io/tidyHeatmap/
Novel App knock-in mouse model shows key features of amyloid pathology and reveals profound metabolic dysregulation of microglia https://molecularneurodegeneration.biomedcentral.com/articles/10.1186/s13024-022-00547-7
Shiny App-Packages chapter on writing tests and specifications https://mjfrigaard.github.io/shiny-app-pkgs/test_specs.html
WANT CLEANER UNIT TESTS? TRY ARRANGE, ACT, ASSERT COMMENTS https://jakubsob.github.io/blog/want-cleaner-test-try-arrange-act-assert/
Super Data Science Podcast 827: Polars: Past, Present and Future, with Polars Creator Ritchie Vink https://www.superdatascience.com/podcast/827
duckplyr: A DuckDB-backed version for dplyr https://duckplyr.tidyverse.org/
Supporting the show
Use the contact page at https://serve.podhome.fm/custompage/r-weekly-highlights/contact to send us your feedback
R-Weekly Highlights on the Podcastindex.org - You can send a boost into the show directly in the Podcast Index. First, top-up with Alby, and then head over to the R-Weekly Highlights podcast entry on the index.
A new way to think about value: https://value4value.info
Get in touch with us on social media
Eric Nantz: @rpodcast@podcastindex.social (Mastodon) and @theRcast (X/Twitter)
Mike Thomas: @mike_thomas@fosstodon.org (Mastodon) and @mike_ketchbrook (X/Twitter)
Music credits powered by OCRemix
Black Feathers in the Sky - Kid Icarus: Uprising - MkVaff - https://ocremix.org/remix/OCR04200
Cross-Examination - Phoenix Wright: Ace Attorney - PrototypeRaptor - https://ocremix.org/remix/OCR01846
What makes MotherDuck and DuckDB a game-changer for data analytics? Join us as we sit down with Jacob Matson, a renowned expert in SQL Server, dbt, and Excel, who recently became a developer advocate at MotherDuck. During this episode, Jacob shares his compelling journey to MotherDuck, driven by his frequent use of DuckDB for solving data challenges. We explore the unique attributes of DuckDB, comparing it to SQLite for analytics, and uncover its architectural benefits, such as utilizing multi-core machines for parallel query execution. Jacob also sheds light on how MotherDuck is pushing the envelope with their innovative concept of multiplayer analytics.
Our discussion takes a deep dive into MotherDuck's innovative tenancy model and how it impacts database workloads, highlighting the use of DuckDB format in Wasm for enhanced data visualization. Jacob explains how this approach offers significant compression and faster query performance, making data visualization more interactive. We also touch on the potential and limitations of replacing traditional BI tools with Mosaic, and where MotherDuck stands in the modern data stack landscape, especially for organizations that don't require the scale of BigQuery or Snowflake. Plus, get a sneak peek into the upcoming Small Data Conference in San Francisco on September 23rd, where we'll explore how small data solutions can address significant problems without relying on big data. Don't miss this episode packed with insights on DuckDB and MotherDuck innovations!
Small Data SF Signup Discount Code: MATSON100
What's New In Data is a data thought leadership series hosted by John Kutay who leads data and products at Striim. What's New In Data hosts industry practitioners to discuss latest trends, common patterns for real world data patterns, and analytics success stories.
Like every other kind of technology, when it comes to databases there's no one-size-fits-all solution that's going to be the best thing for the job every time. That's what drives innovation and new solutions. It's ultimately also the story behind DuckDB, an open source relational database specifically designed for the demands of online analytical processing (OLAP), and particularly useful for data analysts, scientists and engineers. To get a deeper understanding of DuckDB and how the product has developed, on this episode of the Technology Podcast hosts Ken Mugrage and Lilly Ryan are joined by Thoughtworker Ned Letcher and Thoughtworks alumnus Simon Aubury. Ned and Simon explain the thinking behind DuckDB, the design decisions made by the project and how it's being used by data practitioners in the wild. Learn more about DuckDB: https://duckdb.org/why_duckdb.html
In this episode we share our weekly discoveries, discuss VPNs in Russia, compare Swift and Rust, and talk about DirectX 9, Windows 10, DuckDB 1.1.0 and retro gaming.
[00:03:22] What we learned this week
The first professional hosting of cloud VPS/VDS servers — VDSina
Open Data Protocol — Wikipedia
A DIY welding inverter for $5! https://www.amazon.co.uk/dp/B0C9WWCQ82/ref=emc_bcc_2_i?th=1
[00:03:39] A VPN that…
Jordan Tigani is back to chat about why small data is awesome, data lakehouses, DuckDB, AI, and much more. MotherDuck: https://motherduck.com/ LinkedIn: https://www.linkedin.com/in/jordantigani/ Twitter: https://twitter.com/jrdntgn?lang=en
In this episode of AI + a16z, a16z General Partner Jennifer Li joins MotherDuck Cofounder and CEO Jordan Tigani to discuss DuckDB's spiking popularity as the era of big data wanes, as well as the applicability of SQL-based systems for AI workloads and the prospect of text-to-SQL for analyzing data.
Here's an excerpt of Jordan discussing an early win when it comes to applying generative AI to data analysis:
"Everybody forgets syntax for various SQL calls. And it's just like in coding. So there's some people that memorize . . . all of the code base, and so they don't need auto-complete. They don't need any copilot. . . . They don't need an IDE; they can just type in Notepad. But for the rest of us, I think these tools are super useful. And I think we have seen that these tools have already changed how people are interacting with their data, how they're writing their SQL queries.
"One of the things that we've done . . . is we focused on improving the experience of writing queries. Something we found is actually really useful is when somebody runs a query and there's an error, we basically feed the line of the error into GPT 4 and ask it to fix it. And it turns out to be really good.
". . . It's a great way of letting you stay in the flow of writing your queries and having true interactivity."
Learn more:
Small Data SF conference
DuckDB
Follow everyone on X:
Jordan Tigani
Jennifer Li
Derrick Harris
Check out everything a16z is doing with artificial intelligence here, including articles, projects, and more podcasts.
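The error-repair loop Jordan describes in the excerpt (run a query, and on failure hand the error to a model and retry with its suggestion) can be sketched roughly as follows. This is a minimal illustration and not MotherDuck's implementation: it uses Python's built-in `sqlite3` as a stand-in for the query engine, and `toy_fixer` is a hypothetical placeholder for the LLM call, here just patching one known typo.

```python
import sqlite3
from typing import Callable


def run_with_repair(
    conn: sqlite3.Connection,
    sql: str,
    suggest_fix: Callable[[str, str], str],
    max_attempts: int = 2,
):
    """Run `sql`; on a database error, ask `suggest_fix` for a corrected query and retry."""
    for attempt in range(max_attempts + 1):
        try:
            return sql, conn.execute(sql).fetchall()
        except sqlite3.Error as exc:
            if attempt == max_attempts:
                raise  # give up after the allowed number of repairs
            # Hand the failing query and the error message to the fixer.
            sql = suggest_fix(sql, str(exc))


def toy_fixer(sql: str, error: str) -> str:
    # Hypothetical placeholder for a model call: repairs a single known typo.
    return sql.replace("SELEC ", "SELECT ")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ducks (name TEXT, weight REAL)")
conn.executemany("INSERT INTO ducks VALUES (?, ?)", [("Mallard", 1.2), ("Eider", 2.1)])

# The first attempt fails with a syntax error; the repaired query succeeds.
final_sql, rows = run_with_repair(conn, "SELEC name FROM ducks ORDER BY name", toy_fixer)
print(final_sql)
print(rows)
```

Capping the number of repair attempts keeps a bad suggestion from looping forever, and returning the corrected SQL alongside the rows lets the user see what was changed.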
I had someone ask me about DuckDB recently. Did I think it was a good choice for a database? I don't really know. From their blog and some online research, maybe, but it's also a minority player in a niche space. I had a chat recently with someone who had implemented ArangoDB, a graph database. Why that and not Neo4J, I asked them. Someone at the company had tried it and recommended it. Not a bad reason, as I think experience with tech is important, but it's not the only thing. Read the rest of Trying New Technology
DuckDB is an open-source column-oriented relational database that was first released in 2019. It's designed to provide high performance on complex queries against large databases, and focuses on online analytical processing workloads. Hannes Mühleisen is the Co-Creator of DuckDB, and is the CEO and Co-Founder of DuckDB Labs. He joins the show to talk about … The post DuckDB with Hannes Mühleisen appeared first on Software Engineering Daily.
DuckDB's become a favourite data-handling tool of mine, simply because it does so many small things well. It can read and write a huge number of data formats; it can infer schemas automatically when you just want to move quickly; and it can interface with most languages, run like lightning on the desktop or be embedded into a webpage. I'm a huge fan.
But I'm not nearly as knowledgeable as this week's two fans, Simon Aubury and Ned Letcher, who've just written a book on all the many ways you can use DuckDB and all the hidden tricks and tips that help you make the most of it. So in this episode we're taking a practical look at DuckDB, what problems it can solve at work, and how to start getting the most out of it.
Getting Started with DuckDB (book): https://packt.link/byKYt
DuckDB episode with Hannes Mühleisen: https://youtu.be/pZV9FvdKmLc
DuckDB: https://duckdb.org/
dplyr, the data-manipulation language: https://dplyr.tidyverse.org/
duckplyr, DuckDB's ‘native' version: https://github.com/duckdblabs/duckplyr
Substrait: https://substrait.io/
Observable (Markdown+DuckDB=Reports): https://observablehq.com/framework/
DuckDB's “friendly” SQL: https://duckdb.org/docs/sql/dialect/friendly_sql.html
Community Extensions: https://community-extensions.duckdb.org/
DuckCon #5: https://duckdb.org/2024/08/15/duckcon5.html
Support Developer Voices on Patreon: https://patreon.com/DeveloperVoices
Support Developer Voices on YouTube: https://www.youtube.com/@developervoices/join
Simon on Twitter: https://x.com/SimonAubury
Ned on Twitter: https://x.com/nletcher
Kris on Mastodon: http://mastodon.social/@krisajenkins
Kris on LinkedIn: https://www.linkedin.com/in/krisjenkins/
Kris on Twitter: https://twitter.com/krisajenkins
The latest updates to the rayverse bring new meaning to smoothing out the rough edges of your next 3-D visualization, the momentum of DuckDB continues with the MotherDuck data warehouse, and the role nanoparquet plays to bring the benefits of parquet to small data sets.
Episode Links
This week's curator: Eric Nantz: @rpodcast@podcastindex.social (Mastodon) and @theRcast (X/Twitter)
Sculpting the Moon in R: Subdivision Surfaces and Displacement Mapping
Joining the flock from R: working with data on MotherDuck
nanoparquet 0.3.0
Entire issue available at rweekly.org/2024-W26
Supporting the show
Use the contact page at https://serve.podhome.fm/custompage/r-weekly-highlights/contact to send us your feedback
R-Weekly Highlights on the Podcastindex.org - You can send a boost into the show directly in the Podcast Index. First, top-up with Alby, and then head over to the R-Weekly Highlights podcast entry on the index.
A new way to think about value: https://value4value.info
Get in touch with us on social media
Eric Nantz: @rpodcast@podcastindex.social (Mastodon) and @theRcast (X/Twitter)
Mike Thomas: @mikethomas@fosstodon.org (Mastodon) and @mikeketchbrook (X/Twitter)
Music credits powered by OCRemix
The Amazon Session - Ducktales - Gux - https://ocremix.org/remix/OCR00402
Doomsday - Sonic & Knuckles - elzfernomusic - https://ocremix.org/remix/OCR02532
Highlights from this week's conversation include:
Clint's Background and Journey in Data (0:51)
Starting a Data Career (2:01)
Transition to Startup SaaS World (4:27)
Clint's Connection to a Federal Reserve Database (5:31)
Challenges in Predictive Modeling (10:27)
Data Input Challenges (15:50)
Marketers' Workflow and Data Integration (18:29)
Soft ROI vs. Hard ROI in Data Analysis (21:31)
Balancing Internal Marketing and Data Team's Value (22:35)
Simplifying Data Inputs for Predictive Models (25:09)
Data Analysis Workflow and Tech Stack (29:06)
Open Data Formats and Impact on Data Platforms (34:40)
The S3 and Ecosystem Model (37:08)
In-browser SQL Queries with DuckDB (39:24)
Data Security Concerns and Solutions (41:47)
Clean Rooms and Data Sharing (43:32)
Final Thoughts and Takeaways (47:35)
The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Topics covered in this episode: PSF Elections coming up Cloud engineer gets 2 years for wiping ex-employer's code repos Python: Import by string with pkgutil.resolve_name() DuckDB goes 1.0 Extras Joke Watch on YouTube About the show Sponsored by ScoutAPM: pythonbytes.fm/scout Connect with the hosts Michael: @mkennedy@fosstodon.org Brian: @brianokken@fosstodon.org Show: @pythonbytes@fosstodon.org Join us on YouTube at pythonbytes.fm/live to be part of the audience. Usually Tuesdays at 10am PT. Older video versions available there too. Finally, if you want an artisanal, hand-crafted digest of every week of the show notes in email form? Add your name and email to our friends of the show list, we'll never share it. Brian #1: PSF Elections coming up This is elections for the PSF Board and for 3 bylaw changes. To vote in the PSF election, you need to be a Supporting, Managing, Contributing, or Fellow member of the PSF, … And affirm your voting status by June 25. See Affirm your PSF Membership Voting Status for more details. Timeline Board Nominations open: Tuesday, June 11th, 2:00 pm UTC Board Nominations close: Tuesday, June 25th, 2:00 pm UTC Voter application cut-off date: Tuesday, June 25th, 2:00 pm UTC same date is also for voter affirmation. Announce candidates: Thursday, June 27th Voting start date: Tuesday, July 2nd, 2:00 pm UTC Voting end date: Tuesday, July 16th, 2:00 pm UTC See also Thinking about running for the Python Software Foundation Board of Directors? Let's talk! There's still one upcoming office hours session on June 18th, 12 PM UTC And For your consideration: Proposed bylaws changes to improve our membership experience 3 proposed bylaws changes Michael #2: Cloud engineer gets 2 years for wiping ex-employer's code repos Miklos Daniel Brody, a cloud engineer, was sentenced to two years in prison and a restitution of $529,000 for wiping the code repositories of his former employer in retaliation for being fired. 
The court documents state that Brody's employment was terminated after he violated company policies by connecting a USB drive. Brian #3: Python: Import by string with pkgutil.resolve_name() Adam Johnson You can use pkgutil.resolve_name("[HTML_REMOVED]:[HTML_REMOVED]") to import classes, functions or modules using strings. You can also use importlib.import_module("[HTML_REMOVED]"). Both techniques hand you the imported object without binding its name in the local namespace. Michael #4: DuckDB goes 1.0 via Alex Monahan The cloud hosted product @MotherDuck also opened up General Availability Codenamed "Snow Duck" The core theme of the 1.0.0 release is stability. Extras Brian: Sending us topics. Please send before Tuesday. But any time is welcome. NumPy 2.0 htmx 2.0.0 Michael: Get 6 months of PyCharm Pro for free. Just take a course (even a free one) at Talk Python Training. Then visit your account page > details tab and have fun. Coming soon at Talk Python: Shiny for Python Joke: .gitignore thoughts won't let me sleep
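The import-by-string technique from the Python Bytes notes above can be sketched with the standard library alone. This is a minimal illustration, not code from the episode: the targets `collections:OrderedDict` and `json` are arbitrary stdlib examples chosen here, and `load_object` is a hypothetical helper name.

```python
import importlib
import pkgutil


def load_object(spec: str):
    """Resolve a 'package.module:object.attr' string to the object it names.

    Hypothetical helper for illustration only; it simply wraps
    pkgutil.resolve_name, which is available from Python 3.9.
    """
    return pkgutil.resolve_name(spec)


# "module:object" form: the part before the colon is imported,
# the part after it is looked up as an attribute of that module.
ordered_dict_cls = load_object("collections:OrderedDict")

# importlib.import_module takes a module name only and returns the module.
json_module = importlib.import_module("json")

# Neither call binds "OrderedDict" or "json" as names in this namespace;
# we only hold the returned objects under names we chose ourselves.
counts = ordered_dict_cls(a=1, b=2)
encoded = json_module.dumps(counts)
print(encoded)
```

This pattern tends to be useful when the import target comes from configuration, for example a settings file that names a class by its dotted path.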
Fredrik talks about Facebook's answer to HTMX, Microsoft's Recall fiasco, and action items from retrospectives. To start, there is some talk prompted by the idea of saving the server's resources. Fredrik brings up some feedback on the episode about a world without React and found a wonderful rabbit hole: Facebook's HTMX-like track, Primer. HTMX, from 2010! Well, why not? Do any of us have more users today than Facebook had in 2010? Tempting as it is, you can also see explanations for why it ended up being React instead of Primer. At least for Facebook. After that come some quick podcast tips, and the big question of what counts as over-engineering. Do you have conference tips? Or lists of conferences? Some early Øredev hype is delivered: the program has been released, and we gladly accept tips on people to talk to and questions to ask. Something less encouraging: Microsoft's local recording and search feature Recall is a disaster. Why was it even allowed to be announced in the state the first version was in? Why should external experts even need to speak up about things that should have been caught and fixed internally? Last but not least: retrospectives! Developers and others have talked about retrospectives, and Fredrik is mostly thinking about how to come up with good action items that get done and move things forward. A big thank you to Cloudnet, who sponsor our VPS! Do you have comments, questions, or tips? We are @kodsnack, @thieta, @krig, and @bjoreman on Mastodon, have a page on Facebook, and can be emailed at info@kodsnack.se if you want to write at greater length. We read everything that is sent. If you like Kodsnack, you are very welcome to review us in iTunes! You can also support the podcast by buying us a coffee (or two!) on Ko-fi, or by buying something in our shop.
Links Kodsnack 580 - A world without React Bartek HTMX The blog post about Primer The JSConf presentation about Primer Makinde Adeagbo Primer in a GitHub gist Andreas Ekeroot 587 - the latest Kodsnack Developer Voices with DuckDB creator Hannes Mühleisen DuckDB Kodsnack on Ko-fi The Swedish developer podcasts list on LinkedIn, posted by Cecilia Wirén Justin Hall links.net Snack overflow Stack overflow Snack overflow on over-engineering Babel jq The episode about jq rq, yq, and xq - some other *q tools fq The Baader-Meinhof phenomenon Tobbe 583 - The episode with Tobbe about Redwood Jsday Grusp The Øredev program Kent Beck Leandro Riot Emil 573 - The Riot episode Webbhuset Daniel Stenberg Curl Video of the presentation HTTP/3 Kodsnack 331 - With Daniel on HTTP/3 Microsoft Recall - written before all the problems started being documented Rewind Rewind on how they recorded securely Kevin Beaumont on the flaws in Recall Microsoft changes Recall a little Windows Hello Retrospectives Developers! Developers! on retrospectives GTD WWDC Malin and Kai and Mercury Weather Titles HTML over the wire Save the server The server knows this data stuff So contemporary We don't have less JavaScript today Clicks on links A base layer of JavaScript 300 engineers on the same web page Developer Sweden's best-known unknown Data security question marks Small, atomic, unambiguous An action item that is more of a project My retrospective items
This is a recap of the top 10 posts on Hacker News on June 3rd, 2024. This podcast was generated by wondercraft.ai
(00:30): How many photons are received per bit transmitted from Voyager 1? Original post: https://news.ycombinator.com/item?id=40561872&utm_source=wondercraft_ai
(02:36): Hacking millions of modems and investigating who hacked my modem Original post: https://news.ycombinator.com/item?id=40570781&utm_source=wondercraft_ai
(04:15): I Am So Sick of Leetcode-Style Interviews Original post: https://news.ycombinator.com/item?id=40571395&utm_source=wondercraft_ai
(05:39): Diffusion on syntax trees for program synthesis Original post: https://news.ycombinator.com/item?id=40569531&utm_source=wondercraft_ai
(07:22): If English was written like Chinese (1999) Original post: https://news.ycombinator.com/item?id=40565060&utm_source=wondercraft_ai
(08:51): FBI Raids Big Corporate Landlord over Nationwide Rent Hikes Original post: https://news.ycombinator.com/item?id=40562834&utm_source=wondercraft_ai
(10:33): Why YC went to DC Original post: https://news.ycombinator.com/item?id=40564639&utm_source=wondercraft_ai
(12:08): DuckDB 1.0.0 Original post: https://news.ycombinator.com/item?id=40562342&utm_source=wondercraft_ai
(13:31): What if they gave an Industrial Revolution and nobody came? (2023) Original post: https://news.ycombinator.com/item?id=40562741&utm_source=wondercraft_ai
(15:14): Oldest largest German Minecraft server shut down and open sourced everything Original post: https://news.ycombinator.com/item?id=40566533&utm_source=wondercraft_ai
This is a third-party project, independent from HN and YC. Text and audio generated using AI, by wondercraft.ai. Create your own studio quality podcast with text as the only input in seconds at app.wondercraft.ai. Issues or feedback? We'd love to hear from you: team@wondercraft.ai
Find the links for this episode in the show notes at https://bigdatahebdo.com/podcast/episode-196-python-news-et-autres/ This publication is sponsored by Datatask and CerenIT. CerenIT helps you design, industrialize, or automate your platforms, and also helps you make your time-series data speak. Write to us at contact@cerenit.fr and also find us at Time Series France. Datatask supports you in all your Cloud and Data projects, to imagine, experiment with, and run your services! See the Datatask blog to learn more. We're hiring! Come crunch data with us! Write to us at recrutement@affini-tech.com. The theme music was composed and produced by Maxence Lecointe.
In this episode, we sat down with Tomasz Tunguz (https://twitter.com/ttunguz), the founder of Theory Ventures and a leading voice in the tech investment space. We discussed the transformative potential of Ethereum as a database company, the importance of data security in a decentralized world, and the evolving landscape of AI technologies from foundational models to AI-native applications.
An aesthetically-pleasing journey through the history of R, another demonstration of DuckDB's power with analytics, and how webR with shinylive brings new learning life to the Pharmaverse TLG gallery.
Episode Links
This week's curator: Sam Parmar - @parmsam@fosstodon.org (Mastodon) & @parmsam_ (X/Twitter)
The Aesthetics Wiki - an R Addendum
R Dplyr vs. DuckDB - How to Enhance Your Data Processing Pipelines with R DuckDB
TLG Catalog
Ibis is a Python library that offers a single data-frame API, from Python, which can run your queries on many different backends. These include databases like Postgres, but also commercial vendors like BigQuery and Snowflake. This ability to control multiple backends from a single API has a lot of use-cases, as well as maintainer challenges, all of which are discussed in this episode. To learn more about Ibis, check out the docs here: https://ibis-project.org/ If you're attending PyCon US this year, you may be interested in Philip's talk: https://us.pycon.org/2024/schedule/presentation/55/ During the podcast, Philip also mentioned a blogpost about DuckDB, here: https://ibis-project.org/posts/why-duckdb/ There was also a dogfooding blogpost, which is this one: https://ibis-project.org/posts/ci-analysis/
How do you debug your EF queries? Carl and Richard talk to Giorgi Dalakishvili about his open-source Visual Studio extension, EFCore Visualizer. Giorgi talks about bringing together the EF rendering of the query with the database query plan to ensure you retrieve data from your database as efficiently as possible. The conversation ranges over a number of tools Giorgi has built over the years, including EF Framework Exceptions, DuckDB.NET, and more!
Redis' re-licensing prompts forks like Drew DeVault's Redict, Matthew Miller thinks we need more community built software, Paul Gross makes the case that DuckDB is the new jq, Anton Zhiyanov shares how he makes a living as a developer despite being “pretty dumb” & Baldur Bjarnason chimes in on the state of the web developer job market.
NService Bus This episode of The Modern .NET Show is supported, in part, by NServiceBus, the ultimate tool to build robust and reliable systems that can handle failures gracefully, maintain high availability, and scale to meet growing demand. Make sure you click the link in the show notes to learn more about NServiceBus. Show Notes Yeah. So what I was thinking the other day is that what we want is to concentrate on the business logic that we need to implement and spend as little time as possible configuring, installing and figuring out the tools and libraries that we are using for this specific task. Like our mission is to produce the business logic and we should try to minimize the time that we spend on the tools and libraries that enable us to build the software. —Giorgi Dalakishvili Welcome to The Modern .NET Show! Formerly known as The .NET Core Podcast, we are the go-to podcast for all .NET developers worldwide and I am your host Jamie "GaProgMan" Taylor. In this episode, I spoke with Giorgi Dalakishvili about Postgresql, DuckDB, and where you might use either of them in your applications. As Giorgi points out, .NET has support for SQL Server baked in, but there's also support for other database technologies too: Yes, there are many database technologies and just like you, for me, SQL Server was the default go to database for quite a long time because it's from Microsoft. All the frameworks and libraries work with SQL Server out of the box, and have usually better support for SQL Server than for other databases. But recently I have been diving into Postgresql, which is a free database and I discovered that it has many interesting features and I think that many .NET developers will be quite excited about these features. They are very useful in some very specific scenarios. And it also has a very good support for .NET. Nowadays there is a .NET driver for Postgres, there is a .NET driver for Entity Framework core. 
So I would say it's not behind SQL server in terms of .NET support or feature wise. —Giorgi Dalakishvili He also points out that our specialist skill as developers is not to focus on the tools, libraries, and frameworks, but to use what we have in our collective toolboxes to build the business logic that our customers, clients, and users desire of us. And along the way, he drops some knowledge on an essential NuGet package for those of us who are using Entity Framework. So let's sit back, open up a terminal, type in dotnet new podcast and we'll dive into the core of Modern .NET. Supporting the Show If you find this episode useful in any way, please consider supporting the show by either leaving a review (check our review page for ways to do that), sharing the episode with a friend or colleague, buying the host a coffee, or considering becoming a Patron of the show. Full Show Notes The full show notes, including links to some of the things we discussed and a full transcription of this episode, can be found at: https://dotnetcore.show/season-6/from-net-to-DuckDB-unleashing-the-database-evolution-with-giorgi-dalakishvili/ Useful Links Giorgi's GitHub DuckDB .NET Driver Postgres Array data type Postgres Range data type DuckDB DbUpdateException EntityFramework.Exceptions JsonB data type Vector embeddings Cosine similarity Vector databases: Chroma qdrant pgvector pgvector .NET library OLAP queries parquet files Dapper DuckDB documentation Dapr DuckDB Wasm; run DuckDB in your browser GitHub Codespaces Connecting with Giorgi: on Twitter on LinkedIn on his website Supporting the show: Leave a rating or review Buy the show a coffee Become a patron Getting in touch: via the contact page joining the Discord Music created by Mono Memory Music, licensed to RJJ Software for use in The Modern .NET Show Remember to rate and review the show on Apple Podcasts, Podchaser, or wherever you find your podcasts, this will help the show's audience grow. 
Or you can just share the show with a friend. And don't forget to reach out via our Contact page. We're very interested in your opinion of the show, so please get in touch. You can support the show by making a monthly donation on the show's Patreon page at: https://www.patreon.com/TheDotNetCorePodcast.
This is a recap of the top 10 posts on Hacker News on March 21st, 2024. This podcast was generated by wondercraft.ai
(00:34): U.S. sues Apple, accusing it of maintaining an iPhone monopoly Original post: https://news.ycombinator.com/item?id=39778999&utm_source=wondercraft_ai
(02:04): Difftastic, a structural diff tool that understands syntax Original post: https://news.ycombinator.com/item?id=39778412&utm_source=wondercraft_ai
(04:09): The baffling intelligence of a single cell: The story of E. coli chemotaxis Original post: https://news.ycombinator.com/item?id=39777229&utm_source=wondercraft_ai
(06:02): The Reddits Original post: https://news.ycombinator.com/item?id=39778590&utm_source=wondercraft_ai
(07:45): Hackers found a way to open any of 3M hotel keycard locks Original post: https://news.ycombinator.com/item?id=39779291&utm_source=wondercraft_ai
(09:18): DuckDB as the New jq Original post: https://news.ycombinator.com/item?id=39782356&utm_source=wondercraft_ai
(11:06): Ludic: New framework for Python with seamless Htmx support Original post: https://news.ycombinator.com/item?id=39776199&utm_source=wondercraft_ai
(12:53): GoFetch: New side-channel attack using data memory-dependent prefetchers Original post: https://news.ycombinator.com/item?id=39779195&utm_source=wondercraft_ai
(14:34): Ikigai: What We Got Wrong and How to Find Meaning in Life Original post: https://news.ycombinator.com/item?id=39777896&utm_source=wondercraft_ai
(16:04): Research shows plant-based polymers can disappear within seven months Original post: https://news.ycombinator.com/item?id=39777898&utm_source=wondercraft_ai
This is a third-party project, independent from HN and YC. Text and audio generated using AI, by wondercraft.ai. Create your own studio quality podcast with text as the only input in seconds at app.wondercraft.ai. Issues or feedback? We'd love to hear from you: team@wondercraft.ai
Talk Python To Me - Python conversations for passionate developers
Do you have data that you pull from external sources or that is generated and appears at your digital doorstep? I bet that data needs to be processed, filtered, transformed, distributed, and much more. One of the biggest tools to create these data pipelines with Python is Dagster. And we are fortunate to have Pedram Navid on the show this episode. Pedram is the Head of Data Engineering and DevRel at Dagster Labs. And we're talking data pipelines this week at Talk Python. Episode sponsors Talk Python Courses Posit Links from the show Rock Solid Python with Types Course: training.talkpython.fm Pedram on Twitter: twitter.com Pedram on LinkedIn: linkedin.com Ship data pipelines with extraordinary velocity: dagster.io dagster-open-platform: github.com The Dagster Master Plan: dagster.io data load tool (dlt): dlthub.com DataFrames for the new era: pola.rs Apache Arrow: arrow.apache.org DuckDB is a fast in-process analytical database: duckdb.org Ship trusted data products faster: www.getdbt.com Watch this episode on YouTube: youtube.com Episode transcripts: talkpython.fm --- Stay in touch with us --- Subscribe to us on YouTube: youtube.com Follow Talk Python on Mastodon: talkpython Follow Michael on Mastodon: mkennedy
Summary Building a database engine requires a substantial amount of engineering effort and time investment. Over the decades of research and development into building these software systems there are a number of common components that are shared across implementations. When Paul Dix decided to re-write the InfluxDB engine he found the Apache Arrow ecosystem ready and waiting with useful building blocks to accelerate the process. In this episode he explains how he used the combination of Apache Arrow, Flight, Datafusion, and Parquet to lay the foundation of the newest version of his time-series database. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster (https://www.dataengineeringpodcast.com/dagster) today to get started. Your first 30 days are free! Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. 
And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Join us at the top event for the global data community, Data Council Austin. From March 26-28th 2024, we'll play host to hundreds of attendees, 100 top speakers and dozens of startups that are advancing data science, engineering and AI. Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working together to build the future of data and sharing their insights and learnings through deeply technical talks. As a listener to the Data Engineering Podcast you can get a special discount off regular priced and late bird tickets by using the promo code dataengpod20. Don't miss out on our only event this year! Visit dataengineeringpodcast.com/data-council (https://www.dataengineeringpodcast.com/data-council) and use code dataengpod20 to register today! Your host is Tobias Macey and today I'm interviewing Paul Dix about his investment in the Apache Arrow ecosystem and how it led him to create the latest PFAD in database design Interview Introduction How did you get involved in the area of data management? Can you start by describing the FDAP stack and how the components combine to provide a foundational architecture for database engines? This was the core of your recent re-write of the InfluxDB engine. What were the design goals and constraints that led you to this architecture? Each of the architectural components are well engineered for their particular scope. What is the engineering work that is involved in building a cohesive platform from those components? 
One of the major benefits of using open source components is the network effect of ecosystem integrations. That can also be a risk when the community vision for the project doesn't align with your own goals. How have you worked to mitigate that risk in your specific platform? Can you describe the operational/architectural aspects of building a full data engine on top of the FDAP stack? What are the elements of the overall product/user experience that you had to build to create a cohesive platform? What are some of the other tools/technologies that can benefit from some or all of the pieces of the FDAP stack? What are the pieces of the Arrow ecosystem that are still immature or need further investment from the community? What are the most interesting, innovative, or unexpected ways that you have seen parts or all of the FDAP stack used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on/with the FDAP stack? When is the FDAP stack the wrong choice? What do you have planned for the future of the InfluxDB IOx engine and the FDAP stack? Contact Info LinkedIn (https://www.linkedin.com/in/pauldix/) pauldix (https://github.com/pauldix) on GitHub Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story. Links FDAP Stack Blog Post (https://www.influxdata.com/blog/flight-datafusion-arrow-parquet-fdap-architecture-influxdb/) Apache Arrow (https://arrow.apache.org/) DataFusion (https://arrow.apache.org/datafusion/) Arrow Flight (https://arrow.apache.org/docs/format/Flight.html) Apache Parquet (https://parquet.apache.org/) InfluxDB (https://www.influxdata.com/products/influxdb/) Influx Data (https://www.influxdata.com/) Podcast Episode (https://www.dataengineeringpodcast.com/influxdb-timeseries-data-platform-episode-199) Rust Language (https://www.rust-lang.org/) DuckDB (https://duckdb.org/) ClickHouse (https://clickhouse.com/) Voltron Data (https://voltrondata.com/) Podcast Episode (https://www.dataengineeringpodcast.com/voltron-data-apache-arrow-episode-346/) Velox (https://github.com/facebookincubator/velox) Iceberg (https://iceberg.apache.org/) Podcast Episode (https://www.dataengineeringpodcast.com/iceberg-with-ryan-blue-episode-52/) Trino (https://trino.io/) ODBC == Open DataBase Connectivity (https://en.wikipedia.org/wiki/Open_Database_Connectivity) GeoParquet (https://github.com/opengeospatial/geoparquet) ORC == Optimized Row Columnar (https://orc.apache.org/) Avro (https://avro.apache.org/) Protocol Buffers (https://protobuf.dev/) gRPC (https://grpc.io/) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
Summary Sharing data is a simple concept, but complicated to implement well. There are numerous business rules and regulatory concerns that need to be applied. There are also numerous technical considerations to be made, particularly if the producer and consumer of the data aren't using the same platforms. In this episode Andrew Jefferson explains the complexities of building a robust system for data sharing, the techno-social considerations, and how the Bobsled platform that he is building aims to simplify the process. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management
Your host is Tobias Macey and today I'm interviewing Andy Jefferson about how to solve the problem of data sharing Interview Introduction How did you get involved in the area of data management? Can you start by giving some context and scope of what we mean by "data sharing" for the purposes of this conversation? What is the current state of the ecosystem for data sharing protocols/practices/platforms? What are some of the main challenges/shortcomings that teams/organizations experience with these options? What are the technical capabilities that need to be present for an effective data sharing solution? How does that change as a function of the type of data? (e.g. tabular, image, etc.) What are the requirements around governance and auditability of data access that need to be addressed when sharing data? What are the typical boundaries along which data access requires special consideration for how the sharing is managed? Many data platform vendors have their own interfaces for data sharing. What are the shortcomings of those options, and what are the opportunities for abstracting the sharing capability from the underlying platform? What are the most interesting, innovative, or unexpected ways that you have seen data sharing/Bobsled used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on data sharing? When is Bobsled the wrong choice? What do you have planned for the future of data sharing? 
Contact Info LinkedIn (https://www.linkedin.com/in/andyjefferson/?originalSubdomain=de) Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story. Links Bobsled (https://www.bobsled.co/) OLAP == OnLine Analytical Processing (https://en.wikipedia.org/wiki/Online_analytical_processing) Cassandra (https://cassandra.apache.org/_/index.html) Podcast Episode (https://www.dataengineeringpodcast.com/cassandra-global-scale-database-episode-220) Neo4J (https://neo4j.com/) FTP == File Transfer Protocol (https://en.wikipedia.org/wiki/File_Transfer_Protocol) S3 Access Points (https://aws.amazon.com/s3/features/access-points/) Snowflake Sharing (https://docs.snowflake.com/en/guides-overview-sharing) BigQuery Sharing (https://cloud.google.com/bigquery/docs/authorized-datasets) Databricks Delta Sharing (https://www.databricks.com/product/delta-sharing) DuckDB (https://duckdb.org/) Podcast Episode (https://www.dataengineeringpodcast.com/duckdb-in-process-olap-database-episode-270/) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
SQLite could do with a little competition, so when I invited the co-creator of DuckDB in to talk, I thought we'd be discussing the perils of trying to build a new in-process database engine. I quickly realised things went much deeper than just a tech refresh.

Hannes Mühleisen joins me this week to blend his academic credentials as a database researcher with his vehement need to make that research practical. And so we dive into what modern database literature has to say on making queries faster, more parallelizable, and closer to the metal, and how it all comes together in a user-friendly package that's found its way into my day-to-day workload, and might well help out yours.

If you're curious about the gory details of database queries, how they can take advantage of modern hardware, or how all that research actually turns into a useful tool, Hannes has some great answers.

--

DuckDB: https://duckdb.org/
Database Systems Book: http://infolab.stanford.edu/~ullman/dscb.html
Kris' first computer: https://en.wikipedia.org/wiki/File:ZX_Spectrum_Plus2_(retouched).jpg
Volcano Query Evaluation System [pdf]: https://paperhub.s3.amazonaws.com/dace52a42c07f7f8348b08dc2b186061.pdf
Morsel Query Engine [pdf]: https://cs.brown.edu/~kayhan/papers/morsel_cp.pdf
Unnesting Arbitrary Queries [pdf]: https://cs.emis.de/LNI/Proceedings/Proceedings241/383.pdf
Papers Hannes' team have published: https://duckdb.org/why_duckdb#peer-reviewed-papers-and-thesis-works
DuckDB on Mastodon: https://mastodon.social/@duckdb
Kris on Twitter: https://twitter.com/krisajenkins
Kris on LinkedIn: https://www.linkedin.com/in/krisjenkins/
Kris on Mastodon: https://mastodon.social/@krisajenkins

--

#softwaredevelopment #podcast #programming #database #duckdb #sql #sqlite
Topics covered in this episode:
  Leaving the cloud
  PEP 723 - Inline script metadata
  Flet for Android
  harlequin: The SQL IDE for Your Terminal.
  Extras
  Joke

Watch on YouTube

About the show
Sponsored by Bright Data: pythonbytes.fm/brightdata

Connect with the hosts
  Michael: @mkennedy@fosstodon.org
  Brian: @brianokken@fosstodon.org
  Show: @pythonbytes@fosstodon.org

Join us on YouTube at pythonbytes.fm/live to be part of the audience. Usually Tuesdays at 11am PT. Older video versions available there too.

Michael #1: Leaving the cloud
  Also see Five values guiding our cloud exit
    We value independence above all else.
    We serve the internet.
    We spend our money wisely.
    We lead the way.
    We seek adventure.
  And We stand to save $7m over five years from our cloud exit
  Slice our new monster 192-thread Dell R7625s into isolated VMs
    Which added a combined 4,000 vCPUs with 7,680 GB of RAM and 384TB of NVMe storage to our server capacity
  They created Kamal — Deploy web apps anywhere
  A lot of these ideas have changed how I run the infrastructure at Talk Python and for Python Bytes.

Brian #2: PEP 723 - Inline script metadata
  Author: Ofek Lev
  This PEP specifies a metadata format that can be embedded in single-file Python scripts to assist launchers, IDEs and other external tools which may need to interact with such scripts.
  Example:

    # /// script
    # requires-python = ">=3.11"
    # dependencies = [
    #   "requests<3",
    #   "rich",
    # ]
    # ///

    import requests
    from rich.pretty import pprint

    resp = requests.get("https://peps.python.org/api/peps.json")
    data = resp.json()
    pprint([(k, v["title"]) for k, v in data.items()][:10])

Michael #3: Flet for Android
  via Balázs
  Remember Flet? Here's a code sample (scroll down a bit).
  It's amazing but has been basically impossible to deploy.
  Now we have Android.
  Here's a good YouTube video showing the build process for APKs.

Brian #4: harlequin: The SQL IDE for Your Terminal.
  Ted Conbeer & other contributors
  Works with DuckDB and SQLite
  Speaking of SQLite
    Jeff Triplett and warnings of using Docker and SQLite in production
    Anže's post and article: Django, SQLite, and the Database is Locked Error

Extras
  Brian:
    Recent Python People episodes
      Will Vincent
      Julian Sequeira
      Pamela Fox
  Michael:
    PageFind and how I'm using it
    When "Everything" Becomes Too Much: The npm Package Chaos of 2024
    Essay: Unsolicited Advice for Mozilla and Firefox
    SciPy 2024 is coming to Washington

Joke: Careful with that bike lock combination code
Summary The "modern data stack" promised a scalable, composable data platform that gave everyone the flexibility to use the best tools for every job. The reality was that it left data teams in the position of spending all of their engineering effort on integrating systems that weren't designed with compatible user experiences. The team at 5X understand the pain involved and the barriers to productivity and set out to solve it by pre-integrating the best tools from each layer of the stack. In this episode founder Tarush Aggarwal explains how the realities of the modern data stack are impacting data teams and the work that they are doing to accelerate time to value. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack (https://www.dataengineeringpodcast.com/rudderstack) You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It's the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it's real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. 
Go to dataengineeringpodcast.com/materialize (https://www.dataengineeringpodcast.com/materialize) today to get 2 weeks free! Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Your host is Tobias Macey and today I'm welcoming back Tarush Aggarwal to talk about what he and his team at 5x data are building to improve the user experience of the modern data stack. Interview Introduction How did you get involved in the area of data management? Can you describe what 5x is and the story behind it? We last spoke in March of 2022. What are the notable changes in the 5x business and product? What are the notable shifts in the data ecosystem that have influenced your adoption and product direction? What trends are you most focused on tracking as you plan the continued evolution of your offerings? What are the points of friction that teams run into when trying to build their data platform? Can you describe design of the system that you have built? What are the strategies that you rely on to support adaptability and speed of onboarding for new integrations? 
What are some of the types of edge cases that you have to deal with while integrating and operating the platform implementations that you design for your customers? What is your process for selection of vendors to support? How would you characterize your relationships with the vendors that you rely on? For customers who have pre-existing investment in a portion of the data stack, what is your process for engaging with them to understand how best to support their goals? What are the most interesting, innovative, or unexpected ways that you have seen 5XData used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on 5XData? When is 5X the wrong choice? What do you have planned for the future of 5X? Contact Info LinkedIn (https://www.linkedin.com/in/tarushaggarwal/) @tarush (https://twitter.com/tarush) on Twitter Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story. 
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers Links 5X (https://5x.co) Informatica (https://www.informatica.com/) Snowflake (https://www.snowflake.com/en/) Podcast Episode (https://www.dataengineeringpodcast.com/snowflakedb-cloud-data-warehouse-episode-110/) Looker (https://cloud.google.com/looker/) Podcast Episode (https://www.dataengineeringpodcast.com/looker-with-daniel-mintz-episode-55/) DuckDB (https://duckdb.org/) Podcast Episode (https://www.dataengineeringpodcast.com/duckdb-in-process-olap-database-episode-270/) Redshift (https://aws.amazon.com/redshift/) Reverse ETL (https://medium.com/memory-leak/reverse-etl-a-primer-4e6694dcc7fb) Fivetran (https://www.fivetran.com/) Podcast Episode (https://www.dataengineeringpodcast.com/fivetran-data-replication-episode-93/) Rudderstack (https://www.rudderstack.com/) Podcast Episode (https://www.dataengineeringpodcast.com/rudderstack-open-source-customer-data-platform-episode-263/) Peak.ai (https://peak.ai/) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
Summary If your business metrics looked weird tomorrow, would you know about it first? Anomaly detection is focused on identifying those outliers for you, so that you are the first to know when a business critical dashboard isn't right. Unfortunately, it can often be complex or expensive to incorporate anomaly detection into your data platform. Andrew Maguire got tired of solving that problem for each of the different roles he has ended up in, so he created the open source Anomstack project. In this episode he shares what it is, how it works, and how you can start using it today to get notified when the critical metrics in your business aren't quite right. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It's the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it's real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize (https://www.dataengineeringpodcast.com/materialize) today to get 2 weeks free! Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. 
Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack (https://www.dataengineeringpodcast.com/rudderstack) Data projects are notoriously complex. With multiple stakeholders to manage across varying backgrounds and toolchains even simple reports can become unwieldy to maintain. Miro is your single pane of glass where everyone can discover, track, and collaborate on your organization's data. I especially like the ability to combine your technical diagrams with data documentation and dependency mapping, allowing your data engineers and data consumers to communicate seamlessly about your projects. Find simplicity in your most complex projects with Miro. Your first three Miro boards are free when you sign up today at dataengineeringpodcast.com/miro (https://www.dataengineeringpodcast.com/miro). That's three free boards at dataengineeringpodcast.com/miro (https://www.dataengineeringpodcast.com/miro). Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. 
Your host is Tobias Macey and today I'm interviewing Andrew Maguire about his work on the Anomstack project and how you can use it to run your own anomaly detection for your metrics Interview Introduction How did you get involved in the area of data management? Can you describe what Anomstack is and the story behind it? What are your goals for this project? What other tools/products might teams be evaluating while they consider Anomstack? In the context of Anomstack, what constitutes a "metric"? What are some examples of useful metrics that a data team might want to monitor? You put in a lot of work to make Anomstack as easy as possible to get started with. How did this focus on ease of adoption influence the way that you approached the overall design of the project? What are the core capabilities and constraints that you selected to provide the focus and architecture of the project? Can you describe how Anomstack is implemented? How have the design and goals of the project changed since you first started working on it? What are the steps to getting Anomstack running and integrated as part of the operational fabric of a data platform? What are the sharp edges that are still present in the system? What are the interfaces that are available for teams to customize or enhance the capabilities of Anomstack? What are the most interesting, innovative, or unexpected ways that you have seen Anomstack used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Anomstack? When is Anomstack the wrong choice? What do you have planned for the future of Anomstack? Contact Info LinkedIn (https://www.linkedin.com/in/andrewm4894/) Twitter (https://twitter.com/@andrewm4894) GitHub (http://github.com/andrewm4894) Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. 
Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story. To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers Links Anomstack Github repo (http://github.com/andrewm4894/anomstack) Airflow Anomaly Detection Provider Github repo (https://github.com/andrewm4894/airflow-provider-anomaly-detection) Netdata (https://www.netdata.cloud/) Metric Tree (https://www.datacouncil.ai/talks/designing-and-building-metric-trees) Semantic Layer (https://en.wikipedia.org/wiki/Semantic_layer) Prometheus (https://prometheus.io/) Anodot (https://www.anodot.com/) Chaos Genius (https://www.chaosgenius.io/) Metaplane (https://www.metaplane.dev/) Anomalo (https://www.anomalo.com/) PyOD (https://pyod.readthedocs.io/) Airflow (https://airflow.apache.org/) DuckDB (https://duckdb.org/) Anomstack Gallery (https://github.com/andrewm4894/anomstack/tree/main/gallery) Dagster (https://dagster.io/) InfluxDB (https://www.influxdata.com/) TimeGPT (https://docs.nixtla.io/docs/timegpt_quickstart) Prophet (https://facebook.github.io/prophet/) GreyKite (https://linkedin.github.io/greykite/) OpenLineage (https://openlineage.io/) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)