Podcasts about Apache Iceberg

  • 53 PODCASTS
  • 131 EPISODES
  • 42m AVG DURATION
  • 1 EPISODE EVERY OTHER WEEK
  • LATEST: May 18, 2026

POPULARITY (chart by year, 2019 to 2026)


Latest podcast episodes about Apache Iceberg

Streaming Audio: a Confluent podcast about Apache Kafka
How AI Is Changing Apache Iceberg with Russell Spitzer | Ep. 30

May 18, 2026 • 40:57


Adi Polak talks to Russell Spitzer (Snowflake) about his career in open source data infrastructure. Russell's first job: software engineer in test at DataStax. His challenge: making Apache Iceberg ready for AI and streaming.

SEASON 2
Hosted by Tim Berglund, Adi Polak and Viktor Gamov
Produced and Edited by Noelle Gallagher, Peter Furia and Nurie Mohamed
Music by Coastal Kites
Artwork by Phil Vo

Patoarchitekci
Google Cloud Next '26

May 15, 2026 • 12:32


“Google Cloud Next: boring beyond words.” Szymon on 260 announcements in which the word "agent" appears more often than the letter D. The rebranding of Vertex as Gemini Enterprise Agent Platform sounds like a name generator running at full throttle, but beneath the marketing noise there are concrete items: TPU Gen 8 with native PyTorch and no code rewrites (“after years they understood that not everyone is Google”) and Apache Iceberg for querying data from other clouds.

AWS - Il podcast in italiano
Le novità nel mondo storage (What's new in the storage world)

May 4, 2026 • 33:44


What is S3 Vectors, and why does it radically simplify vector data management for AI applications? What are table buckets, and how does Amazon S3 Tables bring Apache Iceberg tables directly into S3? How does Tag-Based Access Control work on S3? Why is Post-Quantum TLS a feature to enable right away for anyone handling sensitive data? Today we discuss this and more with Antonio Aga Rossi, Principal Solutions Architect at AWS.

Useful links:
- Amazon S3 Vectors
- Amazon S3 Tables
- Introducing attribute-based access control for Amazon S3 general purpose buckets
- Using hybrid post-quantum TLS with Amazon S3

Cloud Wars Live with Bob Evans
Tirthankar Lahiri on Why Agentic AI Must Live Inside the Database | Cloud Wars Live

Apr 28, 2026 • 21:44


In this episode of Cloud Wars Live, Bob Evans sits down with Tirthankar Lahiri, Senior Vice President for Mission-Critical Data and AI Engines at Oracle. Lahiri explains how agentic AI is transforming enterprise applications from simple question-answer systems into action-driven platforms that can reason, remember, and securely execute tasks. He details Oracle's strategy around unified agent memory, private agent factories, deep data security, and open development standards, all designed to help customers build scalable, secure, and flexible AI systems without added cost.

AI Built Securely

The Big Themes:

  • Agentic AI Becomes Action-Oriented: Tirthankar Lahiri explains that agentic AI represents the next major step beyond generative AI. While generative AI focused largely on answering questions and producing content, agentic AI is designed to take action. It allows businesses to build systems that can reason, decide, and execute tasks autonomously. Oracle sees this as the future of application development, where AI becomes embedded into workflows rather than functioning as a standalone tool.
  • Oracle Builds AI Directly Into the Database: Rather than forcing customers to move data across multiple isolated systems, Oracle's approach is to bring AI directly to the data. Lahiri argues that data is the “ground truth” and moving it creates technical debt, silos, inefficiency, and security vulnerabilities. Oracle's converged database architecture supports multiple data types, including relational, graph, spatial, and vector, inside one unified environment. This eliminates the need for separate repositories and allows AI agents to access all relevant context without fragmentation.
  • Deep Data Security Protects Against AI Risks: Lahiri strongly emphasizes that traditional application-layer security is no longer enough in the age of AI. Since AI can generate SQL and potentially bypass interface restrictions through prompt injection, businesses must secure data directly at the source. Oracle calls this “deep data security.” He uses the analogy of protecting valuables in a safe bolted to the floor rather than simply locking the front gate. Even if someone gets inside the house, the valuables remain protected. Similarly, Oracle enforces security policies at the database level, ensuring agents can only access data users are authorized to see.

The Big Quote: "You need to secure data. Need to lock your valuables into the safe deep inside the house."

More from Tirthankar Lahiri and Oracle: Connect with Lahiri on LinkedIn or learn more about Oracle AI Database. Visit Cloud Wars for more.

Intervista Pythonista
Apache Iceberg. Python e Caffè

Apr 18, 2026 • 12:55


Marco and Cesare explore Apache Iceberg, the open source table format that is redefining how companies manage data at petabyte scale. We'll find out how its architecture works, why giants like Netflix and Apple have adopted it, and why it has become the de facto standard in modern data engineering.

Techzine Talks
Snowflake brengt AI-modellen naar je data binnen je veilige klantomgeving (Snowflake brings AI models to your data inside your secure customer environment)

Mar 30, 2026 • 49:26


Martin Frederik, country manager for Snowflake Benelux, explains how Snowflake brings AI models such as Claude, GPT-4 and Mistral directly to your data, without the data having to leave your secure environment. He discusses practical applications of AI in call centers, engineering and business intelligence, and explains how companies can choose between models based on quality and cost.

The discussion covers the challenges of data integration, the importance of open source formats such as Apache Iceberg, and how Snowflake helps companies become more independent. AI agents, automation and the tension between innovation and security also come up, as does the demand for data sovereignty in Europe.

Key topics:
  • How AI models run inside Snowflake next to your own data
  • Practical AI applications in productivity and automation
  • Data silos versus central data platforms
  • Security, governance and data sovereignty
  • The future of AI agents and agentic workflows
  • Open source strategy and avoiding vendor lock-in

Chapters:
0:08 - Introduction: data and Snowflake
0:47 - Market developments and AI impact
2:40 - AI models inside Snowflake
10:47 - Data silos and integration
18:02 - Practical AI applications
28:50 - AI agents and automation
42:33 - Data sovereignty and security

Keywords: Snowflake, data platform, AI models, machine learning, data warehouse, cloud computing, data security, data sovereignty, AI agents, Anthropic Claude, OpenAI, business intelligence, data governance

The Analytics Engineering Podcast
The Iceberg ecosystem today (w/ Anders Swanson)

Mar 8, 2026 • 54:51


Tristan sits down with Anders Swanson, a developer experience advocate at dbt Labs, to talk about the state of the Apache Iceberg ecosystem. They unpack the "open standards" shift, define the core building blocks (query engines, object stores, catalogs), and dig into why external catalogs have become a fourth namespace tier across platforms. Anders outlines a pragmatic, phased adoption model for Iceberg integrations, explains why metadata performance and resiliency are hard requirements, and clarifies why vended credentials exist and what they solve. For full show notes and to read 6+ years of back issues of the podcast's companion newsletter, head to https://roundup.getdbt.com. The Analytics Engineering Podcast is sponsored by dbt Labs.

GOTO - Today, Tomorrow and the Future
Building Planetary-Scale Data Systems with Venice • Felix GV & Olimpiu Pop

Mar 3, 2026 • 28:38


This interview was recorded for GOTO Unscripted.
https://gotopia.tech

Check out more here:
https://gotopia.tech/articles/421

Félix GV - Current Interests: Multi-Planetary Databases, Data Sovereignty & Lifelogging
Olimpiu Pop - Technologist & Tech Journalist

RESOURCES
Félix
https://bsky.app/profile/felixgv.ninja
https://github.com/FelixGV
https://www.linkedin.com/in/felixgv

Olimpiu
https://x.com/olimpiupop
https://github.com/zroll
https://www.linkedin.com/in/olimpiupop

Links
https://venicedb.org
https://github.com/linkedin/venice
https://rocksdb.org
https://duckdb.org

DESCRIPTION
Félix GV, a former engineer at LinkedIn and architect of the Venice database system, discusses the complexity of building planetary-scale data systems. He explains Venice's unbundled architecture, where each component, from Kafka-based pub/sub to RocksDB-powered servers, operates as an independent distributed system. Félix details their rigorous chaos engineering practices, including regular load tests that push data centers beyond normal capacity to ensure reliability.

The discussion covers fundamental distributed systems concepts like the CAP theorem and the trade-offs between consistency and availability in multi-region deployments. He also explains why Venice, as a derived data system, deliberately sacrifices strong consistency for high throughput and availability, and concludes by discussing their experimental integration of DuckDB for SQL-based analytics and data exploration capabilities.

RECOMMENDED BOOKS
Kasun Indrasiri & Danesh Kuruppu • gRPC: Up and Running • https://amzn.to/3sBGBJJ
Tomer Shiran, Jason Hughes & Alex Merced • Apache Iceberg: The Definitive Guide • https://amzn.to/488Z30k
William Smith • Arrow Flight Protocols and Practices • https://amzn.to/4o2Q2fd
Adi Polak • Scaling Machine Learning with Spark • https://amzn.to/3N9vx1H
Mark Needham, Michael Hunger & Michael Simons • DuckDB in Action • https://amzn.to/45QwSli
Simon Aubury & Ned Letcher • Getting Started with DuckDB • https://amzn.to/3VPk4q

CHANNEL MEMBERSHIP BONUS
Join this channel to get early access to videos & other perks:
https://www.youtube.com/channel/UCs_tLP3AiwYKwdUHpltJPuA/join

Looking for a unique learning experience?
Attend the next GOTO conference near you! Get your ticket: gotopia.tech

SUBSCRIBE TO OUR YOUTUBE CHANNEL - new videos posted daily!

The Analytics Engineering Podcast
Apache Iceberg and the catalog layer (w/ Russell Spitzer)

Jan 25, 2026 • 53:29


Tristan talks with Russell Spitzer, a PMC member of Apache Iceberg and principal engineer at Snowflake, about the evolution of open table formats and the catalog layer. They dig into identity and access at the catalog layer and why consensus‑driven standards make interoperability possible. For full show notes and to read 6+ years of back issues of the podcast's companion newsletter, head to https://roundup.getdbt.com. The Analytics Engineering Podcast is sponsored by dbt Labs.

GOTO - Today, Tomorrow and the Future
Building Modern Databases with the FDAP Stack • Andrew Lamb & Olimpiu Pop

Jan 20, 2026 • 30:22


This interview was recorded for GOTO Unscripted.
https://gotopia.tech

Check out more here:
https://gotopia.tech/articles/412

Andrew Lamb - Staff Engineer at InfluxData, ASF Member & PMC Apache DataFusion & Apache Arrow
Olimpiu Pop - Technologist & Tech Journalist

RESOURCES
Andrew
https://bsky.app/profile/andrewlamb1111.bsky.social
https://x.com/andrewlamb1111
https://github.com/alamb
https://www.linkedin.com/in/andrewalamb
https://andrew.nerdnetworks.org

Olimpiu
https://x.com/olimpiupop
https://github.com/zroll
https://www.linkedin.com/in/olimpiupop

Links
https://www.influxdata.com/blog/flight-datafusion-arrow-parquet-fdap-architecture-influxdb
https://www.cidrdb.org/cidr2005/papers/P19.pdf

DESCRIPTION
Olimpiu Pop speaks with Andrew Lamb, staff engineer at InfluxData and PMC member of Apache DataFusion and Apache Arrow, about how modern data systems are built using standardized open source components rather than being developed from scratch.

Andrew discusses the FDAP Stack (Flight, DataFusion, Arrow & Parquet), the shift from row-based to columnar data storage, and how technologies like Apache Iceberg are enabling a new era of interoperability across data platforms. The discussion covers why this modular approach saves years of development time while providing better performance and compatibility.

RECOMMENDED BOOKS
Kasun Indrasiri & Danesh Kuruppu • gRPC: Up and Running • https://amzn.to/3sBGBJJ
Tomer Shiran, Jason Hughes & Alex Merced • Apache Iceberg: The Definitive Guide • https://amzn.to/488Z30k
William Smith • Arrow Flight Protocols and Practices • https://amzn.to/4o2Q2fd
Matthew Topol • In-Memory Analytics with Apache Arrow • https://amzn.to/4oJQ6BM
Apache Parquet A Complete Guide • https://amzn.to/4i7HVN6

CHANNEL MEMBERSHIP BONUS
Join this channel to get early access to videos & other perks:
https://www.youtube.com/channel/UCs_tLP3AiwYKwdUHpltJPuA/join

Looking for a unique learning experience?
Attend the next GOTO conference near you! Get your ticket: gotopia.tech

SUBSCRIBE TO OUR YOUTUBE CHANNEL - new videos posted daily!

The Ravit Show
Inside Iceberg: Real Recovery Paths Across S3, DynamoDB, And Iceberg

Jan 19, 2026 • 6:33


At AWS re:Invent I spoke with Woon Ho Jung, CTO for Cloud Native at Commvault, about how they are helping AWS customers protect more than just one type of workload.

We spoke about how they started with BackTrack for S3 and now support DynamoDB and Apache Iceberg, and what real problem that solves when your data is spread across so many services.

For teams who are new to Apache Iceberg on AWS, I asked Woon to break down the basics: what do you need in place so that recovery is not a theory, but something you can rely on when a table, job, or pipeline goes wrong?

If you care about resilience across modern AWS workloads, this one will be worth watching.

#data #ai #awsreinvent #aws #agents Amazon Web Services (AWS) AWS Partners AWS Events #awspartners #awscompetencypartners #agenticai #theravitshow

Engineering Kiosk
#248 Data as a Product: Die Struktur & Skalierung von Data-Teams mit Mario Müller von Veeva (Data as a Product: The Structure & Scaling of Data Teams with Mario Müller of Veeva)

Dec 30, 2025 • 78:44 • Transcription Available


Data as a Product: what's behind it?

Why is AI everywhere, yet the path from the database to "Wow, the model can do that" often feels like a black hole? You dutifully log events, the data lands in assorted silos, and still the decisive question stays open: who actually makes sure that raw data becomes a reliable, sellable data product?

In this episode we switch the light on right there. Together with Mario Müller, Director of Data Engineering at Veeva Systems, we look at what data teams really are, how "Data as a Product" works in practice, and why data engineering is more than pushing a few CSVs over FTP. We talk about team structures from the one-man show to the cross-functional squad, about ownership of the data, data governance, and how you actually measure data quality, including monitoring, alerts, SQL rules and human quality control.

There is also a solid helping of tech: Spark, AWS S3 as primary storage, Delta Lake, Athena, Glue, Airflow, push-pull instead of event overkill, and the decision for batch processing even though the whole world is calling for streaming.

And of course we also clarify what happens when AI tinkers with the data: where AI helps with bootstrapping, why production and scale get tricky, and why responsibility for a commit is not taken over by an LLM.

If you want to build data teams, need to deliver data products, or simply want to understand how data turns into reliable business impact, you're in exactly the right place.

Bonus: batch jobs get a small comeback today.

Our current advertising partners can be found at https://engineeringkiosk.dev/partners

Quick feedback on the episode:

The Datanation Podcast - Podcast for Data Engineers, Analysts and Scientists
2025 Reflections, Google Antigravity, NotebookLM, Dremio AI Agent, Pangolin Catalog, Dremioframe & Iceframe Python Libraries for Apache Iceberg

Dec 23, 2025 • 19:00


Alex Merced (AlexMerced.com) discusses:
– His thoughts on thriving in 2026
– His use of Google’s Antigravity
– His use of NotebookLM
– Dremio’s AI tools (dremio.com)
– More

Check out Pangolin Catalog at PangolinCatalog.org

The Analytics Engineering Podcast
Inside Snowflake's AI roadmap (w/ Chris Child)

Dec 14, 2025 • 57:27


Snowflake VP of Product Management Chris Child joins Tristan Handy to unpack Snowflake's AI roadmap and what it means for data teams. They discuss the evolution from Snowpark to Cortex and Snowflake Intelligence, how to govern agents with row- and column-level controls, and why Snowflake is investing in Apache Iceberg and the Open Semantic Interchange initiative (dbt Labs recently open sourced MetricsFlow, the technology that powers the dbt Semantic Layer, to align with the goals of OSI). Chris also shares a vision for the next five years of data engineering: fewer bespoke pipelines, more standardization and semantics, and a bigger focus on business context and data products. For full show notes and to read 6+ years of back issues of the podcast's companion newsletter, head to https://roundup.getdbt.com. The Analytics Engineering Podcast is sponsored by dbt Labs.

Cloud Wars Live with Bob Evans
Oracle's Juan Loaiza Discusses Trust, Privacy, Security in the Age of AI | Cloud Wars Live

Nov 3, 2025 • 18:56


Juan Loaiza is the EVP of Database Technologies at Oracle. In today's special episode of Cloud Wars Live, Loaiza joins Bob Evans to discuss how AI is transforming the way businesses interact with data. He spotlights Oracle's new AI-native database, the importance of trust and security in enterprise AI, and why business users now play a bigger role in data strategy. It's a revealing look at how Oracle is shaping the future of intelligent data systems.

The AI Data Revolution

The Big Themes:

  • Trust, Governance, and Privacy Must Be Built Into the AI‑Data Stack: One of the strongest points made by Loaiza is about the risk of AI in enterprises: hallucinations, mis‑use of data, privacy violations, regulatory consequences. When mission‑critical systems (hospitals, banks, telecoms) are involved, errors are unacceptable and can be illegal. Oracle's approach is to embed privacy and access controls down into the database engine: the system knows who the end user is, what they can see, and ensures AI cannot leak unauthorized data.
  • Multi‑Cloud, On‑Premises, Hybrid: Customers Want Flexibility: Loaiza describes how Oracle is enabling customers to run their database and AI workloads wherever they need: on‑premises, in public clouds (AWS, Azure, Google Cloud), or via “cloud at your data center” options like Exadata Cloud@Customer. This speaks to regulatory, latency, data sovereignty and operational constraints. For enterprises, the takeaway is that deployment flexibility is essential. A one‑size‑fits‑all cloud model may not meet strategic needs.
  • Business Users and Developers Now Have Voices in Database Strategy: Historically, databases were the domain of DBAs, IT operations, and infrastructure teams. Now business users and developers also have meaningful voices because of AI democratizing access. This shift means organizational structures, roles and processes must change. Data governance, training, tool‑selection and deployment pipelines need to reflect that the “consumer” of the database is broader.

The Big Quote: “[AI] can translate English to this language of computers, the language of data, which is SQL. So, what that means is you don't have to learn this crazy language anymore. So pretty much anyone, business people, lay people, can now talk using their normal natural language to the database, and the database will understand what they're saying and give them answers, build applications to all these and this is something I honestly never thought I'd see in my entire life, and it's here today."

More from Juan Loaiza and Oracle: Follow Juan on LinkedIn or learn more about Oracle's approach to security. Visit Cloud Wars for more.

The Data Stack Show
Re-Air: The Data Economy: Turning Information into a Tradable Commodity with Viktor Kessler of Vakamo

Oct 29, 2025 • 34:21


This episode is a re-air of one of our most popular conversations from this year, featuring insights worth revisiting. Thank you for being part of the Data Stack community. Stay up to date with the latest episodes at datastackshow.com.

This week on The Data Stack Show, the crew brings you another conversation live from Data Council in Oakland, California. In this episode, Viktor Kessler from Vakamo explores the evolution of data architecture from rigid warehouses to flexible Lakehouse systems. Powered by Apache Iceberg, this new approach enables seamless data sharing, governance, and potential monetization. Viktor discusses how open-source innovation is transforming data management, highlighting the shift towards treating data as a product and the emerging potential for AI-driven data exchanges. The conversation provides insights into the future of decentralized, adaptable data infrastructure and so much more.

Highlights from this week's conversation include:
  • Viktor's Background and Journey in Data (1:20)
  • Evolution of Data Architecture (4:41)
  • The Lakehouse Concept (7:12)
  • Open Source Innovation (11:05)
  • Data Production and Decentralization (15:06)
  • Governance in Decentralized Systems (18:53)
  • Data Economy and Monetization (21:15)
  • Security Concerns in Data Processing (24:21)
  • Impact on Data Consumers (27:37)
  • Compaction Issues in Data Tables (29:39)
  • Open Source Lake Keeper Tool and Parting Thoughts (33:02)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Hosted by Simplecast, an AdsWizz company. See pcm.adswizz.com for information about our collection and use of personal data for advertising.

Data Transforming Business
How enterprises can enable the Agentic AI Lakehouse on Apache Iceberg

Oct 29, 2025 • 14:34


"A flaw of warehouses is that you need to move all your data into them so you can keep it going, and for a lot of organisations that's a big hassle," says Will Martin, EMEA Evangelist at Dremio. "It can take a long time, it can be expensive, and you ultimately can end up ripping up processes that are there."

In this episode of the Don't Panic It's Just Data podcast, recorded live at Big Data LDN (BDL) 2025, Will Martin joins Shubhangi Dua, Podcast Host and Tech Journalist at EM360Tech. They talk about how enterprises can enable the Agentic AI Lakehouse on Apache Iceberg and why query performance is critical for efficient data analysis.

"If you have a data silo, it exists for a reason: something's feeding information to it. You usually have other processes feeding off of it. So if you shift all that to a warehouse, it disrupts a lot of your business," Martin tells Dua.

This is where a lakehouse comes into play. Organisations can federate access through a lakehouse approach: they centralise access in the lakehouse while keeping their data in its original location. Such a system helps people get started quickly.

In terms of data quality, if you access everything from one location, even with separate data silos, you can see all your data. This visibility allows you to identify issues, address them, and enhance your data quality. That's beneficial for AI, too, Martin explains.

Lakehouse Key to AI Infrastructure?

The lakehouse has been recognised for unifying and simplifying governance. An essential feature of a lakehouse is the data catalogue, which helps an organisation browse and find information. It also secures access and manages permissions.

"You can access in one place, but you can do all your security and permissions in one place rather than all these individual systems, which is great if you work in IT," reflects Martin. "There are some drawbacks to lakehouses. So, a big component of a lakehouse is metadata. It can be quite big, and it needs managing. Certain companies and vendors are trying to deal with that."

With AI and AI agents, it has become even harder to optimise analytics on a lakehouse. However, the situation is improving as technical barriers disappear. Martin explains that anyone can prompt a question; for instance, an enterprise CEO could ask questions about the data and demand justifications directly. In the past, a request would have to be submitted, and then a data scientist or engineer would create the dataset and hand it over. Now, engineers' roles have changed to focus on better optimisation. They help queries run smoothly and ensure tables are efficient. Agents cannot assist with that.

Also Listen: Dremio: The State of the Data Lakehouse

Optimise Lakehouse

Vendors such as

The Datanation Podcast - Podcast for Data Engineers, Analysts and Scientists
Lakehouse Catalogs Beyond Apache Iceberg, What could they Look Like?

Oct 17, 2025


Alex Merced discusses different paths to a Universal Lakehouse Catalog standard and their pros and cons. Find links to books, social and more at AlexMerced.com

The ERP Advisor
The ERP Minute Episode 205 - September 23rd, 2025

Sep 24, 2025 • 3:05


This week, Workday hosted their annual user conference, Workday Rising 2025, taking the opportunity to announce a number of new releases and updates. In other news, CrowdStrike and Salesforce announced a new strategic partnership to enhance the security of AI agents and applications built on Agentforce and the Salesforce platform. To round out the week, Qlik announced the general availability of Qlik Open Lakehouse, a fully managed Apache Iceberg service in Qlik Talend Cloud.

Connect with us!
https://www.erpadvisorsgroup.com
866-499-8550
LinkedIn: https://www.linkedin.com/company/erp-advisors-group
Twitter: https://twitter.com/erpadvisorsgrp
Facebook: https://www.facebook.com/erpadvisors
Instagram: https://www.instagram.com/erpadvisorsgroup
Pinterest: https://www.pinterest.com/erpadvisorsgroup
Medium: https://medium.com/@erpadvisorsgroup

The Analytics Engineering Podcast
Under the hood of Apache Iceberg (w/ Christian Thiel)

Aug 24, 2025 • 55:59


Tristan digs deep into the world of Apache Iceberg. There's a lot happening beneath the surface: multiple catalog interfaces, evolving REST specs, and competing implementations across open source, proprietary, and academic contexts. Christian Thiel, co-founder of Lakekeeper, one of the most widely used Iceberg catalogs, joins to walk through the state of the Iceberg ecosystem. For full show notes and to read 6+ years of back issues of the podcast's companion newsletter, head to https://roundup.getdbt.com. The Analytics Engineering Podcast is sponsored by dbt Labs.

alphalist.CTO Podcast - For CTOs and Technical Leaders
#124 - The Path to AGI: Inside poolside's AI Model Factory for Code with Eiso Kant

Jun 27, 2025 • 63:56 • Transcription Available


How do you build a foundation model that can write code at a human level? Eiso Kant (CTO & co-founder, Poolside) reveals the technical architecture, distributed team strategies, and reinforcement learning breakthroughs powering one of Europe's most ambitious AI startups. Learn how Poolside operates 10,000+ H200s, runs the world's largest code execution RL environment, and why CTOs must rethink engineering orgs for an agent-driven future.

WBSRocks: Business Growth with ERP and Digital Transformation
WBSP736: Grow Your Business by Learning from Enterprise Software Stories - Feb 2025, Ep 6, an Objective Panel Discussion

Jun 24, 2025 • 62:18


The tech landscape is rapidly evolving as major players and emerging startups alike double down on AI-driven innovation and infrastructure transformation. Qlik's acquisition of Upsolver enhances real-time data ingestion for Apache Iceberg, while Epicor's new Prism Vertical AI Agents are reimagining how frontline workers interact with enterprise intelligence. In parallel, Apple's entry into a consortium focused on next-gen AI data centers highlights growing urgency around power and scalability, especially as experts predict data center energy demands will double within five years. Meanwhile, strategic moves like SAP's quantum computing ambitions, IBM's acquisition of AST to deepen Oracle capabilities, and startups like ThoughtSpot, Qbiq, and Vasco advancing AI-powered solutions for analytics, design, and revenue planning underscore a new era of intelligent, responsive enterprise tech.

In today's episode, we invited a panel of industry analysts for a live discussion on LinkedIn to analyze current enterprise software stories. We covered a lot of ground, including the direction and roadmaps of each enterprise software vendor. Finally, we analyzed future trends and how they might shape the enterprise software industry.

Background Soundtrack: Away From You – Mauro Somm

For more information on growth strategies for SMBs using ERP and digital transformation, visit our community at wbs.rocks or elevatiq.com. To ensure that you never miss an episode of the WBS podcast, subscribe on your favorite podcasting platform.

The Data Stack Show
249: Quacking Through Data: DuckDB's Emerging Ecosystem

Jun 18, 2025 • 19:20


This week on The Data Stack Show, John Wessel and Matt Kelliher-Gibson dive into the recent Duck Lake announcement, exploring the evolving landscape of data analytics technologies. They discuss DuckDB's role as a lightweight, local analytics database and its potential as a caching layer for open table formats like Iceberg. The conversation also highlights the current state of data storage standards, focusing on agreements around Parquet and Iceberg, while noting the ongoing complexity in catalog management. Key takeaways include the importance of local compute solutions, the early stage of open table formats, and the potential for simplified data infrastructure that can provide faster, more cost-effective analytics workflows. The episode underscores the ongoing innovation in data technologies and the need for more streamlined, flexible data management solutions. Don't miss it!

Highlights from this week's conversation include:
  • Discussion on Duck Lake Announcement (1:41)
  • Compatibility with Apache Iceberg (4:05)
  • Use Cases for DuckDB (6:23)
  • Concerns About Data Management (10:01)
  • Introduction to Data Formats (11:40)
  • Catalog Space Challenges (13:13)
  • Metadata Orchestration (14:54)
  • Simplicity in Data Management (15:25)
  • SQL Demo Discussion (17:26)
  • Wrap-Up and Final Thoughts (18:44)

The Data Stack Show is a weekly podcast powered by RudderStack, customer data infrastructure that enables you to deliver real-time customer event data everywhere it's needed to power smarter decisions and better customer experiences. Each week, we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

OpenObservability Talks
ClickHouse: Breaking the Speed Limit for Observability and Analytics - OpenObservability Talks S5E12

OpenObservability Talks

Play Episode Listen Later May 27, 2025 58:27


The ClickHouse® project is a rising star in observability and analytics, challenging performance conventions with its breakneck speed. This open source OLAP column store, originally developed at Yandex to power their web analytics platform at massive scale, has quickly evolved into one of the hottest open source observability data stores around. Its published performance benchmarks have been the topic of conversation, outperforming many legacy databases and setting a new bar for fast queries over large volumes of data.

Our guest for this episode is Robert Hodges, CEO of Altinity, the second largest contributor to the ClickHouse project. With over 30 years of experience in databases, Robert brings deep insights into how ClickHouse is challenging legacy databases at scale. We'll also explore Altinity's just-launched groundbreaking open source project, Project Antalya, which extends ClickHouse with Apache Iceberg shared storage, unlocking dramatic improvements in both performance and cost efficiency. Think 90% reductions in storage costs and 10 to 100x faster queries, all without requiring any changes to your existing applications.

The episode was live-streamed on 20 May 2025 and the video is available at https://www.youtube.com/watch?v=VeyTL2JlWp0
You can read the recap post: https://medium.com/p/2004160b2f5e/

OpenObservability Talks episodes are released monthly, on the last Thursday of each month and are available for listening on your favorite podcast app and on YouTube.

We live-stream the episodes on Twitch and YouTube Live - tune in to see us live, and chime in with your comments and questions on the live chat.
https://www.youtube.com/@openobservabilitytalks
https://www.twitch.tv/openobservability

Show Notes:
00:00 - Intro
01:38 - ClickHouse elevator pitch
02:46 - guest intro
04:48 - ClickHouse under the hood
08:15 - SQL and the database evolution path
11:20 - the return of SQL
16:13 - design for speed
17:14 - use cases for ClickHouse
19:18 - ClickHouse ecosystem
22:22 - ClickHouse on Kubernetes
31:45 - know how ClickHouse works inside to get the most out of it
38:59 - ClickHouse for Observability
46:58 - Project Antalya
55:03 - Kubernetes 1.33 release
55:32 - OpenSearch 3.0 release
56:01 - New Permissive License for ML Models Announced by the Linux Foundation
57:08 - Outro

Resources:
  • ClickHouse on GitHub: https://github.com/ClickHouse/ClickHouse
  • Shopify's Journey to Planet-Scale Observability: https://medium.com/p/9c0b299a04dd
  • Project Antalya: https://altinity.com/blog/getting-started-with-altinitys-project-antalya
  • https://cmtops.dev/posts/building-observability-with-clickhouse/
  • Kubernetes 1.33 release highlights: https://www.linkedin.com/feed/update/urn:li:activity:7321054742174924800/
  • New Permissive License for Machine Learning Models Announced by the Linux Foundation: https://www.linkedin.com/feed/update/urn:li:share:7331046183244611584
  • OpenSearch 3.0 major release: https://www.linkedin.com/posts/horovits_opensearch-activity-7325834736008880128-kCqr

Socials:
  • Twitter: https://twitter.com/OpenObserv
  • YouTube: https://www.youtube.com/@openobservabilitytalks

Dotan Horovits
  • X (Twitter): @horovits
  • LinkedIn: www.linkedin.com/in/horovits
  • Mastodon: @horovits@fosstodon
  • BlueSky: @horovits.bsky.social

Robert Hodges
  • LinkedIn: https://www.linkedin.com/in/berkeleybob2105/

The ERP Advisor
The ERP Minute Episode 187 - May 20th, 2025

The ERP Advisor

Play Episode Listen Later May 21, 2025 3:35


This week, Sage announced its financial results for the six months to March 31st, 2025; Workday announced a new wave of Illuminate Agents designed to speed up hiring processes, improve worker experiences, streamline financial processes, and empower employees; UKG launched UKG Bryte payroll AI agents for both the UKG Pro and UKG Ready suites, delivering new tools that help all employees from the frontline to payroll administrators; OneStream announced a series of powerful enhancements at its Splash 2025 user conference; and Qlik announced the launch of Qlik Open Lakehouse, a fully managed Apache Iceberg solution built into Qlik Talend Cloud.

Connect with us!
  • https://www.erpadvisorsgroup.com
  • 866-499-8550
  • LinkedIn: https://www.linkedin.com/company/erp-advisors-group
  • Twitter: https://twitter.com/erpadvisorsgrp
  • Facebook: https://www.facebook.com/erpadvisors
  • Instagram: https://www.instagram.com/erpadvisorsgroup
  • Pinterest: https://www.pinterest.com/erpadvisorsgroup
  • Medium: https://medium.com/@erpadvisorsgroup

The Data Engineering Show
How RisingWave Is Redefining Real-Time Data with Postgres Power

The Data Engineering Show

Play Episode Listen Later May 7, 2025 31:36


In this episode of The Data Engineering Show, the bros sit down with Yingjun Wu, founder and CEO of RisingWave, to explore the innovative world of stream processing systems. Yingjun shares his journey from academic research to creating a Postgres-compatible streaming system that drastically reduces resource usage. They discuss how RisingWave's S3-based architecture and Postgres compatibility provide advantages over traditional systems like Flink, and explore the increasing role of Apache Iceberg in data pipelines.

The MAD Podcast with Matt Turck
Inside the Mind of Snowflake's CEO: Bold Bets in the AI Arms Race

The MAD Podcast with Matt Turck

Play Episode Listen Later Apr 10, 2025 83:41


In this episode, we sit down with Sridhar Ramaswamy, CEO of Snowflake, for an in-depth conversation about the company's transformation from a cloud analytics platform into a comprehensive AI data cloud. Sridhar shares insights on Snowflake's shift toward open formats like Apache Iceberg and why monetizing storage was, in his view, a strategic misstep.

We also dive into Snowflake's growing AI capabilities, including tools like Cortex Analyst and Cortex Search, and discuss how the company scaled AI deployments at an impressive pace. Sridhar reflects on lessons from his previous startup, Neeva, and offers candid thoughts on the search landscape, the future of BI tools, real-time analytics, and why partnering with OpenAI and Anthropic made more sense than building Snowflake's own foundation models.

Snowflake
  • Website - https://www.snowflake.com
  • X/Twitter - https://x.com/snowflakedb

Sridhar Ramaswamy
  • LinkedIn - https://www.linkedin.com/in/sridhar-ramaswamy
  • X/Twitter - https://x.com/RamaswmySridhar

FIRSTMARK
  • Website - https://firstmark.com
  • X/Twitter - https://twitter.com/FirstMarkCap

Matt Turck (Managing Director)
  • LinkedIn - https://www.linkedin.com/in/turck/
  • X/Twitter - https://twitter.com/mattturck

(00:00) Intro and current market tumult
(02:48) The evolution of Snowflake from IPO to Today
(07:22) Why Snowflake's earliest adopters came from financial services
(15:33) Resistance to change and the philosophical gap between structured data and AI
(17:12) What is the AI Data Cloud?
(23:15) Snowflake's AI agents: Cortex Search and Cortex Analyst
(25:03) How did Sridhar's experience at Google and Neeva shape his product vision?
(29:43) Was Neeva simply ahead of its time?
(38:37) The Epiphany mafia
(40:08) The current state of search and Google's conundrum
(46:45) “There's no AI strategy without a data strategy”
(56:49) Embracing Open Data Formats with Iceberg
(01:01:45) The Modern Data Stack and the future of BI
(01:08:22) The role of real-time data
(01:11:44) Current state of enterprise AI: from PoCs to production
(01:17:54) Building your own models vs. using foundation models
(01:19:47) Deepseek and open source AI
(01:21:17) Snowflake's 1M Minds program
(01:21:51) Snowflake AI Hub

Knee-deep in Tech
Episode 301

Knee-deep in Tech

Play Episode Listen Later Apr 8, 2025 31:31


In this news episode, the trio explores the latest updates in the Windows Insider program. They also discuss how QR code authentication in Entra ID can simplify access for frontline workers in specific scenarios. In Microsoft Fabric, the focus is on integrating Apache Iceberg data with OneLake, along with notable improvements to External Data Sharing. Azure Stream Analytics now supports integration with Azure Event Hub Schema Registry. Lastly, the Azure Virtual Network Manager Network Verifier can help you gain visibility into your network connectivity in Azure. Hosted on Acast. See acast.com/privacy for more information.

DataTalks.Club
Trends in Data Engineering – Adrian Brudaru

DataTalks.Club

Play Episode Listen Later Mar 7, 2025 56:59


In this podcast episode, we talked with Adrian Brudaru about the past, present and future of data engineering.

About the speaker:
Adrian Brudaru studied economics in Romania but soon got bored with how creative the industry was, and chose to go instead for the more factual side. He ended up in Berlin at the age of 25 and started a role as a business analyst. At the age of 30, he had enough of startups and decided to join a corporation, but quickly found out that it did not provide the challenge he wanted.

As going back to startups was not a desirable option either, he decided to postpone his decision by taking freelance work and has never looked back since. Five years later, he co-founded a company in the data space to try new things. This company is also looking to release open source tools to help democratize data engineering.

0:00 Introduction to DataTalks.Club
1:05 Discussing trends in data engineering with Adrian
2:03 Adrian's background and journey into data engineering
5:04 Growth and updates on Adrian's company, DLT Hub
9:05 Challenges and specialization in data engineering today
13:00 Opportunities for data engineers entering the field
15:00 The "Modern Data Stack" and its evolution
17:25 Emerging trends: AI integration and Iceberg technology
27:40 DuckDB and the emergence of portable, cost-effective data stacks
32:14 The rise and impact of dbt in data engineering
34:08 Alternatives to dbt: SQLMesh and others
35:25 Workflow orchestration tools: Airflow, Dagster, Prefect, and GitHub Actions
37:20 Audience questions: Career focus in data roles and AI engineering overlaps
39:00 The role of semantics in data and AI workflows
41:11 Focusing on learning concepts over tools when entering the field
45:15 Transitioning from backend to data engineering: challenges and opportunities
47:48 Current state of the data engineering job market in Europe and beyond
49:05 Introduction to Apache Iceberg, Delta, and Hudi file formats
50:40 Suitability of these formats for batch and streaming workloads
52:29 Tools for streaming: Kafka, SQS, and related trends
58:07 Building AI agents and enabling intelligent data applications
59:09 Closing discussion on the place of tools like DBT in the ecosystem

The Datanation Podcast - Podcast for Data Engineers, Analysts and Scientists

Alex Merced discusses the idea of whether Apache Iceberg and Delta Lake could merge. Follow my blog: https://medium.alexmerced.blog

Cloud Masters
Amazon S3 Tables explained: Better storage for AWS Analytics workloads

Cloud Masters

Play Episode Listen Later Jan 29, 2025 26:08


AWS Analytics expert Swapnil Bhoite joins us to break down Amazon S3 Tables. From comparing Parquet and Apache Iceberg formats to explaining critical features like compaction and snapshot management, Swapnil explores how this fully-managed service streamlines data lake operations. Learn when to adopt S3 Tables, understand its cost-performance benefits, and discover key migration considerations from existing Glue catalog implementations: essential knowledge for teams looking to scale their analytics workloads on AWS.

The ERP Advisor
The ERP Minute Episode 170 - January 21st, 2025

The ERP Advisor

Play Episode Listen Later Jan 22, 2025 3:05


As we continue to charge into 2025, major AI announcements came from Epicor, Oracle, and Microsoft. First, Epicor launched Epicor Prism, a network of vertical AI agents built specifically for the supply chain industries. Then, Oracle and Adarga, a leader in AI-driven information intelligence, announced they are partnering to bring Adarga's Vantage software to Oracle Cloud Infrastructure and Oracle's distributed cloud. Finally, Microsoft and Pearson, the world's lifelong learning company, announced a strategic collaboration on Tuesday to help address one of the top challenges facing organizations globally: skilling for the era of AI. To round out the week, Qlik announced the acquisition of Upsolver, a pioneer in real-time data streaming and Apache Iceberg optimization.

Connect with us!
  • https://www.erpadvisorsgroup.com
  • 866-499-8550
  • LinkedIn: https://www.linkedin.com/company/erp-advisors-group
  • Twitter: https://twitter.com/erpadvisorsgrp
  • Facebook: https://www.facebook.com/erpadvisors
  • Instagram: https://www.instagram.com/erpadvisorsgroup
  • Pinterest: https://www.pinterest.com/erpadvisorsgroup
  • Medium: https://medium.com/@erpadvisorsgroup

AWS Morning Brief
A Return to Greatness, or Degenerate Day 3?

AWS Morning Brief

Play Episode Listen Later Dec 9, 2024 18:39


AWS Morning Brief for the week of December 9, with Corey Quinn.

Links:
  • AWS announces access to VPC resources over AWS PrivateLink
  • Announcing Amazon Aurora DSQL (Preview)
  • Announcing Amazon Bedrock IDE in preview as part of Amazon SageMaker Unified Studio
  • AWS announces Amazon CloudWatch Database Insights
  • Amazon DynamoDB global tables previews multi-Region strong consistency
  • Amazon EC2 introduces Allowed AMIs to enhance AMI governance
  • Announcing Amazon EC2 I8g instances
  • Announcing Amazon EKS Auto Mode
  • Announcing Amazon EKS Hybrid Nodes
  • Announcing Amazon Elastic VMware Service (Preview)
  • Announcing Amazon FSx Intelligent-Tiering, a new storage class for FSx
  • Amazon Q Developer can now automate code reviews
  • Amazon Q Developer announces automatic unit test generation to accelerate feature development
  • Amazon S3 adds new default data integrity protections
  • Announcing Amazon S3 Metadata (Preview) – Easiest and fastest way to manage your metadata
  • Amazon S3 launches storage classes for AWS Dedicated Local Zones
  • Announcing Amazon S3 Tables – Fully managed Apache Iceberg tables optimized for analytics workloads
  • AWS announces Amazon SageMaker Lakehouse
  • AWS Control Tower launches managed controls using declarative policies
  • AWS announces AWS Data Transfer Terminal for high-speed data uploads
  • Amazon Web Services announces declarative policies
  • Introducing AWS Glue 5.0
  • AWS announces Invoice Configuration
  • AWS Marketplace now offers EC2 Image Builder components from independent software vendors
  • AWS announces AWS Security Incident Response for general availability
  • Announcing AWS Transfer Family web apps
  • Buy with AWS accelerates solution discovery and procurement on AWS Partner websites
  • Oracle Database@AWS is now in limited preview
  • PartyRock improves app discovery and announces upcoming free daily use
  • Announcing the preview of Amazon SageMaker Unified Studio
  • VPC Lattice now includes TCP support with VPC Resources
  • Announcing the 2024 Geo and Global AWS Partners of the Year
  • Amazon MemoryDB Multi-Region is now generally available
  • Top announcements of AWS re:Invent 2024

Sponsor
The Duckbill Group: https://www.duckbillgroup.com/

The Cloudcast
Data Lakehouses & Apache Iceberg

The Cloudcast

Play Episode Listen Later Oct 16, 2024 28:35


Alex Merced (@AMdatalakehouse, Senior Tech Evangelist, @dremio) talks about everything data and we dig deep into Apache Iceberg and DataLakehouses.

SHOW: 865

Want to go to All Things Open in Raleigh for FREE? (Oct 27th-29th)
We are offering 5 Free passes, first come, first serve for the Cloudcast Community -> Registration Link

Instructions:
  • Click reg link
  • Click “Get Tickets”
  • Choose ticket option
  • Proceed with registration (discount will automatically be applied, cost will be $0)

SHOW TRANSCRIPT: The Cloudcast #865 Transcript
SHOW VIDEO: https://youtube.com/@TheCloudcastNET
CLOUD NEWS OF THE WEEK: http://bit.ly/cloudcast-cnotw
NEW TO CLOUD? CHECK OUT OUR OTHER PODCAST: "CLOUDCAST BASICS"

SHOW NOTES:
  • Dremio (homepage)
  • Hands-on with Apache Iceberg Tutorial
  • Apache Iceberg Crash Course
  • Data Lakehouses and Apache Hudi (Cloudcast Eps. 694)
  • Apache Iceberg, the Definitive Guide (eBook)
  • Apache Iceberg (homepage)
  • Iceberg + Nessie Catalog (homepage)
  • Iceberg + Polaris Catalog (homepage)
  • AlexMerced.com
  • DataLakehouseHub.com

Topic 1 - Welcome to the show. Tell us a little bit about your background.
Topic 2 - It's been a little while since we talked about Data Lakehouses, can you give us a little bit of background on this space, and what the most recent dynamics are around these technologies.
Topic 3 - What are the typical integrations with a Data Lakehouse? How are users/developers typically interacting with Data Lakehouse technologies? [The marketplace for Iceberg catalogs like Nessie and Polaris]
Topic 4 - How does an open data format like Apache Iceberg fit into the bigger picture of data lakehouses, or large scale stores of data?
Topic 5 - How does Dremio enable Iceberg? How does Dremio sit in the intersection of Data Lakehouse, Data Mesh and Data Virtualization trends all of which come from the same fundamental problem, the growing scale of data use cases.
Topic 6 - We've seen companies start to rethink their data in the cloud strategies. Are you seeing on-premises making a comeback for large data applications?

FEEDBACK?
  • Email: show at the cloudcast dot net
  • Twitter: @cloudcastpod
  • Instagram: @cloudcastpod
  • TikTok: @cloudcastpod

Tech Disruptors
Starburst CEO on Future Of Data-Query Engines

Tech Disruptors

Play Episode Listen Later Oct 8, 2024 44:40


The highly distributed, dispersed and dynamic nature of enterprise data fuels demand for robust data-query engines for analytics and to drive intelligence. In this episode of the Tech Disruptors podcast, Starburst CEO Justin Borgman joins Sunil Rajgopal, senior software analyst at Bloomberg Intelligence, to discuss the shifting landscape for these products. They examine the future of data solutions, the evolving competitive landscape and developers' embrace of open-table formats like Apache Iceberg, with Borgman saying this was “the summer of Iceberg.” The two also talk about Starburst's product journey, competition with Dremio and Snowflake, and corporate IT-spending momentum.

Over The Edge
Leveraging Open Source Technologies for Data Lakehouses with Alex Merced, Senior Tech Evangelist at Dremio

Over The Edge

Play Episode Listen Later Oct 2, 2024 44:01


What makes data lakehouses a game changer in modern data management? In this episode, Bill sits down with Alex Merced, Senior Tech Evangelist at Dremio, to explore the evolution of data lakehouses and their role in bridging the gap between data lakes and data warehouses. Alex breaks down the components of data lakehouses and dives into the rise of Apache Iceberg.

Key Quotes:
  • “I love just get really deep into technology, really see what it does. And then scream at the rooftops how cool it is. And basically that was my charter. And [Apache] Iceberg, the more I learned about it, the more I realized this is really interesting.”
  • “Interoperability and data. Basically, a lot of the things that kept data in silos is now breaking apart.”
  • “So here we're talking about something that's going to be a standard. And that's when I think of the highest levels of openness matter because if it's something that a whole industry is going to build on, it should be something that the whole industry has to say in its evolution…And that's the beauty of openness that it does create these nice sort of places where we can collaborate and compete together.”

Timestamps:
(01:32) How Alex got started in his career
(03:54) Breaking down data lakehouses
(07:08) The idea behind an open data lakehouse
(10:10) Alex's involvement with Apache Iceberg
(15:13) Key components of a data lakehouse
(23:41) The growth of Apache Iceberg
(32:07) Dremio's Apache Iceberg crash course
(38:43) Explaining self-service analytics

Sponsor:
Over the Edge is brought to you by Dell Technologies to unlock the potential of your infrastructure with edge solutions. From hardware and software to data and operations, across your entire multi-cloud environment, we're here to help you simplify your edge so you can generate more value. Learn more by visiting dell.com/edge for more information or click on the link in the show notes.

Credits:
Over the Edge is hosted by Bill Pfeifer, and was created by Matt Trifiro and Ian Faison. Executive producers are Matt Trifiro, Ian Faison, Jon Libbey and Kyle Rusca. The show producer is Erin Stenhouse. The audio engineer is Brian Thomas. Additional production support from Elisabeth Plutko.

Links:
  • Follow Bill on LinkedIn
  • Follow Alex on LinkedIn

Data Protection Gumbo
267: Why the Data Lakehouse Is the Future—But What's Stopping It from Getting There? - Upsolver

Data Protection Gumbo

Play Episode Listen Later Oct 1, 2024 26:37


Ori Rafael, CEO and co-founder of Upsolver, explores the future of data management through data lakehouses. He explains the evolution of the lakehouse, a revolutionary architecture that combines the best of data lakes and warehouses. You will gain insights into key technologies like Apache Iceberg, how lakehouses enable advanced use cases such as AI, and how they help businesses reduce costs.

AWS Podcast
#681: Amazon DynamoDB Deep Dive

AWS Podcast

Play Episode Listen Later Aug 19, 2024 48:56


Simon is joined by Jason Hunter, AWS Principal Specialist Solutions Architect, to dive super-deep into how to make the most of DynamoDB. Whether you are new to DynamoDB, or have been using it for years - there is something in this episode for everyone!

Shownotes:
  • Jason's Blog Posts: https://aws.amazon.com/blogs/database/author/jzhunter/
  • The Apache Iceberg blog: https://aws.amazon.com/blogs/database/use-amazon-dynamodb-incremental-export-to-update-apache-iceberg-tables/
  • Traffic spikes (on-demand vs provisioned): https://aws.amazon.com/blogs/database/handle-traffic-spikes-with-amazon-dynamodb-provisioned-capacity/
  • Cost-effective bulk actions like delete: https://aws.amazon.com/blogs/database/cost-effective-bulk-processing-with-amazon-dynamodb/
  • A deep dive on partitions: https://aws.amazon.com/blogs/database/part-1-scaling-dynamodb-how-partitions-hot-keys-and-split-for-heat-impact-performance/
  • Global tables prescriptive guidance (the 25 page deep dive): https://docs.aws.amazon.com/prescriptive-guidance/latest/dynamodb-global-tables/introduction.html

Partially Redacted: Data Privacy, Security & Compliance
What is a Data Lakehouse with Upsolver's Ori Rafael

Partially Redacted: Data Privacy, Security & Compliance

Play Episode Listen Later Aug 14, 2024 31:59


In this episode, we sit down with Ori Rafael, CEO and Co-founder of Upsolver, to explore the rise of the lakehouse architecture and its significance in modern data management. Ori breaks down the origins of the lakehouse and how it leverages S3 to provide scalable and cost-effective storage. We discuss the critical role of open table formats like Apache Iceberg in unifying data lakes and warehouses, and how ETL processes differ between these environments. Ori also shares his vision for the future, highlighting how Upsolver is positioned to empower organizations as they navigate the rapidly evolving data landscape.

Data Engineering Podcast
Improve Data Quality Through Engineering Rigor And Business Engagement With Synq

Data Engineering Podcast

Play Episode Listen Later Jun 30, 2024 59:48


Summary
This episode features an insightful conversation with Petr Janda, the CEO and founder of Synq. Petr shares his journey from being an engineer to founding Synq, emphasizing the importance of treating data systems with the same rigor as engineering systems. He discusses the challenges and solutions in data reliability, including the need for transparency and ownership in data systems. Synq's platform helps data teams manage incidents, understand data dependencies, and ensure data quality by providing insights and automation capabilities. Petr emphasizes the need for a holistic approach to data reliability, integrating data systems into broader business processes. He highlights the role of data teams in modern organizations and how Synq is empowering them to achieve this.

Announcements
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
  • Your host is Tobias Macey and today I'm interviewing Petr Janda about Synq, a data reliability platform focused on leveling up data teams by supporting a culture of engineering rigor

Interview
  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe what Synq is and the story behind it?
  • Data observability/reliability is a category that grew rapidly over the past ~5 years and has several vendors focused on different elements of the problem. What are the capabilities that you saw as lacking in the ecosystem which you are looking to address?
  • Operational/infrastructure engineers have spent the past decade honing their approach to incident management and uptime commitments. How do those concepts map to the responsibilities and workflows of data teams?
  • Tooling only plays a small part in SLAs and incident management. How does Synq help to support the cultural transformation that is necessary?
  • What does an on-call rotation for a data engineer/data platform engineer look like as compared with an application-focused team?
  • How does the focus on data assets/data products shift your approach to observability as compared to a table/pipeline centric approach?
  • With the focus on sharing ownership beyond the boundaries on the data team there is a strong correlation with data governance principles. How do you see organizations incorporating Synq into their approach to data governance/compliance?
  • Can you describe how Synq is designed/implemented?
  • How have the scope and goals of the product changed since you first started working on it?
  • For a team who is onboarding onto Synq, what are the steps required to get it integrated into their technology stack and workflows?
  • What are the types of incidents/errors that you are able to identify and alert on?
  • What does a typical incident/error resolution process look like with Synq?
  • What are the most interesting, innovative, or unexpected ways that you have seen Synq used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on Synq?
  • When is Synq the wrong choice?
  • What do you have planned for the future of Synq?

Contact Info
  • LinkedIn (https://www.linkedin.com/in/petr-janda/?originalSubdomain=dk)
  • Substack (https://substack.com/@petrjanda)

Parting Question
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements
  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning.
  • Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story.

Links
  • Synq (https://www.synq.io/)
  • Incident Management (https://www.pagerduty.com/resources/learn/what-is-incident-management/)
  • SLA == Service Level Agreement (https://en.wikipedia.org/wiki/Service-level_agreement)
  • Data Governance (https://en.wikipedia.org/wiki/Data_governance)
      • Podcast Episode (https://www.dataengineeringpodcast.com/nicola-askham-practical-data-governance-episode-428)
  • PagerDuty (https://www.pagerduty.com/)
  • OpsGenie (https://www.atlassian.com/software/opsgenie)
  • Clickhouse (https://clickhouse.com/)
      • Podcast Episode (https://www.dataengineeringpodcast.com/clickhouse-data-warehouse-episode-88/)
  • dbt (https://www.getdbt.com/)
      • Podcast Episode (https://www.dataengineeringpodcast.com/dbt-data-analytics-episode-81/)
  • SQLMesh (https://sqlmesh.readthedocs.io/en/stable/)
      • Podcast Episode (https://www.dataengineeringpodcast.com/sqlmesh-open-source-dataops-episode-380)

The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)

Data Engineering Podcast
Stitching Together Enterprise Analytics With Microsoft Fabric

Data Engineering Podcast

Play Episode Listen Later Jun 23, 2024 53:22


Summary
Data lakehouse architectures have been gaining significant adoption. To accelerate adoption in the enterprise Microsoft has created the Fabric platform, based on their OneLake architecture. In this episode Dipti Borkar shares her experiences working on the product team at Fabric and explains the various use cases for the Fabric service.

Announcements
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
  • Your host is Tobias Macey and today I'm interviewing Dipti Borkar about her work on Microsoft Fabric and performing analytics on data withou

Interview
  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe what Microsoft Fabric is and the story behind it?
  • Data lakes in various forms have been gaining significant popularity as a unified interface to an organization's analytics. What are the motivating factors that you see for that trend?
  • Microsoft has been investing heavily in open source in recent years, and the Fabric platform relies on several open components. What are the benefits of layering on top of existing technologies rather than building a fully custom solution?
  • What are the elements of Fabric that were engineered specifically for the service?
  • What are the most interesting/complicated integration challenges?
  • How has your prior experience with Ahana and Presto informed your current work at Microsoft?
  • AI plays a substantial role in the product. What are the benefits of embedding Copilot into the data engine?
  • What are the challenges in terms of safety and reliability?
  • What are the most interesting, innovative, or unexpected ways that you have seen the Fabric platform used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on data lakes generally, and Fabric specifically?
  • When is Fabric the wrong choice?
  • What do you have planned for the future of data lake analytics?

Contact Info
  • LinkedIn (https://www.linkedin.com/in/diptiborkar/)

Parting Question
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements
  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning.
  • Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story.

Links
  • Microsoft Fabric (https://www.microsoft.com/microsoft-fabric)
  • Ahana episode (https://www.dataengineeringpodcast.com/ahana-presto-cloud-data-lake-episode-217)
  • DB2 Distributed (https://www.ibm.com/docs/en/db2/11.5?topic=managers-designing-distributed-databases)
  • Spark (https://spark.apache.org/)
  • Presto (https://prestodb.io/)
  • Azure Data (https://azure.microsoft.com/en-us/products#analytics)
  • MAD Landscape (https://mattturck.com/mad2024/)
      • Podcast Episode (https://www.dataengineeringpodcast.com/mad-landscape-2023-data-infrastructure-episode-369)
      • ML Podcast Episode (https://www.themachinelearningpodcast.com/mad-landscape-2023-ml-ai-episode-21)
  • Tableau (https://www.tableau.com/)
  • dbt (https://www.getdbt.com/)
  • Medallion Architecture (https://dataengineering.wiki/Concepts/Medallion+Architecture)
  • Microsoft Onelake (https://learn.microsoft.com/fabric/onelake/onelake-overview)
  • ORC (https://orc.apache.org/)
  • Parquet (https://parquet.incubator.apache.org)
  • Avro (https://avro.apache.org/)
  • Delta Lake (https://delta.io/)
  • Iceberg (https://iceberg.apache.org/)
      • Podcast Episode (https://www.dataengineeringpodcast.com/iceberg-with-ryan-blue-episode-52/)
  • Hudi (https://hudi.apache.org/)
      • Podcast Episode (https://www.dataengineeringpodcast.com/hudi-streaming-data-lake-episode-209)
  • Hadoop (https://hadoop.apache.org/)
  • PowerBI (https://www.microsoft.com/power-platform/products/power-bi)
      • Podcast Episode (https://www.dataengineeringpodcast.com/power-bi-business-intelligence-episode-154)
  • Velox (https://velox-lib.io/)
  • Gluten (https://gluten.apache.org/)
  • Apache XTable (https://xtable.apache.org/)
  • GraphQL (https://graphql.org/)
  • Formula 1 (https://www.formula1.com/)
  • McLaren (https://www.mclaren.com/)

The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA 
(http://creativecommons.org/licenses/by-sa/3.0/)

Data Engineering Podcast
Being Data Driven At Stripe With Trino And Iceberg

Play Episode Listen Later Jun 16, 2024 53:19


Summary Stripe is a company that relies on data to power their products and business. To support that functionality they have invested in Trino and Iceberg for their analytical workloads. In this episode Kevin Liu shares some of the interesting features that they have built by combining those technologies, as well as the challenges that they face in supporting the myriad workloads that are thrown at this layer of their data platform. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Your host is Tobias Macey and today I'm interviewing Kevin Liu about his use of Trino and Iceberg for Stripe's data lakehouse Interview Introduction How did you get involved in the area of data management? Can you describe what role Trino and Iceberg play in Stripe's data architecture? What are the ways in which your job responsibilities intersect with Stripe's lakehouse infrastructure? What were the requirements and selection criteria that led to the selection of that combination of technologies? What are the other systems that feed into and rely on the Trino/Iceberg service? What kinds of questions are you answering with table metadata? What use case/team does that support? What is the comparative utility of the Iceberg REST catalog? What are the shortcomings of Trino and Iceberg? 
What are the most interesting, innovative, or unexpected ways that you have seen Iceberg/Trino used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Stripe's data infrastructure? When is a lakehouse on Trino/Iceberg the wrong choice? What do you have planned for the future of Trino and Iceberg at Stripe? Contact Info Substack (https://kevinjqliu.substack.com) LinkedIn (https://www.linkedin.com/in/kevinjqliu) Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story. 
Links Trino (https://trino.io/) Iceberg (https://iceberg.apache.org/) Stripe (https://stripe.com/) Spark (https://spark.apache.org/) Redshift (https://aws.amazon.com/redshift/) Hive Metastore (https://cwiki.apache.org/confluence/display/hive/design#Design-Metastore) Python Iceberg (https://py.iceberg.apache.org/) Python Iceberg REST Catalog (https://github.com/kevinjqliu/iceberg-rest-catalog) Trino Metadata Table (https://trino.io/docs/current/connector/iceberg.html#metadata-tables) Flink (https://flink.apache.org/) Podcast Episode (https://www.dataengineeringpodcast.com/apache-flink-with-fabian-hueske-episode-57) Tabular (https://tabular.io/) Podcast Episode (https://www.dataengineeringpodcast.com/tabular-iceberg-lakehouse-tables-episode-363) Delta Table (https://delta.io/) Podcast Episode (https://www.dataengineeringpodcast.com/delta-lake-data-lake-episode-85/) Databricks Unity Catalog (https://www.databricks.com/product/unity-catalog) Starburst (https://www.starburst.io/) AWS Athena (https://aws.amazon.com/athena/) Kevin Trinofest Presentation (https://trino.io/blog/2023/07/19/trino-fest-2023-stripe.html) Alluxio (https://www.alluxio.io/) Podcast Episode (https://www.dataengineeringpodcast.com/alluxio-distributed-storage-episode-70) Parquet (https://parquet.incubator.apache.org/) Hudi (https://hudi.apache.org/) Trino Project Tardigrade (https://trino.io/blog/2022/05/05/tardigrade-launch.html) Trino On Ice (https://www.starburst.io/blog/iceberg-table-partitioning/) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)

Data Engineering Podcast
X-Ray Vision For Your Flink Stream Processing With Datorios

Play Episode Listen Later Jun 9, 2024 42:22


Summary Streaming data processing enables new categories of data products and analytics. Unfortunately, reasoning about stream processing engines is complex and lacks sufficient tooling. To address this shortcoming Datorios created an observability platform for Flink that brings visibility to the internals of this popular stream processing system. In this episode Ronen Korman and Stav Elkayam discuss how the increased understanding provided by purpose built observability improves the usefulness of Flink. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management This episode is supported by Code Comments, an original podcast from Red Hat. As someone who listens to the Data Engineering Podcast, you know that the road from tool selection to production readiness is anything but smooth or straight. In Code Comments, host Jamie Parker, Red Hatter and experienced engineer, shares the journey of technologists from across the industry and their hard-won lessons in implementing new technologies. I listened to the recent episode "Transforming Your Database" and appreciated the valuable advice on how to approach the selection and integration of new databases in applications and the impact on team dynamics. There are 3 seasons of great episodes and new ones landing everywhere you listen to podcasts. Search for "Code Comments" in your podcast player or go to dataengineeringpodcast.com/codecomments (https://www.dataengineeringpodcast.com/codecomments) today to subscribe. My thanks to the team at Code Comments for their support. Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. 
Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Your host is Tobias Macey and today I'm interviewing Ronen Korman and Stav Elkayam about pulling back the curtain on your real-time data streams by bringing intuitive observability to Flink streams Interview Introduction How did you get involved in the area of data management? Can you describe what Datorios is and the story behind it? Data observability has been gaining adoption for a number of years now, with a large focus on data warehouses. What are some of the unique challenges posed by Flink? How much of the complexity is due to the nature of streaming data vs. the architectural realities of Flink? How has the lack of visibility into the flow of data in Flink impacted the ways that teams think about where/when/how to apply it? How have the requirements of generative AI shifted the demand for streaming data systems? What role does Flink play in the architecture of generative AI systems? Can you describe how Datorios is implemented? How has the design and goals of Datorios changed since you first started working on it? How much of the Datorios architecture and functionality is specific to Flink and how are you thinking about its potential application to other streaming platforms? Can you describe how Datorios is used in a day-to-day workflow for someone building streaming applications on Flink? What are the most interesting, innovative, or unexpected ways that you have seen Datorios used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Datorios? When is Datorios the wrong choice? What do you have planned for the future of Datorios? 
Contact Info Ronen LinkedIn (https://www.linkedin.com/in/ronen-korman/) Stav LinkedIn (https://www.linkedin.com/in/stav-elkayam-118a2795/?originalSubdomain=il) Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story. Links Datorios (https://datorios.com/) Apache Flink (https://flink.apache.org/) Podcast Episode (https://www.dataengineeringpodcast.com/apache-flink-with-fabian-hueske-episode-57) ChatGPT-4o (https://openai.com/index/hello-gpt-4o/) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)

Data Engineering Podcast
Practical First Steps In Data Governance For Long Term Success

Play Episode Listen Later Jun 2, 2024 60:40


Summary Modern businesses aspire to be data driven, and technologists enjoy working through the challenge of building data systems to support that goal. Data governance is the binding force between these two parts of the organization. Nicola Askham found her way into data governance by accident, and stayed because of the benefit that she was able to provide by serving as a bridge between the technology and business. In this episode she shares the practical steps to implementing a data governance practice in your organization, and the pitfalls to avoid. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. This episode is supported by Code Comments, an original podcast from Red Hat. As someone who listens to the Data Engineering Podcast, you know that the road from tool selection to production readiness is anything but smooth or straight. In Code Comments, host Jamie Parker, Red Hatter and experienced engineer, shares the journey of technologists from across the industry and their hard-won lessons in implementing new technologies. I listened to the recent episode "Transforming Your Database" and appreciated the valuable advice on how to approach the selection and integration of new databases in applications and the impact on team dynamics. 
There are 3 seasons of great episodes and new ones landing everywhere you listen to podcasts. Search for "Code Comments" in your podcast player or go to dataengineeringpodcast.com/codecomments (https://www.dataengineeringpodcast.com/codecomments) today to subscribe. My thanks to the team at Code Comments for their support. Your host is Tobias Macey and today I'm interviewing Nicola Askham about the practical steps of building out a data governance practice in your organization Interview Introduction How did you get involved in the area of data management? Can you start by giving an overview of the scope and boundaries of data governance in an organization? At what point does a lack of an explicit governance policy become a liability? What are some of the misconceptions that you encounter about data governance? What impact has the evolution of data technologies had on the implementation of governance practices? (e.g. number/scale of systems, types of data, AI) Data governance can often become an exercise in boiling the ocean. What are the concrete first steps that will increase the success rate of a governance practice? Once a data governance project is underway, what are some of the common roadblocks that might derail progress? What are the net benefits to the data team and the organization when a data governance practice is established, active, and healthy? What are the most interesting, innovative, or unexpected ways that you have seen data governance applied? What are the most interesting, unexpected, or challenging lessons that you have learned while working on data governance/training/coaching? What are some of the pitfalls in data governance? What are some of the future trends in data governance that you are excited by? Are there any trends that concern you? 
Contact Info Website (https://www.nicolaaskham.com/) LinkedIn (https://www.linkedin.com/in/nicolaaskham/) Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story. Links Website (https://www.nicolaaskham.com/) Master Data Management (https://en.wikipedia.org/wiki/Master_data_management) Cartesian Join (https://www.geeksforgeeks.org/cartesian-join/) DAMA == Data Management Community (https://www.dama.org/) DMBOK == Data Management Body of Knowledge (https://www.dama.org/cpages/body-of-knowledge) DAMA DMBOK Wheel (https://www.dama.org/cpages/dmbok-2-wheel-images) CDMP (Certified Data Management Professional) Exam (https://www.dama.org/cpages/cdmp-information) Data Mesh (https://www.datamesh-architecture.com/) Data Governance First Steps Checklist (https://www.nicolaaskham.com/free-data-governance-checklist) The Never Normal (https://www.linkedin.com/newsletters/the-never-normal-6862024032934477824/) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)

Data Engineering Podcast
Data Migration Strategies For Large Scale Systems

Play Episode Listen Later May 27, 2024 60:00


Summary Any software system that survives long enough will require some form of migration or evolution. When that system is responsible for the data layer the process becomes more challenging. Sriram Panyam has been involved in several projects that required migration of large volumes of data in high traffic environments. In this episode he shares some of the valuable lessons that he learned about how to make those projects successful. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. This episode is supported by Code Comments, an original podcast from Red Hat. As someone who listens to the Data Engineering Podcast, you know that the road from tool selection to production readiness is anything but smooth or straight. In Code Comments, host Jamie Parker, Red Hatter and experienced engineer, shares the journey of technologists from across the industry and their hard-won lessons in implementing new technologies. I listened to the recent episode "Transforming Your Database" and appreciated the valuable advice on how to approach the selection and integration of new databases in applications and the impact on team dynamics. There are 3 seasons of great episodes and new ones landing everywhere you listen to podcasts. 
Search for "Code Comments" in your podcast player or go to dataengineeringpodcast.com/codecomments (https://www.dataengineeringpodcast.com/codecomments) today to subscribe. My thanks to the team at Code Comments for their support. Your host is Tobias Macey and today I'm interviewing Sriram Panyam about his experiences conducting large scale data migrations and the useful strategies that he learned in the process Interview Introduction How did you get involved in the area of data management? Can you start by sharing some of your experiences with data migration projects? As you have gone through successive migration projects, how has that influenced the ways that you think about architecting data systems? How would you categorize the different types and motivations of migrations? How does the motivation for a migration influence the ways that you plan for and execute that work? Can you talk us through one or two specific projects that you have taken part in? Part 1: The Triggers Section 1: Technical Limitations triggering Data Migration Scaling bottlenecks: Performance issues with databases, storage, or network infrastructure Legacy compatibility: Difficulties integrating with modern tools and cloud platforms System upgrades: The need to migrate data during major software changes (e.g., SQL Server version upgrade) Section 2: Types of Migrations for Infrastructure Focus Storage migration: Moving data between systems (HDD to SSD, SAN to NAS, etc.) Data center migration: Physical relocation or consolidation of data centers Virtualization migration: Moving from physical servers to virtual machines (or vice versa) Section 3: Technical Decisions Driving Data Migrations End-of-life support: Forced migration when older software or hardware is sunsetted Security and compliance: Adopting new platforms with better security postures Cost Optimization: Potential savings of cloud vs. 
on-premise data centers Part 2: Challenges (and Anxieties) Section 1: Technical Challenges Data transformation challenges: Schema changes, complex data mappings Network bandwidth and latency: Transferring large datasets efficiently Performance testing and load balancing: Ensuring new systems can handle the workload Live data consistency: Maintaining data integrity while updates occur in the source system Minimizing Lag: Techniques to reduce delays in replicating changes to the new system Change data capture: Identifying and tracking changes to the source system during migration Section 2: Operational Challenges Minimizing downtime: Strategies for service continuity during migration Change management and rollback plans: Dealing with unexpected issues Technical skills and resources: In-house expertise/data teams/external help Section 3: Security & Compliance Challenges Data encryption and protection: Methods for both in-transit and at-rest data Meeting audit requirements: Documenting data lineage & the chain of custody Managing access controls: Adjusting identity and role-based access to the new systems Part 3: Patterns Section 1: Infrastructure Migration Strategies Lift and shift: Migrating as-is vs. modernization and re-architecting during the move Phased vs. big bang approaches: Tradeoffs in risk vs. disruption Tools and automation: Using specialized software to streamline the process Dual writes: Managing updates to both old and new systems for a time Change data capture (CDC) methods: Log-based vs. 
trigger-based approaches for tracking changes Data validation & reconciliation: Ensuring consistency between source and target Section 2: Maintaining Performance and Reliability Disaster recovery planning: Failover mechanisms for the new environment Monitoring and alerting: Proactively identifying and addressing issues Capacity planning and forecasting growth to scale the new infrastructure Section 3: Data Consistency and Replication Replication tools: strategies and specialized tooling Data synchronization techniques, e.g. pros and cons of different methods (incremental vs. full) Testing/Verification Strategies for validating data correctness in a live environment Implication of large scale systems/environments Comparison of interesting strategies: DBLog, Debezium, Databus, GoldenGate, etc. What are the most interesting, innovative, or unexpected approaches to data migrations that you have seen or participated in? What are the most interesting, unexpected, or challenging lessons that you have learned while working on data migrations? When is a migration the wrong choice? What are the characteristics or features of data technologies and the overall ecosystem that can reduce the burden of data migration in the future? Contact Info LinkedIn (https://www.linkedin.com/in/srirampanyam/) Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. 
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story. Links DagKnows (https://dagknows.com) Google Cloud Dataflow (https://cloud.google.com/dataflow) Seinfeld Risk Management (https://www.youtube.com/watch) ACL == Access Control List (https://en.wikipedia.org/wiki/Access-control_list) LinkedIn Databus - Change Data Capture (https://github.com/linkedin/databus) Espresso Storage (https://engineering.linkedin.com/data-replication/open-sourcing-databus-linkedins-low-latency-change-data-capture-system) HDFS (https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html) Kafka (https://kafka.apache.org/) Postgres Replication Slots (https://www.postgresql.org/docs/current/logical-replication.html) Queueing Theory (https://en.wikipedia.org/wiki/Queueing_theory) Apache Beam (https://beam.apache.org/) Debezium (https://debezium.io/) Airbyte (https://airbyte.com/) Fivetran (fivetran.com) Designing Data Intensive Applications (https://amzn.to/4aAztR1) by Martin Kleppmann (https://martin.kleppmann.com/) (affiliate link) Vector Databases (https://en.wikipedia.org/wiki/Vector_database) Pinecone (https://www.pinecone.io/) Weaviate (https://www.weveate.io/) LAMP Stack (https://en.wikipedia.org/wiki/LAMP_(software_bundle)) Netflix DBLog (https://arxiv.org/abs/2010.12597) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)

Data Engineering Podcast
Zenlytic Is Building You A Better Coworker With AI Agents

Play Episode Listen Later May 19, 2024 54:19


Summary The purpose of business intelligence systems is to allow anyone in the business to access and decode data to help them make informed decisions. Unfortunately this often turns into an exercise in frustration for everyone involved due to complex workflows and hard-to-understand dashboards. The team at Zenlytic have leaned on the promise of large language models to build an AI agent that lets you converse with your data. In this episode they share their journey through the fast-moving landscape of generative AI and unpack the difference between an AI chatbot and an AI agent. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management This episode is supported by Code Comments, an original podcast from Red Hat. As someone who listens to the Data Engineering Podcast, you know that the road from tool selection to production readiness is anything but smooth or straight. In Code Comments, host Jamie Parker, Red Hatter and experienced engineer, shares the journey of technologists from across the industry and their hard-won lessons in implementing new technologies. I listened to the recent episode "Transforming Your Database" and appreciated the valuable advice on how to approach the selection and integration of new databases in applications and the impact on team dynamics. There are 3 seasons of great episodes and new ones landing everywhere you listen to podcasts. Search for "Code Comments" in your podcast player or go to dataengineeringpodcast.com/codecomments (https://www.dataengineeringpodcast.com/codecomments) today to subscribe. My thanks to the team at Code Comments for their support. Data lakes are notoriously complex. 
For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Your host is Tobias Macey and today I'm interviewing Ryan Janssen and Paul Blankley about their experiences building AI powered agents for interacting with your data Interview Introduction How did you get involved in data? In AI? Can you describe what Zenlytic is and the role that AI is playing in your platform? What have been the key stages in your AI journey? What are some of the dead ends that you ran into along the path to where you are today? What are some of the persistent challenges that you are facing? So tell us more about data agents. Firstly, what are data agents and why do you think they're important? How are data agents different from chatbots? Are data agents harder to build? How do you make them work in production? What other technical architectures have you had to develop to support the use of AI in Zenlytic? How have you approached the work of customer education as you introduce this functionality? What are some of the most interesting or erroneous misconceptions that you have heard about what the AI can and can't do? How have you balanced accuracy/trustworthiness with user experience and flexibility in the conversational AI, given the potential for these models to create erroneous responses? What are the most interesting, innovative, or unexpected ways that you have seen your AI agent used? 
What are the most interesting, unexpected, or challenging lessons that you have learned while working on building an AI agent for business intelligence? When is an AI agent the wrong choice? What do you have planned for the future of AI in the Zenlytic product? Contact Info Ryan LinkedIn (https://www.linkedin.com/in/janssenryan) Paul LinkedIn (https://www.linkedin.com/in/paulblankley/) Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story. 
Links Zenlytic (https://www.zenlytic.com/) Podcast Episode (https://www.dataengineeringpodcast.com/zenlytic-self-serve-business-intelligence-episode-371) Attention is all you need (https://arxiv.org/abs/1706.03762) Transformers (https://en.wikipedia.org/wiki/Transformer_(deep_learning_architecture)) BERT (https://en.wikipedia.org/wiki/BERT_(language_model)) The Bitter Lesson (http://www.incompleteideas.net/IncIdeas/BitterLesson.html) Richard Sutton PID Loops (https://en.wikipedia.org/wiki/Proportional%E2%80%93integral%E2%80%93derivative_controller) AutoGPT (https://github.com/Significant-Gravitas/AutoGPT) Devin.ai (https://www.cognition.ai/introducing-devin) Google Gemini (https://gemini.google.com/) Anthropic Claude (https://www.anthropic.com/claude) OpenAI Code Interpreter (https://platform.openai.com/docs/assistants/tools/code-interpreter) Edward Tufte (https://www.edwardtufte.com/tufte/books_vdqi) Looker ActionHub (https://developers.looker.com/actions/overview/) OAuth (https://oauth.net/2/) GitHub Copilot (https://github.com/features/copilot) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)

Data Engineering Podcast
Release Management For Data Platform Services And Logic

Data Engineering Podcast

Play Episode Listen Later May 12, 2024 20:08


Summary Building a data platform is a substantial engineering endeavor. Once it is running, the next challenge is figuring out how to address release management for all of the different component parts. The services and systems need to be kept up to date, but so does the code that controls their behavior. In this episode your host Tobias Macey reflects on his current challenges in this area and some of the factors that contribute to the complexity of the problem. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management This episode is supported by Code Comments, an original podcast from Red Hat. As someone who listens to the Data Engineering Podcast, you know that the road from tool selection to production readiness is anything but smooth or straight. In Code Comments, host Jamie Parker, Red Hatter and experienced engineer, shares the journey of technologists from across the industry and their hard-won lessons in implementing new technologies. I listened to the recent episode "Transforming Your Database" and appreciated the valuable advice on how to approach the selection and integration of new databases in applications and the impact on team dynamics. There are 3 seasons of great episodes and new ones landing everywhere you listen to podcasts. Search for "Code Comments" in your podcast player or go to dataengineeringpodcast.com/codecomments (https://www.dataengineeringpodcast.com/codecomments) today to subscribe. My thanks to the team at Code Comments for their support. Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? 
Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Your host is Tobias Macey and today I want to talk about my experiences managing the QA and release management process of my data platform Interview Introduction As a team, our overall goal is to ensure that the production environment for our data platform is highly stable and reliable. This is the foundational element of establishing and maintaining trust with the consumers of our data. In order to support this effort, we need to ensure that only changes that have been tested and verified are promoted to production. Our current challenge is one that plagues all data teams. We want to have an environment that mirrors our production environment that is available for testing, but it's not feasible to maintain a complete duplicate of all of the production data. Compounding that challenge is the fact that each of the components of our data platform interacts with data in slightly different ways and needs a different process for ensuring that changes are being promoted safely. Contact Info Website (https://www.dataengineeringpodcast.com) Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story. 
Links Data Platforms and Leaky Abstractions Episode (https://www.dataengineeringpodcast.com/abstractions-and-technical-debt-episode-374) Building A Data Platform From Scratch (https://www.dataengineeringpodcast.com/designing-a-lakehouse-from-scratch-episode-354) Airbyte (https://airbyte.com/) Podcast Episode (https://www.dataengineeringpodcast.com/airbyte-open-source-data-integration-episode-173/) Trino (https://trino.io/) dbt (https://www.getdbt.com/) Starburst Galaxy (https://www.starburst.io/platform/starburst-galaxy/) Superset (https://superset.apache.org/) Dagster (https://dagster.io/) LakeFS (https://lakefs.io/) Podcast Episode (https://www.dataengineeringpodcast.com/lakefs-data-lake-versioning-episode-157) Nessie (https://projectnessie.org/) Podcast Episode (https://www.dataengineeringpodcast.com/nessie-data-lakehouse-data-versioning-episode-416) Iceberg (https://iceberg.apache.org/) Snowflake (https://www.snowflake.com/en/) LocalStack (https://www.localstack.cloud/) DSL == Domain Specific Language (https://en.wikipedia.org/wiki/Domain-specific_language) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)

Software Engineering Daily
Iceberg at Netflix and Beyond with Ryan Blue

Software Engineering Daily

Play Episode Listen Later Mar 7, 2024 47:37


Apache Iceberg is an open source, high-performance format for huge data tables. Iceberg enables the use of SQL tables for big data, while making it possible for engines like Spark and Hive to safely work with the same tables at the same time. Iceberg was started at Netflix by Ryan Blue and Dan Weeks, and ...
