Podcasts about apache iceberg

46 PODCASTS
115 EPISODES
43m AVG DURATION
1 EPISODE EVERY OTHER WEEK
Nov 3, 2025 LATEST

POPULARITY (chart by year, 2017 to 2024)

Best podcasts about apache iceberg

Data Engineering Podcast

36 episodes with apache iceberg

The Datanation Podcast - Podcast for Data Engineers, Analysts and Scientists

19 episodes with apache iceberg

Engenharia de Dados [Cast]

2 episodes with apache iceberg

AWS Morning Brief

2 episodes with apache iceberg

The Cloudcast

2 episodes with apache iceberg

Open Source Startup Podcast

2 episodes with apache iceberg

The Ravit Show

3 episodes with apache iceberg

Latest podcast episodes about apache iceberg

Oracle's Juan Loaiza Discusses Trust, Privacy, Security in the Age of AI | Cloud Wars Live

Cloud Wars Live with Bob Evans

Nov 3, 2025 · 18:56

Juan Loaiza is the EVP of Database Technologies at Oracle. In today's special episode of Cloud Wars Live, Loaiza joins Bob Evans to discuss how AI is transforming the way businesses interact with data. He spotlights Oracle's new AI-native database, the importance of trust and security in enterprise AI, and why business users now play a bigger role in data strategy. It's a revealing look at how Oracle is shaping the future of intelligent data systems.

The AI Data Revolution

The Big Themes:

Trust, Governance, and Privacy Must Be Built Into the AI-Data Stack: One of the strongest points made by Loaiza is about the risk of AI in enterprises: hallucinations, misuse of data, privacy violations, regulatory consequences. When mission-critical systems (hospitals, banks, telecoms) are involved, errors are unacceptable and can be illegal. Oracle's approach is to embed privacy and access controls down into the database engine: the system knows who the end user is, what they can see, and ensures AI cannot leak unauthorized data.

Multi-Cloud, On-Premises, Hybrid - Customers Want Flexibility: Loaiza describes how Oracle is enabling customers to run their database and AI workloads wherever they need: on-premises, in public clouds (AWS, Azure, Google Cloud), or via "cloud at your data center" options like Exadata Cloud@Customer. This speaks to regulatory, latency, data sovereignty and operational constraints. For enterprises, the takeaway is that deployment flexibility is essential. A one-size-fits-all cloud model may not meet strategic needs.

Business Users and Developers Now Have Voices in Database Strategy: Historically, databases were the domain of DBAs, IT operations, and infrastructure teams. Now business users and developers also have meaningful voices because of AI democratizing access. This shift means organizational structures, roles and processes must change. Data governance, training, tool selection and deployment pipelines need to reflect that the "consumer" of the database is broader.

The Big Quote: "[AI] can translate English to this language of computers, the language of data, which is SQL. So, what that means is you don't have to learn this crazy language anymore. So pretty much anyone, business people, lay people, can now talk using their normal natural language to the database, and the database will understand what they're saying and give them answers, build applications to all these and this is something I honestly never thought I'd see in my entire life, and it's here today."

More from Juan Loaiza and Oracle:

Follow Juan on LinkedIn or learn more about Oracle's approach to security. Visit Cloud Wars for more.

Re-Air: The Data Economy: Turning Information into a Tradable Commodity with Viktor Kessler of Vakamo

The Data Stack Show

Oct 29, 2025 · 34:21

This episode is a re-air of one of our most popular conversations from this year, featuring insights worth revisiting. Thank you for being part of the Data Stack community. Stay up to date with the latest episodes at datastackshow.com.

This week on The Data Stack Show, the crew brings you another conversation live from Data Council in Oakland, California. In this episode, Viktor Kessler from Vakamo explores the evolution of data architecture from rigid warehouses to flexible lakehouse systems. Powered by Apache Iceberg, this new approach enables seamless data sharing, governance, and potential monetization. Viktor discusses how open-source innovation is transforming data management, highlighting the shift towards treating data as a product and the emerging potential for AI-driven data exchanges. The conversation provides insights into the future of decentralized, adaptable data infrastructure and so much more.

Highlights from this week's conversation include:
Viktor's Background and Journey in Data (1:20)
Evolution of Data Architecture (4:41)
The Lakehouse Concept (7:12)
Open Source Innovation (11:05)
Data Production and Decentralization (15:06)
Governance in Decentralized Systems (18:53)
Data Economy and Monetization (21:15)
Security Concerns in Data Processing (24:21)
Impact on Data Consumers (27:37)
Compaction Issues in Data Tables (29:39)
Open Source Lakekeeper Tool and Parting Thoughts (33:02)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

How enterprises can enable the Agentic AI Lakehouse on Apache Iceberg

Data Transforming Business

Oct 29, 2025 · 14:34

"A flaw of warehouses is that you need to move all your data into them so you can keep it going, and for a lot of organisations that's a big hassle," says Will Martin, EMEA Evangelist at Dremio. "It can take a long time, it can be expensive, and you ultimately can end up ripping up processes that are there."

In this episode of the Don't Panic It's Just Data podcast, recorded live at Big Data LDN (BDL) 2025, Will Martin joins Shubhangi Dua, Podcast Host and Tech Journalist at EM360Tech. They talk about how enterprises can enable the Agentic AI Lakehouse on Apache Iceberg and why query performance is critical for efficient data analysis.

"If you have a data silo, it exists for a reason—something's feeding information to it. You usually have other processes feeding off of it. So if you shift all that to a warehouse, it disrupts a lot of your business," Martin tells Dua.

This is where a lakehouse comes into play. Organisations can federate access through a lakehouse: they centralise access in one place while keeping the data in its original location. Such a system helps people get started quickly.

In terms of data quality, if you access everything from one location, even with separate data silos, you can see all your data. This visibility allows you to identify issues, address them, and enhance your data quality. That's beneficial for AI, too, Martin explains.

Lakehouse Key to AI Infrastructure?

The lakehouse has been recognised for unifying and simplifying governance. An essential feature of a lakehouse is the data catalogue, which helps an organisation browse and find information. It also secures access and manages permissions.

"You can access in one place, but you can do all your security and permissions in one place rather than all these individual systems, which is great if you work in IT," reflects Martin. "There are some drawbacks to lakehouses. So, a big component of a lakehouse is metadata. It can be quite big, and it needs managing. Certain companies and vendors are trying to deal with that."

With AI and AI agents, it's become even harder to optimise analytics on a lakehouse. However, this has improved as technical barriers disappear. Martin explains that anyone can prompt a question; for instance, an enterprise CEO could ask questions about the data and demand justifications directly. In the past, a request would have to be submitted, and then a data scientist or engineer would create the dataset and hand it over. Now, engineers' roles have changed to focus on better optimisation. They help queries run smoothly and ensure tables are efficient. Agents cannot assist with that.

Also Listen: Dremio: The State of the Data Lakehouse

Optimise Lakehouse

Vendors such as

ceo ai podcast hosts enterprises enable organisations dua agentic tech journalist will martin apache iceberg dremio panic it

Lakehouse Catalogs Beyond Apache Iceberg, What could they Look Like?

The Datanation Podcast - Podcast for Data Engineers, Analysts and Scientists

Oct 17, 2025

Alex Merced discusses different paths to a Universal Lakehouse Catalog standard and their pros and cons. Find links to books, social and more at AlexMerced.com

catalog alex merced apache iceberg

The ERP Minute Episode 205 - September 23rd, 2025

The ERP Advisor

Sep 24, 2025 · 3:05

This week, Workday hosted its annual user conference, Workday Rising 2025, taking the opportunity to announce a number of new releases and updates. In other news, CrowdStrike and Salesforce announced a new strategic partnership to enhance the security of AI agents and applications built on Agentforce and the Salesforce platform. To round out the week, Qlik announced the general availability of Qlik Open Lakehouse, a fully managed Apache Iceberg service in Qlik Talend Cloud.

Connect with us!
https://www.erpadvisorsgroup.com
866-499-8550
LinkedIn: https://www.linkedin.com/company/erp-advisors-group
Twitter: https://twitter.com/erpadvisorsgrp
Facebook: https://www.facebook.com/erpadvisors
Instagram: https://www.instagram.com/erpadvisorsgroup
Pinterest: https://www.pinterest.com/erpadvisorsgroup
Medium: https://medium.com/@erpadvisorsgroup

ai salesforce workday crowdstrike qlik apache iceberg

Under the hood of Apache Iceberg (w/ Christian Thiel)

The Analytics Engineering Podcast

Aug 24, 2025 · 55:59

Tristan digs deep into the world of Apache Iceberg. There's a lot happening beneath the surface: multiple catalog interfaces, evolving REST specs, and competing implementations across open source, proprietary, and academic contexts. Christian Thiel, co-founder of Lakekeeper, one of the most widely used Iceberg catalogs, joins to walk through the state of the Iceberg ecosystem. For full show notes and to read 6+ years of back issues of the podcast's companion newsletter, head to https://roundup.getdbt.com. The Analytics Engineering Podcast is sponsored by dbt Labs.

hood labs iceberg thiel apache iceberg

From Iceberg to Insight: Qlik's Vision for AI-Ready Data

The Ravit Show

Jul 23, 2025 · 11:05

What does it take to build a truly AI-ready data stack? At Qlik Connect, I sat down with Sam Pierson, SVP of the Data Business Unit R&D at Qlik, to talk about what's shaping enterprise data strategies right now.

We covered a lot, including:
- Why a strong data foundation matters more than ever in the agentic AI era
- How Qlik's Open Lakehouse (announced at Qlik Connect) fits into that picture
- The role of Apache Iceberg and what's ahead for its adoption in 2025
- How the Qlik Talend Cloud and Open Lakehouse come together for customers
- And how the data and analytics teams at Qlik align behind one shared vision

It was a timely conversation, especially as more organizations move from experimenting with GenAI to actually deploying it across workflows.

If you're thinking about how to connect data, AI, and analytics more seamlessly… this is the kind of conversation worth following.

What are you seeing in your own organization when it comes to AI adoption and data readiness?

#data #ai #qlikconnect #agents #theravitshow

ai vision data svp iceberg genai qlik apache iceberg

Future of Iceberg and Open Lakehouse Announcement

The Ravit Show

Jul 22, 2025 · 7:51

What does the future of open data architecture look like for enterprises? At Qlik Connect, I had the chance to sit down with Ori Rafael and dig into some big announcements that signal a shift in how modern data teams will work:

- Open Lakehouse: Qlik launched its Open Lakehouse, built on Apache Iceberg. It's a step toward unifying real-time data access, open table formats, and full AI-readiness, without locking you into a vendor. It's designed for teams who want to move fast, stay open, and keep control of their data.
- Upsolver acquisition: Qlik also quietly made a big move: acquiring Upsolver. This adds low-latency ingestion and smarter optimization to Iceberg workflows. It's a clear signal that Qlik is investing deeply in making Iceberg work in production for enterprise-grade use cases.
- Scaling Iceberg: Ori shared that scaling Iceberg means solving for more than just storage; it's about governance, performance, and ecosystem compatibility. Qlik's platform aims to help data teams avoid piecing together half-baked solutions and instead offer something integrated.
- What enterprise leaders care about: Many are excited about Iceberg because it brings structure, openness, and scalability together. Especially in environments where teams are juggling legacy warehouses, lakehouses, and emerging AI pipelines, it's a foundation they can build on.

The data space is changing fast, and conversations like this one at Qlik Connect are a great way to understand where things are headed, and what decisions are driving it.

What's your take on Iceberg in enterprise environments?

#data #ai #iceberg #qlikconnect #theravitshow

ai iceberg qlik apache iceberg

#124 - The Path to AGI: Inside poolside's AI Model Factory for Code with Eiso Kant

alphalist.CTO Podcast - For CTOs and Technical Leaders

Jun 27, 2025 · 63:56

How do you build a foundation model that can write code at a human level? Eiso Kant (CTO & co-founder, Poolside) reveals the technical architecture, distributed team strategies, and reinforcement learning breakthroughs powering one of Europe's most ambitious AI startups. Learn how Poolside operates 10,000+ H200s, runs the world's largest code execution RL environment, and why CTOs must rethink engineering orgs for an agent-driven future.

ai europe model code factory kant poolside rl ctos reinforcement learning distributed teams foundation models apache iceberg

WBSP736: Grow Your Business by Learning from Enterprise Software Stories - Feb 2025, Ep 6, an Objective Panel Discussion

WBSRocks: Business Growth with ERP and Digital Transformation

Jun 24, 2025 · 62:18

The tech landscape is rapidly evolving as major players and emerging startups alike double down on AI-driven innovation and infrastructure transformation. Qlik's acquisition of Upsolver enhances real-time data ingestion for Apache Iceberg, while Epicor's new Prism Vertical AI Agents are reimagining how frontline workers interact with enterprise intelligence. In parallel, Apple's entry into a consortium focused on next-gen AI data centers highlights growing urgency around power and scalability, especially as experts predict data center energy demands will double within five years. Meanwhile, strategic moves like SAP's quantum computing ambitions, IBM's acquisition of AST to deepen Oracle capabilities, and startups like ThoughtSpot, Qbiq, and Vasco advancing AI-powered solutions for analytics, design, and revenue planning underscore a new era of intelligent, responsive enterprise tech.

In today's episode, we invited a panel of industry analysts for a live discussion on LinkedIn to analyze current enterprise software stories. We covered a lot of ground, including the direction and roadmaps of each enterprise software vendor. Finally, we analyzed future trends and how they might shape the enterprise software industry.

Background Soundtrack: Away From You – Mauro Somm

For more information on growth strategies for SMBs using ERP and digital transformation, visit our community at wbs.rocks or elevatiq.com. To ensure that you never miss an episode of the WBS podcast, subscribe on your favorite podcasting platform.

learning ai stories apple ibm oracle grow your business panel discussion sap objective vasco erp ast smbs enterprise software qlik wbs thoughtspot epicor apache iceberg

249: Quacking Through Data: DuckDB's Emerging Ecosystem

The Data Stack Show

Jun 18, 2025 · 19:20

This week on The Data Stack Show, John Wessel and Matt Kelliher-Gibson dive into the recent DuckLake announcement, exploring the evolving landscape of data analytics technologies. They discuss DuckDB's role as a lightweight, local analytics database and its potential as a caching layer for open table formats like Iceberg. The conversation also highlights the current state of data storage standards, focusing on agreements around Parquet and Iceberg, while noting the ongoing complexity in catalog management. Key takeaways include the importance of local compute solutions, the early stage of open table formats, and the potential for simplified data infrastructure that can provide faster, more cost-effective analytics workflows. The episode underscores the ongoing innovation in data technologies and the need for more streamlined, flexible data management solutions. Don't miss it!

Highlights from this week's conversation include:
Discussion on DuckLake Announcement (1:41)
Compatibility with Apache Iceberg (4:05)
Use Cases for DuckDB (6:23)
Concerns About Data Management (10:01)
Introduction to Data Formats (11:40)
Catalog Space Challenges (13:13)
Metadata Orchestration (14:54)
Simplicity in Data Management (15:25)
SQL Demo Discussion (17:26)
Wrap-Up and Final Thoughts (18:44)

The Data Stack Show is a weekly podcast powered by RudderStack, customer data infrastructure that enables you to deliver real-time customer event data everywhere it's needed to power smarter decisions and better customer experiences. Each week, we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

data emerging simplicity final thoughts ecosystem iceberg wrap up compatibility use cases data management parquet duckdb apache iceberg rudderstack

ClickHouse: Breaking the Speed Limit for Observability and Analytics - OpenObservability Talks S5E12

OpenObservability Talks

May 27, 2025 · 58:27

The ClickHouse® project is a rising star in observability and analytics, challenging performance conventions with its breakneck speed. This open source OLAP column store, originally developed at Yandex to power their web analytics platform at massive scale, has quickly evolved into one of the hottest open source observability data stores around. Its published performance benchmarks have been the topic of conversation, outperforming many legacy databases and setting a new bar for fast queries over large volumes of data.

Our guest for this episode is Robert Hodges, CEO of Altinity, the second largest contributor to the ClickHouse project. With over 30 years of experience in databases, Robert brings deep insights into how ClickHouse is challenging legacy databases at scale. We'll also explore Altinity's just-launched groundbreaking open source project, Project Antalya, which extends ClickHouse with Apache Iceberg shared storage, unlocking dramatic improvements in both performance and cost efficiency. Think 90% reductions in storage costs and 10 to 100x faster queries, all without requiring any changes to your existing applications.

The episode was live-streamed on 20 May 2025 and the video is available at https://www.youtube.com/watch?v=VeyTL2JlWp0

You can read the recap post: https://medium.com/p/2004160b2f5e/

OpenObservability Talks episodes are released monthly, on the last Thursday of each month and are available for listening on your favorite podcast app and on YouTube.

We live-stream the episodes on Twitch and YouTube Live - tune in to see us live, and chime in with your comments and questions on the live chat.
https://www.youtube.com/@openobservabilitytalks
https://www.twitch.tv/openobservability

Show Notes:
00:00 - Intro
01:38 - ClickHouse elevator pitch
02:46 - guest intro
04:48 - ClickHouse under the hood
08:15 - SQL and the database evolution path
11:20 - the return of SQL
16:13 - design for speed
17:14 - use cases for ClickHouse
19:18 - ClickHouse ecosystem
22:22 - ClickHouse on Kubernetes
31:45 - know how ClickHouse works inside to get the most out of it
38:59 - ClickHouse for Observability
46:58 - Project Antalya
55:03 - Kubernetes 1.33 release
55:32 - OpenSearch 3.0 release
56:01 - New Permissive License for ML Models Announced by the Linux Foundation
57:08 - Outro

Resources:
ClickHouse on GitHub: https://github.com/ClickHouse/ClickHouse
Shopify's Journey to Planet-Scale Observability: https://medium.com/p/9c0b299a04dd
Project Antalya: https://altinity.com/blog/getting-started-with-altinitys-project-antalya
https://cmtops.dev/posts/building-observability-with-clickhouse/
Kubernetes 1.33 release highlights: https://www.linkedin.com/feed/update/urn:li:activity:7321054742174924800/
New Permissive License for Machine Learning Models Announced by the Linux Foundation: https://www.linkedin.com/feed/update/urn:li:share:7331046183244611584
OpenSearch 3.0 major release: https://www.linkedin.com/posts/horovits_opensearch-activity-7325834736008880128-kCqr

Socials:
Twitter: https://twitter.com/OpenObserv
YouTube: https://www.youtube.com/@openobservabilitytalks

Dotan Horovits
============
X (Twitter): @horovits
LinkedIn: www.linkedin.com/in/horovits
Mastodon: @horovits@fosstodon
BlueSky: @horovits.bsky.social

Robert Hodges
=============
LinkedIn: https://www.linkedin.com/in/berkeleybob2105/

ceo twitch analytics github sql kubernetes speed limits yandex observability linux foundation olap clickhouse apache iceberg

The ERP Minute Episode 187 - May 20th, 2025

The ERP Advisor

May 21, 2025 · 3:35

This week, Sage announced its financial results for the six months to March 31st, 2025; Workday announced a new wave of Illuminate Agents designed to speed up hiring processes, improve worker experiences, streamline financial processes, and empower employees; UKG launched UKG Bryte payroll AI agents for both the UKG Pro and UKG Ready suites, delivering new tools that help all employees from the frontline to payroll administrators; OneStream announced a series of powerful enhancements at its Splash 2025 user conference; and Qlik announced the launch of Qlik Open Lakehouse, a fully managed Apache Iceberg solution built into Qlik Talend Cloud.

ai splash workday ukg qlik apache iceberg onestream

How RisingWave Is Redefining Real-Time Data with Postgres Power

The Data Engineering Show

May 7, 2025 · 31:36

In this episode of The Data Engineering Show, the bros sit down with Yingjun Wu, founder and CEO of RisingWave, to explore the innovative world of stream processing systems. Yingjun shares his journey from academic research to creating a Postgres-compatible streaming system that drastically reduces resource usage. They discuss how RisingWave's S3-based architecture and Postgres compatibility provide advantages over traditional systems like Flink, and explore the increasing role of Apache Iceberg in data pipelines.

ceo rising wave redefining s3 flink postgres real time data data warehousing stream processing apache iceberg

Inside the Mind of Snowflake's CEO: Bold Bets in the AI Arms Race

The MAD Podcast with Matt Turck

Apr 10, 2025 · 83:41

In this episode, we sit down with Sridhar Ramaswamy, CEO of Snowflake, for an in-depth conversation about the company's transformation from a cloud analytics platform into a comprehensive AI data cloud. Sridhar shares insights on Snowflake's shift toward open formats like Apache Iceberg and why monetizing storage was, in his view, a strategic misstep.

We also dive into Snowflake's growing AI capabilities, including tools like Cortex Analyst and Cortex Search, and discuss how the company scaled AI deployments at an impressive pace. Sridhar reflects on lessons from his previous startup, Neeva, and offers candid thoughts on the search landscape, the future of BI tools, real-time analytics, and why partnering with OpenAI and Anthropic made more sense than building Snowflake's own foundation models.

Snowflake
Website - https://www.snowflake.com
X/Twitter - https://x.com/snowflakedb

Sridhar Ramaswamy
LinkedIn - https://www.linkedin.com/in/sridhar-ramaswamy
X/Twitter - https://x.com/RamaswmySridhar

FIRSTMARK
Website - https://firstmark.com
X/Twitter - https://twitter.com/FirstMarkCap

Matt Turck (Managing Director)
LinkedIn - https://www.linkedin.com/in/turck/
X/Twitter - https://twitter.com/mattturck

(00:00) Intro and current market tumult
(02:48) The evolution of Snowflake from IPO to Today
(07:22) Why Snowflake's earliest adopters came from financial services
(15:33) Resistance to change and the philosophical gap between structured data and AI
(17:12) What is the AI Data Cloud?
(23:15) Snowflake's AI agents: Cortex Search and Cortex Analyst
(25:03) How did Sridhar's experience at Google and Neeva shape his product vision?
(29:43) Was Neeva simply ahead of its time?
(38:37) The Epiphany mafia
(40:08) The current state of search and Google's conundrum
(46:45) "There's no AI strategy without a data strategy"
(56:49) Embracing Open Data Formats with Iceberg
(01:01:45) The Modern Data Stack and the future of BI
(01:08:22) The role of real-time data
(01:11:44) Current state of enterprise AI: from PoCs to production
(01:17:54) Building your own models vs. using foundation models
(01:19:47) Deepseek and open source AI
(01:21:17) Snowflake's 1M Minds program
(01:21:51) Snowflake AI Hub

ceo ai google building current resistance ipo openai epiphany bets bi snowflakes iceberg inside the mind anthropic arms race pocs sridhar neeva modern data stack apache iceberg

Episode 301

Knee-deep in Tech

Apr 8, 2025 · 31:31

In this news episode, the trio explores the latest updates in the Windows Insider program. They also discuss how QR code authentication in Entra ID can simplify access for frontline workers in specific scenarios. In Microsoft Fabric, the focus is on integrating Apache Iceberg data with OneLake, along with notable improvements to External Data Sharing. Azure Stream Analytics now supports integration with Azure Event Hub Schema Registry. Lastly, the Azure Virtual Network Manager Network Verifier can be the tool to help gain visibility to your network connectivity in Azure.

acast qr azure windows insider entra id apache iceberg

Programmers Quickie

Mar 9, 2025 · 29:06

BOOK - Apache Iceberg: The Definitive Guide - https://amzn.to/4bD8RB1

streaming big data apache iceberg

Trends in Data Engineering – Adrian Brudaru

DataTalks.Club

Mar 7, 2025 · 56:59

In this podcast episode, we talked with Adrian Brudaru about the past, present and future of data engineering.

About the speaker:
Adrian Brudaru studied economics in Romania but soon got bored with how creative the industry was, and chose to go instead for the more factual side. He ended up in Berlin at the age of 25 and started a role as a business analyst. At the age of 30, he had enough of startups and decided to join a corporation, but quickly found out that it did not provide the challenge he wanted.

As going back to startups was not a desirable option either, he decided to postpone his decision by taking freelance work and has never looked back since. Five years later, he co-founded a company in the data space to try new things. This company is also looking to release open source tools to help democratize data engineering.

0:00 Introduction to DataTalks.Club
1:05 Discussing trends in data engineering with Adrian
2:03 Adrian's background and journey into data engineering
5:04 Growth and updates on Adrian's company, DLT Hub
9:05 Challenges and specialization in data engineering today
13:00 Opportunities for data engineers entering the field
15:00 The "Modern Data Stack" and its evolution
17:25 Emerging trends: AI integration and Iceberg technology
27:40 DuckDB and the emergence of portable, cost-effective data stacks
32:14 The rise and impact of dbt in data engineering
34:08 Alternatives to dbt: SQLMesh and others
35:25 Workflow orchestration tools: Airflow, Dagster, Prefect, and GitHub Actions
37:20 Audience questions: Career focus in data roles and AI engineering overlaps
39:00 The role of semantics in data and AI workflows
41:11 Focusing on learning concepts over tools when entering the field
45:15 Transitioning from backend to data engineering: challenges and opportunities
47:48 Current state of the data engineering job market in Europe and beyond
49:05 Introduction to Apache Iceberg, Delta, and Hudi file formats
50:40 Suitability of these formats for batch and streaming workloads
52:29 Tools for streaming: Kafka, SQS, and related trends
58:07 Building AI agents and enabling intelligent data applications
59:09 Closing discussion on the place of tools like dbt in the ecosystem

Will Apache Iceberg and Delta Lake Merge?

The Datanation Podcast - Podcast for Data Engineers, Analysts and Scientists

Feb 21, 2025

Alex Merced discusses the idea of whether Apache Iceberg and Delta Lake could merge. Follow my blog: https://medium.alexmerced.blog

lake delta merge alex merced apache iceberg

LCC 322 - Maaaaveeeeen 4 !

Les Cast Codeurs Podcast

Feb 9, 2025 · 77:13

Arnaud and Emmanuel discuss this month's news: JVM integrity, JDBC fetch size, MCP, prompt engineering, DeepSeek of course, but also Maven 4 and Maven repository proxies. And more besides; happy reading.

Recorded on February 7, 2025

Download the episode LesCastCodeurs-Episode-322.mp3 or watch the video on YouTube.

News

Languages

JVM changes to strengthen integrity https://inside.java/2025/01/03/evolving-default-integrity/
- An article on why framework vendors and users are tearing their hair out, and will keep doing so
- Guaranteeing the integrity of code and data by removing long-standing APIs: dynamic agents, setAccessible, Unsafe, JNI
- The article explains the risks as perceived by the JVM maintainers
- Frankly, the article is a bit thin on the causes; it comes across as self-promotion

JavaScript Temporal, finally a clean and modern API for handling dates in JS https://developer.mozilla.org/en-US/blog/javascript-temporal-is-coming/
- JavaScript Temporal is a new object designed to replace the Date object, which has flaws.
- It solves problems such as the lack of time zone support and mutability.
- Temporal introduces concepts such as instants, wall-clock (civil) times and durations.
- It provides classes for handling various date/time representations, both time-zone-aware and not.
- Temporal simplifies working with different calendars (for example, Chinese, Hebrew).
- It includes methods for comparing, converting and formatting dates and times.
- Browser support is experimental, with Firefox Nightly having the most complete implementation.
- A polyfill is available for trying Temporal in any browser.
Libraries
An article on JDBC fetch size and its impact on your applications https://in.relation.to/2025/01/24/jdbc-fetch-size/ Who knows the default fetch size of their driver? Depending on your use cases, it can be devastating. Example: an application that returns 12 rows with Oracle's fetch size of 10 makes 2 round trips for nothing, and imagine when 50 rows are returned. The database is the limiting factor, not Java, so raising the fetch size pays off: you use Java memory to avoid latency.
Quarkus announces the MCP Servers project to collect MCP servers written in Java https://quarkus.io/blog/introducing-mcp-servers/ Anthropic's MCP; a JDBC database introspector; a file system reader; drawing with JavaFX. Easy to start with jbang and tested with Claude Desktop, Goose, and mcp-cli. It lets your AI tap into the power of Java libraries. Incidentally, Spring has released version 0.6 of its MCP support https://spring.io/blog/2025/01/23/spring-ai-mcp-0
Infrastructure
Apache Flink on Kubernetes https://www.decodable.co/blog/get-running-with-apache-flink-on-kubernetes-2 A very thorough two-part article on installing Flink on Kubernetes: installation and setup, but also checkpointing, HA, and observability.
Data and Artificial Intelligence
10 prompt engineering techniques https://medium.com/google-cloud/10-prompt-engineering-techniques-every-beginner-should-know-bf6c195916c7 If you want to go further, the article references a very good white paper on prompt engineering https://www.kaggle.com/whitepaper-prompt-engineering The techniques covered:
Zero-Shot Prompting: You ask the AI to answer a question directly, without giving it any prior example. It is like asking a person a question without giving them any context.
Few-Shot Prompting: You give the AI one or more examples of the task you want it to perform. It is like showing someone how to do something before asking them to do it.
System Prompting: You define the overall context and the goal of the task for the AI. It is like giving the AI general instructions about what it should do.
Role Prompting: You assign the AI a specific role (teacher, journalist, etc.). It is like asking someone to play a specific part.
Contextual Prompting: You provide additional information or context for the task. It is like giving someone all the information needed to answer a question.
Step-Back Prompting: You first ask a general question, then use the answer to ask a more specific one. It is like asking an open question before asking a more closed one.
Chain-of-Thought Prompting: You ask the AI to show step by step how it reaches its conclusion. It is like asking someone to explain their reasoning.
Self-Consistency Prompting: You ask the AI the same question several times and compare the answers to find the most consistent one. It is like checking an answer by asking the question in different forms.
Tree-of-Thoughts Prompting: You let the AI explore several reasoning paths at the same time. It is like considering all the possible options before making a decision.
ReAct Prompting: You let the AI interact with external tools to solve complex problems. It is like giving someone the tools needed to solve a problem.
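Two of the techniques listed above, few-shot and self-consistency prompting, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not code from the article: `ask_model` is a hypothetical callable standing in for whatever LLM client you use, and `build_few_shot_prompt` with its "Q:/A:" layout is an arbitrary choice made for the example.

```python
from collections import Counter
from typing import Callable, Iterable, Tuple


def build_few_shot_prompt(examples: Iterable[Tuple[str, str]], question: str) -> str:
    """Few-shot: prepend worked examples, then leave the last answer open."""
    shots = [f"Q: {q}\nA: {a}" for q, a in examples]
    shots.append(f"Q: {question}\nA:")
    return "\n\n".join(shots)


def self_consistent_answer(
    ask_model: Callable[[str], str],  # hypothetical: prompt in, answer text out
    question: str,
    samples: int = 5,
) -> str:
    """Self-consistency: ask several times and keep the most frequent answer."""
    if samples < 1:
        raise ValueError("samples must be at least 1")
    # Normalise lightly so that "Paris" and "paris " count as the same answer.
    answers = [ask_model(question).strip().lower() for _ in range(samples)]
    # On a tie, Counter keeps the answer that was seen first.
    return Counter(answers).most_common(1)[0][0]


if __name__ == "__main__":
    # A canned "model" so the sketch runs without any network call.
    canned = iter(["Paris", "paris ", "Lyon"])
    prompt = build_few_shot_prompt([("Capital of Italy?", "Rome")], "Capital of France?")
    print(prompt)
    print(self_consistent_answer(lambda _prompt: next(canned), prompt, samples=3))
```

With a real model the call would usually be sampled at a non-zero temperature; otherwise every sample tends to come back identical and the vote adds nothing.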
GenAI patterns from Thoughtworks https://martinfowler.com/articles/gen-ai-patterns/ Very introductory and pre-RAG. Direct prompt, a direct call to the LLM: limited in knowledge and in control over the experience. Evals: assessing an LLM's output with several techniques, but fundamentally a function that takes the request and the response and returns a numeric score; evaluation by an LLM (the same one or another) or by humans. Run the evaluations from the build pipeline but also live, since LLMs can change. Describes embeddings, notably of images but also of text, with the notion of context.
DeepSeek and the end of NVIDIA's dominance https://youtubetranscriptoptimizer.com/blog/05_the_short_case_for_nvda An article on why NVIDIA's margins are going to be challenged. 90% margins, after all, thanks to the biggest GPUs and to CUDA, which is proprietary. But alternative hardware approaches exist that are more efficient (TPUs and big wafers). Google, MS, and others are building their own alternative GPUs. CUDA is less and less the lingua franca, with investment in alternative intermediate languages by Apple, Google, OpenAI, etc. The article discusses DeepSeek, which delivered a slap to the LLM world. They built a competitor to GPT-4o and o1 for 5 million dollars, with impressive reasoning capabilities. The key was a lot of optimization tricks, the biggest being neuron weights on 8 bits versus 32 for the others, and therefore quantizing on the fly and at training time. Also a lot of innovative reinforcement learning, and Mixture of Experts. So roughly 50x cheaper than OpenAI, and no more need for GPUs with tons of VRAM. Oh, and DeepSeek is open source. An article from SemiAnalysis shifts the narrative somewhat: the DeepSeek paper says a lot through its omissions. For example, the 6M is only the inference on GPUs, not the research costs and the various trials and errors; by comparison, Claude Sonnet cost 10M in inference. DeepSeek has a lot of pre-ban CPUs and some post-ban ones, valued at 5 billion in investment. Their advances and their openness remain extremely interesting.
An intro to Apache Iceberg http://blog.ippon.fr/2025/01/17/la-revolution-des-donnees-lavenement-des-lakehouses-avec-apache-iceberg/ Born from the limits of the data lake (unstructured) and of data warehouses (limited in data variety and volume), enter the lakehouses, and in particular Apache Iceberg, which came out of Netflix. Schema management, but flexible; copy-on-write vs merge-on-read depending on needs; guarantees atomicity, consistency, isolation, and durability; time travel and rollback; hidden partitions (which abstract the partitioning and its transforms) and partition evolution; compatible with compute engines such as Spark, Trino, Flink, etc. The article explains the structure of the metadata and the data.
Guillaume has fun generating science-fiction short stories by programming AI agents with LangChain4j and also with workflows https://glaforge.dev/posts/2025/01/27/an-ai-agent-to-generate-short-scifi-stories/ https://glaforge.dev/posts/2025/01/31/a-genai-agent-with-a-real-workflow/ Building an automated science-fiction short story generator with Gemini and Imagen in Java, using LangChain4j, on Google Cloud. The system generates stories every night, complete with illustrations created by the Imagen 3 model, and publishes them on a website.
A self-reflection step uses Gemini to select the best image for each chapter. The agent uses an explicit workflow driven by Java code, where the steps are predefined in code rather than relying on LLM-based planning. The code is available on GitHub and the application is deployed on Google Cloud. The article contrasts explicit-workflow agents with autonomous agents, highlighting the trade-offs of each approach. Sometimes autonomous AI agents that manage their own planning hallucinate a bit too much and fail to establish a proper plan, or fail to follow it correctly, or even hallucinate function calls. The project uses Cloud Build, Cloud Run jobs, Cloud Scheduler, Firestore as the database, and Firebase for deploying and automating the frontend. In the second article the approach is different: Guillaume uses a workflow tool rather than driving the planning with Java code. The imperative approach uses explicit Java code to orchestrate the workflow, offering precise control and parallelization. The declarative approach uses a YAML file to define the workflow, specifying the steps, inputs, outputs, and execution order. The workflow includes steps to generate a story with Gemini 2, create an image prompt, generate images with Imagen 3, and save the result in Cloud Firestore (a NoSQL database). The main advantages of the imperative approach are precise control, explicit parallelization, and familiar programming tools. The main advantages of the declarative approach are workflow definitions that are perhaps easier to understand (even if it is YAML, yuck!), visualization, scalability, and simplified maintenance (you can just change the YAML in the console, like in the good old days of PHP in prod).
The drawbacks of the imperative approach include the need for programming knowledge, potential maintenance challenges, and container management. The drawbacks of the declarative approach include painful YAML authoring, limited control over parallelization, the lack of a local emulator, and less intuitive debugging. The choice between the approaches depends on the project's requirements, the declarative one being suited to simpler workflows. The article concludes that declarative planning can help AI agents stay focused and predictable.
Tooling
Vulnerabilities in Maven proxies https://github.blog/security/vulnerability-research/attacks-on-maven-proxy-repositories/ Whatever the language or technology, it is highly recommended to set up repository managers as proxies to better control the dependencies that go into building your products. Michael Stepankin of the GitHub Security Lab team set out to find whether these managers are themselves a source of vulnerabilities, by studying a few CVEs in products such as JFrog Artifactory, Sonatype Nexus, and Reposilite. Some flaws come from the products' UI, which can display artifacts (e.g., put some JS in a POM file) and even browse inside them (e.g., view the contents of a jar/zip, then exploit the API to read, or even modify, files on the server outside the archives). Artifacts can also be compromised by playing with proprietary URL parameters or with naming and encodings. In short, nothing is simple. Every system adds complexity, and it is important to keep them up to date. You need to actively monitor your supply chain through various means and not bet everything on the repository manager. The author gave a talk on the subject: https://www.youtube.com/watch?v=0Z_QXtk0Z54
Apache Maven 4… Soon, we promise… What will be in it?
https://gnodet.github.io/maven4-presentation/ And also https://github.com/Bukama/MavenStuff/blob/main/Maven4/whatsnewinmaven4.md
Apache Maven 4: slowly but surely… that is the principle of a project. Maven 4.0.0-rc-2 is available (Dec 2024). Maven is more than 20 years old and widely used in the Java ecosystem. Backward compatibility has always been a priority, but it has limited flexibility. Maven 4 introduces significant changes, notably a new build schema and code improvements.
POM changes
Separation of the Build-POM and the Consumer-POM. Build-POM: contains build-specific information (e.g., plugins, configurations). Consumer-POM: contains only the information needed by artifact consumers (e.g., dependencies).
New model version 4.1.0: used only for the Build-POM, while the Consumer-POM stays at 4.0.0 for compatibility. Introduces new elements and marks some as deprecated.
Modules renamed to subprojects: "Modules" become "Subprojects" to avoid confusion with Java modules. The <subprojects> element replaces <modules> (which remains supported).
New packaging type "bom" (Bill of Materials): distinguishes parent POMs from dependency-management BOMs. Supports exclusions and classifier-based imports.
Explicit root directory declaration: lets you explicitly define the project's root directory. Removes any ambiguity about where project roots are located.
New directory variables: ${project.rootDirectory}, ${session.topDirectory}, and ${session.rootDirectory} for better path handling. They replace the old unofficial workarounds and deprecated internal variables.
Support for alternative POM syntaxes: introduction of the ModelParser SPI, which allows alternative syntaxes for the POM. The Apache Maven Hocon Extension is an early example of this feature.
Improvements for subprojects
Automatic parent versioning: it is no longer necessary to define the parent version in each subproject. Works with model version 4.1.0 and extends to dependencies within the project.
Full support for CI-friendly variables: the Flatten Maven Plugin is no longer required. Supports variables such as ${revision} for versioning. Can be set via maven.config or on the command line (mvn verify -Drevision=4.0.1).
Reactor improvements and fixes: bug fix for the handling of --also-make when resuming builds. New --resume (-r) option to restart from the last failed subproject. Subprojects already built successfully are skipped on resume. Subfolder-aware builds: tools can be run on selected subprojects only. Recommendation: use mvn verify rather than mvn clean install.
Other improvements: consistent timestamps for all subprojects in packaged archives. Improved deployment: deployment happens only if all subprojects build successfully.
Workflow, lifecycle, and execution changes
Java 17 required to run Maven: Java 17 is the minimum JDK required to run Maven 4. Older Java versions can still be targeted for compilation via Maven Toolchains. Java 17 was preferred over Java 21 because of its longer long-term support.
Plugin updates and application maintenance: removal of deprecated features (e.g., Plexus Containers, ${pom.} expressions). The Super POM is updated, changing the default plugin versions. Builds may behave differently; pin plugin versions to avoid unexpected changes. Maven 4 displays a warning if default versions are used.
New "Fail on Severity" parameter: the build can fail if log messages reach a given severity level (e.g., WARN). Usable via --fail-on-severity WARN or -fos WARN.
Maven Shell (mvnsh): each mvn run previously required a full Java/Maven restart. Maven 4 introduces Maven Shell (mvnsh), which keeps a single resident Maven process open for multiple commands. This improves performance and reduces build times. Alternative: use Maven Daemon (mvnd), which manages a pool of resident Maven processes.
Architecture
An article on feature flags with Unleash https://feeds.feedblitz.com//911939960/0/baeldungImplement-Feature-Flags-in-Java-With-Unleash For A/B testing and faster development cycles, to "test in prod". Shows how to run Unleash under Docker and add the library to Java code to test a feature flag.
Security
Keycloak 26.1 https://www.keycloak.org/2025/01/keycloak-2610-released.html Node discovery via the database instead of network exchange, virtual threads for Infinispan and JGroups, OpenTelemetry tracing supported, and plenty of security features.
Law, society and organization
The major components of a conference's costs and revenues, here for bdx.io https://bsky.app/profile/ameliebenoit33.bsky.social/post/3lgzslhedzk2a 44% tickets, 52% sponsors, 38% venue rental, 29% catering and coffee, 12% booth contractor, 5% speaker expenses (so not all of them).
Ask Me Anything
Julien de Provin: I really like Quarkus's "continuous testing" mode, and I was wondering whether an alternative exists outside Quarkus or, failing that, resources on how it works? I would love an agnostic tool usable on the non-Quarkus projects I work on, even if it takes a bit of elbow grease (or knuckle grease, in this case).
https://github.com/infinitest/infinitest/
Conferences
The list of conferences, from the Developers Conferences Agenda/List by Aurélie Vache and contributors:
6-7 February 2025: Touraine Tech - Tours (France)
21 February 2025: LyonJS 100 - Lyon (France)
28 February 2025: Paris TS La Conf - Paris (France)
6 March 2025: DevCon #24 : 100% IA - Paris (France)
13 March 2025: Oracle CloudWorld Tour Paris - Paris (France)
14 March 2025: Rust In Paris 2025 - Paris (France)
19-21 March 2025: React Paris - Paris (France)
20 March 2025: PGDay Paris - Paris (France)
20-21 March 2025: Agile Niort - Niort (France)
25 March 2025: ParisTestConf - Paris (France)
26-29 March 2025: JChateau Unconference 2025 - Cour-Cheverny (France)
27-28 March 2025: SymfonyLive Paris 2025 - Paris (France)
28 March 2025: DataDays - Lille (France)
28-29 March 2025: Agile Games France 2025 - Lille (France)
3 April 2025: DotJS - Paris (France)
3 April 2025: SoCraTes Rennes 2025 - Rennes (France)
4 April 2025: Flutter Connection 2025 - Paris (France)
4 April 2025: aMP Orléans 04-04-2025 - Orléans (France)
10-11 April 2025: Android Makers - Montrouge (France)
10-12 April 2025: Devoxx Greece - Athens (Greece)
16-18 April 2025: Devoxx France - Paris (France)
23-25 April 2025: MODERN ENDPOINT MANAGEMENT EMEA SUMMIT 2025 - Paris (France)
24 April 2025: IA Data Day 2025 - Strasbourg (France)
29-30 April 2025: MixIT - Lyon (France)
7-9 May 2025: Devoxx UK - London (UK)
15 May 2025: Cloud Toulouse - Toulouse (France)
16 May 2025: AFUP Day 2025 Lille - Lille (France)
16 May 2025: AFUP Day 2025 Lyon - Lyon (France)
16 May 2025: AFUP Day 2025 Poitiers - Poitiers (France)
24 May 2025: Polycloud - Montpellier (France)
24 May 2025: NG Baguette Conf 2025 - Nantes (France)
5-6 June 2025: AlpesCraft - Grenoble (France)
5-6 June 2025: Devquest 2025 - Niort (France)
10-11 June 2025: Modern Workplace Conference Paris 2025 - Paris (France)
11-13 June 2025: Devoxx Poland - Krakow (Poland)
12-13 June 2025: Agile Tour Toulouse - Toulouse (France)
12-13 June 2025: DevLille - Lille (France)
13 June 2025: Tech F'Est 2025 - Nancy (France)
17 June 2025: Mobilis In Mobile - Nantes (France)
24 June 2025: WAX 2025 - Aix-en-Provence (France)
25-26 June 2025: Agi'Lille 2025 - Lille (France)
25-27 June 2025: BreizhCamp 2025 - Rennes (France)
26-27 June 2025: Sunny Tech - Montpellier (France)
1-4 July 2025: Open edX Conference - 2025 - Palaiseau (France)
7-9 July 2025: Riviera DEV 2025 - Sophia Antipolis (France)
18-19 September 2025: API Platform Conference - Lille (France) & Online
2-3 October 2025: Volcamp - Clermont-Ferrand (France)
6-10 October 2025: Devoxx Belgium - Antwerp (Belgium)
9-10 October 2025: Forum PHP 2025 - Marne-la-Vallée (France)
16-17 October 2025: DevFest Nantes - Nantes (France)
4-7 November 2025: NewCrafts 2025 - Paris (France)
6 November 2025: dotAI 2025 - Paris (France)
7 November 2025: BDX I/O - Bordeaux (France)
12-14 November 2025: Devoxx Morocco - Marrakech (Morocco)
28-31 January 2026: SnowCamp 2026 - Grenoble (France)
23-25 April 2026: Devoxx Greece - Athens (Greece)
17 June 2026: Devoxx Poland - Krakow (Poland)
Contact us
To react to this episode, come and discuss on the Google group https://groups.google.com/group/lescastcodeurs Contact us via X/Twitter https://twitter.com/lescastcodeurs or Bluesky https://bsky.app/profile/lescastcodeurs.com Do a crowdcast or a crowdquestion. Support Les Cast Codeurs on Patreon https://www.patreon.com/LesCastCodeurs All episodes and all the info at https://lescastcodeurs.com/

netflix google apple france spring ms data fail tree web dans car expert tout ia science fiction faire chain gemini unleash nvidia blue sky api peut conf sous ui nouvelle nouveau 5m java github guillaume workflow apis aur bref mise ici temporal 10m llm imagen nouvelles warn arnaud gestion suppression prise wax cpu genai maven gpu google cloud php bient unsafe vall 6m js loi prend kubernetes rag quelque urls aix intelligence artificielle pom orl enregistr changements montre modules mixture milliards paris france mcp possibilit fonctionne flink cuda nosql utilis constructions firebase dessine poms jvm vache data warehouses yaml remplace tpu cves versioning devcon jdk lyon france nouveau mod vram boms apache iceberg cloud run provence france jdbc strasbourg france firestore lille france cloud build cloud firestore

Amazon S3 Tables explained: Better storage for AWS Analytics workloads

Cloud Masters

Jan 29, 2025 26:08

AWS Analytics expert Swapnil Bhoite joins us to break down Amazon S3 Tables. From comparing the Parquet and Apache Iceberg formats to explaining critical features like compaction and snapshot management, Swapnil explores how this fully managed service streamlines data lake operations. Learn when to adopt S3 Tables, understand its cost-performance benefits, and discover key considerations for migrating from existing Glue catalog implementations: essential knowledge for teams looking to scale their analytics workloads on AWS.

analytics storage tables aws glue workload parquet amazon s3 swapnil apache iceberg

The ERP Minute Episode 170 - January 21st, 2025

The ERP Advisor

Jan 22, 2025 3:05

As we continue to charge into 2025, major AI announcements came from Epicor, Oracle, and Microsoft. First, Epicor launched Epicor Prism, a network of vertical AI agents built specifically for the supply chain industries. Then, Oracle and Adarga, a leader in AI-driven information intelligence, announced they are partnering to bring Adarga's Vantage software to Oracle Cloud Infrastructure and Oracle's distributed cloud. Finally, Microsoft and Pearson, the world's lifelong learning company, announced a strategic collaboration on Tuesday to help address one of the top challenges facing organizations globally: skilling for the era of AI. To round out the week, Qlik announced the acquisition of Upsolver, a pioneer in real-time data streaming and Apache Iceberg optimization.
Connect with us!
https://www.erpadvisorsgroup.com
866-499-8550
LinkedIn: https://www.linkedin.com/company/erp-advisors-group
Twitter: https://twitter.com/erpadvisorsgrp
Facebook: https://www.facebook.com/erpadvisors
Instagram: https://www.instagram.com/erpadvisorsgroup
Pinterest: https://www.pinterest.com/erpadvisorsgroup
Medium: https://medium.com/@erpadvisorsgroup

ai microsoft oracle pearson vantage qlik epicor apache iceberg

A Return to Greatness, or Degenerate Day 3?

AWS Morning Brief

Dec 9, 2024 18:39

AWS Morning Brief for the week of December 9, with Corey Quinn.
Links:
AWS announces access to VPC resources over AWS PrivateLink
Announcing Amazon Aurora DSQL (Preview)
Announcing Amazon Bedrock IDE in preview as part of Amazon SageMaker Unified Studio
AWS announces Amazon CloudWatch Database Insights
Amazon DynamoDB global tables previews multi-Region strong consistency
Amazon EC2 introduces Allowed AMIs to enhance AMI governance
Announcing Amazon EC2 I8g instances
Announcing Amazon EKS Auto Mode
Announcing Amazon EKS Hybrid Nodes
Announcing Amazon Elastic VMware Service (Preview)
Announcing Amazon FSx Intelligent-Tiering, a new storage class for FSx
Amazon Q Developer can now automate code reviews
Amazon Q Developer announces automatic unit test generation to accelerate feature development
Amazon S3 adds new default data integrity protections
Announcing Amazon S3 Metadata (Preview) – Easiest and fastest way to manage your metadata
Amazon S3 launches storage classes for AWS Dedicated Local Zones
Announcing Amazon S3 Tables – Fully managed Apache Iceberg tables optimized for analytics workloads
AWS announces Amazon SageMaker Lakehouse
AWS Control Tower launches managed controls using declarative policies
AWS announces AWS Data Transfer Terminal for high-speed data uploads
Amazon Web Services announces declarative policies
Introducing AWS Glue 5.0
AWS announces Invoice Configuration
AWS Marketplace now offers EC2 Image Builder components from independent software vendors
AWS announces AWS Security Incident Response for general availability
Announcing AWS Transfer Family web apps
Buy with AWS accelerates solution discovery and procurement on AWS Partner websites
Oracle Database@AWS is now in limited preview
PartyRock improves app discovery and announces upcoming free daily use
Announcing the preview of Amazon SageMaker Unified Studio
VPC Lattice now includes TCP support with VPC Resources
Announcing the 2024 Geo and Global AWS Partners of the Year
Amazon MemoryDB Multi-Region is now generally available
Top announcements of AWS re:Invent 2024
Sponsor
The Duckbill Group: https://www.duckbillgroup.com/

amazon greatness cloud region aws devops geo degenerate tcp vpc corey quinn apache iceberg last week in aws

63 – Reinvent, AWS S3 Table Buckets and Apache Iceberg

The Datanation Podcast - Podcast for Data Engineers, Analysts and Scientists

Dec 6, 2024

Alex Merced discusses his experience at AWS re:Invent. Follow Alex at AlexMerced.com/data

table aws reinvent buckets aws s3 alex merced apache iceberg

AI, Community, and the Future of Generative Applications

Open at Intel

Nov 27, 2024 20:53

In this engaging conversation at the All Things Open conference, Tim Spann, Principal Developer Advocate at Zilliz, discusses the importance of community collaboration in advancing AI technologies. He emphasizes the need for diverse perspectives in solving complex problems and highlights his work with the Milvus open source vector database. Tim also explains the evolving landscape of retrieval augmented generation (RAG) and its applications and shares insights into the future of AI development. The conversation concludes on a lighter note with Tim describing his creative use of Milvus in a fun Halloween project to catalog and identify ghosts. 00:00 Introduction 00:41 Meet Tim Spann: Principal Developer Advocate 01:35 The Importance of Community in AI 02:56 Advanced RAG and Multimodal Models 06:17 The Future of Agentic RAG 09:04 Challenges and Excitement in AI Development 13:35 Building AI the Right Way 17:50 Fun with AI: Capturing Ghosts 19:24 Conclusion and Final Thoughts Guest: Tim Spann is a Principal Developer Advocate for Zilliz and Milvus. He works with Apache NiFi, Apache Kafka, Apache Pulsar, Apache Flink, Flink SQL, Apache Pinot, Trino, Apache Iceberg, DeltaLake, Apache Spark, Big Data, IoT, Cloud, AI/DL, machine learning, and deep learning. Tim has over ten years of experience with the IoT, big data, distributed computing, messaging, streaming technologies, and Java programming. Previously, he was a Principal Developer Advocate at Cloudera, Developer Advocate at StreamNative, Principal DataFlow Field Engineer at Cloudera, a Senior Solutions Engineer at Hortonworks, a Senior Solutions Architect at AirisData, a Senior Field Engineer at Pivotal and a Team Leader at HPE. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton & NYC on Big Data, Cloud, IoT, deep learning, streaming, NiFi, the blockchain, and Spark. Tim is a frequent speaker at conferences such as ApacheCon, DeveloperWeek, Pulsar Summit and many more. 
He holds a BS and MS in computer science.

Data Lakehouses & Apache Iceberg

The Cloudcast

Oct 16, 2024 28:35

Alex Merced (@AMdatalakehouse, Senior Tech Evangelist, @dremio) talks about everything data and we dig deep into Apache Iceberg and Data Lakehouses.
SHOW: 865
Want to go to All Things Open in Raleigh for FREE? (Oct 27th-29th) We are offering 5 Free passes, first come, first serve for the Cloudcast Community -> Registration Link
Instructions: Click reg link. Click “Get Tickets”. Choose ticket option. Proceed with registration (discount will automatically be applied, cost will be $0).
SHOW TRANSCRIPT: The Cloudcast #865 Transcript
SHOW VIDEO: https://youtube.com/@TheCloudcastNET
CLOUD NEWS OF THE WEEK: http://bit.ly/cloudcast-cnotw
NEW TO CLOUD? CHECK OUT OUR OTHER PODCAST: "CLOUDCAST BASICS"
SHOW NOTES:
Dremio (homepage)
Hands-on with Apache Iceberg Tutorial
Apache Iceberg Crash Course
Data Lakehouses and Apache Hudi (Cloudcast Eps. 694)
Apache Iceberg, the Definitive Guide (eBook)
Apache Iceberg (homepage)
Iceberg + Nessie Catalog (homepage)
Iceberg + Polaris Catalog (homepage)
AlexMerced.com
DataLakehouseHub.com
Topic 1 - Welcome to the show. Tell us a little bit about your background.
Topic 2 - It's been a little while since we talked about Data Lakehouses. Can you give us a little bit of background on this space, and what the most recent dynamics are around these technologies?
Topic 3 - What are the typical integrations with a Data Lakehouse? How are users/developers typically interacting with Data Lakehouse technologies? [The marketplace for Iceberg catalogs like Nessie and Polaris]
Topic 4 - How does an open data format like Apache Iceberg fit into the bigger picture of data lakehouses, or large scale stores of data?
Topic 5 - How does Dremio enable Iceberg? How does Dremio sit in the intersection of Data Lakehouse, Data Mesh and Data Virtualization trends, all of which come from the same fundamental problem, the growing scale of data use cases?
Topic 6 - We've seen companies start to rethink their data in the cloud strategies. Are you seeing on-premises making a comeback for large data applications?
FEEDBACK?
Email: show at the cloudcast dot net
Twitter: @cloudcastpod
Instagram: @cloudcastpod
TikTok: @cloudcastpod

ai data hands big data raleigh iceberg data mesh alex merced data lakehouse apache iceberg dremio all things open

61 – What’s New In dbt? (dbt coalesce 2024)

The Datanation Podcast - Podcast for Data Engineers, Analysts and Scientists

Oct 11, 2024

Alex Merced discusses the news and announcements from dbt Coalesce 2024. Announcements Alex didn’t mention: – dbt Apache Iceberg support, which is done by working with Iceberg-supporting query engines like Dremio – Healthtiles with more information on your dashboard about the health of your models – Auto-exposures in Tableau triggering BI Dashboard updates when models […]

auto iceberg tableau coalesce alex merced apache iceberg dremio

Starburst CEO on Future Of Data-Query Engines

Tech Disruptors

Oct 8, 2024 44:40

The highly distributed, dispersed and dynamic nature of enterprise data fuels demand for robust data-query engines for analytics and to drive intelligence. In this episode of the Tech Disruptors podcast, Starburst CEO Justin Borgman joins Sunil Rajgopal, senior software analyst at Bloomberg Intelligence, to discuss the shifting landscape for these products. They examine the future of data solutions, the evolving competitive landscape and developers' embrace of open-table formats like Apache Iceberg, with Borgman saying this was “the summer of Iceberg.” The two also talk about Starburst's product journey, competition with Dremio and Snowflake, and corporate IT-spending momentum.

data snowflakes iceberg engines query starburst bloomberg intelligence borgman apache iceberg dremio

Leveraging Open Source Technologies for Data Lakehouses with Alex Merced, Senior Tech Evangelist at Dremio

Over The Edge

Oct 2, 2024 44:01

What makes data lakehouses a game changer in modern data management? In this episode, Bill sits down with Alex Merced, Senior Tech Evangelist at Dremio, to explore the evolution of data lakehouses and their role in bridging the gap between data lakes and data warehouses. Alex breaks down the components of data lakehouses and dives into the rise of Apache Iceberg.
Key Quotes:
“I love just get really deep into technology, really see what it does. And then scream at the rooftops how cool it is. And basically that was my charter. And [Apache] Iceberg, the more I learned about it, the more I realized this is really interesting.”
“Interoperability and data. Basically, a lot of the things that kept data in silos is now breaking apart.”
“So here we're talking about something that's going to be a standard. And that's when I think of the highest levels of openness matter because if it's something that a whole industry is going to build on, it should be something that the whole industry has to say in its evolution…And that's the beauty of openness that it does create these nice sort of places where we can collaborate and compete together.”
Timestamps:
(01:32) How Alex got started in his career
(03:54) Breaking down data lakehouses
(07:08) The idea behind an open data lakehouse
(10:10) Alex's involvement with Apache Iceberg
(15:13) Key components of a data lakehouse
(23:41) The growth of Apache Iceberg
(32:07) Dremio's Apache Iceberg crash course
(38:43) Explaining self-service analytics
Sponsor:
Over the Edge is brought to you by Dell Technologies to unlock the potential of your infrastructure with edge solutions. From hardware and software to data and operations, across your entire multi-cloud environment, we're here to help you simplify your edge so you can generate more value. Learn more by visiting dell.com/edge or click on the link in the show notes.
Credits:
Over the Edge is hosted by Bill Pfeifer, and was created by Matt Trifiro and Ian Faison. Executive producers are Matt Trifiro, Ian Faison, Jon Libbey and Kyle Rusca. The show producer is Erin Stenhouse. The audio engineer is Brian Thomas. Additional production support from Elisabeth Plutko.
Links:
Follow Bill on LinkedIn
Follow Alex on LinkedIn

technology data executives senior leveraging explaining open source interoperability dell technologies edge computing brian thomas data warehouses over the edge tech evangelist alex merced data lakehouse apache iceberg dremio ian faison

267: Why the Data Lakehouse Is the Future—But What's Stopping It from Getting There? - Upsolver

Data Protection Gumbo

Oct 1, 2024 26:37

Ori Rafael, CEO and co-founder of Upsolver explores the future of data management through data lakehouses. He explains the evolution of the lakehouse, a revolutionary architecture that combines the best of data lakes and warehouses. You will gain insights into key technologies like Apache Iceberg, how lakehouses enable advanced use cases such as AI, and how they help businesses reduce costs.

ceo ai cybersecurity stopping big data data lakehouse apache iceberg

How Apache Iceberg and Flink Can Ease Developer Pain

The New Stack Podcast

Sep 12, 2024 47:08

In this New Stack Makers episode, Adi Polak, Director, Advocacy and Developer Experience Engineering at Confluent, discusses the operational and analytical estates in data infrastructure. The operational estate focuses on fast, low-latency event-driven applications, while the analytical estate handles long-running data crunching tasks. Challenges arise when "schema evolution" from upstream operational changes impacts downstream analytics, creating complexity for developers. Apache Iceberg and Flink help mitigate these issues. Iceberg, a table format developed by Netflix, optimizes querying by managing file relationships within a data lake, reducing processing time and errors. It has been widely adopted by major companies like Airbnb and LinkedIn. Apache Flink, a versatile data processing framework, is driving two key trends: shifting some batch processing tasks into stream processing and transitioning microservices into Flink streaming applications. This approach enhances system reliability, lowers latency, and meets customer demands for real-time data, like instant flight status updates. Together, Iceberg and Flink streamline data infrastructure, addressing developer pain points and improving efficiency.
Learn more from The New Stack about Apache Iceberg and Flink:
Unfreeze Apache Iceberg to Thaw Your Data Lakehouse
Apache Flink: 2023 Retrospective and Glimpse into the Future
4 Reasons Why Developers Should Use Apache Flink
Join our community of newsletter subscribers to stay on top of the news and at the top of your game.

#681: Amazon DynamoDB Deep Dive

AWS Podcast

Aug 19, 2024 48:56

Simon is joined by Jason Hunter, AWS Principal Specialist Solutions Architect, to dive super-deep into how to make the most of DynamoDB. Whether you are new to DynamoDB or have been using it for years, there is something in this episode for everyone!
Shownotes:
Jason's Blog Posts: https://aws.amazon.com/blogs/database/author/jzhunter/
The Apache Iceberg blog: https://aws.amazon.com/blogs/database/use-amazon-dynamodb-incremental-export-to-update-apache-iceberg-tables/
Traffic spikes (on-demand vs provisioned): https://aws.amazon.com/blogs/database/handle-traffic-spikes-with-amazon-dynamodb-provisioned-capacity/
Cost-effective bulk actions like delete: https://aws.amazon.com/blogs/database/cost-effective-bulk-processing-with-amazon-dynamodb/
A deep dive on partitions: https://aws.amazon.com/blogs/database/part-1-scaling-dynamodb-how-partitions-hot-keys-and-split-for-heat-impact-performance/
Global tables prescriptive guidance (the 25 page deep dive): https://docs.aws.amazon.com/prescriptive-guidance/latest/dynamodb-global-tables/introduction.html

global cost deep dive traffic blog posts dynamodb apache iceberg jason hunter amazon dynamodb

What is a Data Lakehouse with Upsolver's Ori Rafael

Partially Redacted: Data Privacy, Security & Compliance

Aug 14, 2024 31:59

In this episode, we sit down with Ori Rafael, CEO and Co-founder of Upsolver, to explore the rise of the lakehouse architecture and its significance in modern data management. Ori breaks down the origins of the lakehouse and how it leverages S3 to provide scalable and cost-effective storage. We discuss the critical role of open table formats like Apache Iceberg in unifying data lakes and warehouses, and how ETL processes differ between these environments. Ori also shares his vision for the future, highlighting how Upsolver is positioned to empower organizations as they navigate the rapidly evolving data landscape.

ceo ori s3 etl data lakehouse apache iceberg

E142: Redefining Self-Serve Analytics with Dremio

Open Source Startup Podcast

Jul 16, 2024 41:26

Tomer Shiran is Founder of Dremio, the data lakehouse platform for self-service analytics and AI based on open source frameworks Apache Arrow, which the Dremio team created, and Apache Iceberg. Dremio has raised over $400M from investors including Norwest, Redpoint, Adams Street, Sapphire, Insight, and Lightspeed. They are currently valued at $2B. In this episode, we dig into Tomer's journey from MapR to Dremio, his initial vision for making the data stack more accessible, their first breakthrough with Apache Arrow and a columnar-format approach, focusing first on project-market fit before monetization, adding support for Apache Iceberg, how they're using AI to improve user experiences & more!

founders ai redefining analytics 2b 400m lightspeed tomer self serve redpoint norwest apache arrow mapr apache iceberg dremio

Improve Data Quality Through Engineering Rigor And Business Engagement With Synq

Data Engineering Podcast

Jun 30, 2024 59:48

Summary This episode features an insightful conversation with Petr Janda, the CEO and founder of Synq. Petr shares his journey from being an engineer to founding Synq, emphasizing the importance of treating data systems with the same rigor as engineering systems. He discusses the challenges and solutions in data reliability, including the need for transparency and ownership in data systems. Synq's platform helps data teams manage incidents, understand data dependencies, and ensure data quality by providing insights and automation capabilities. Petr emphasizes the need for a holistic approach to data reliability, integrating data systems into broader business processes. He highlights the role of data teams in modern organizations and how Synq is empowering them to achieve this. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Your host is Tobias Macey and today I'm interviewing Petr Janda about Synq, a data reliability platform focused on leveling up data teams by supporting a culture of engineering rigor Interview Introduction How did you get involved in the area of data management? Can you describe what Synq is and the story behind it? 
Data observability/reliability is a category that grew rapidly over the past ~5 years and has several vendors focused on different elements of the problem. What are the capabilities that you saw as lacking in the ecosystem which you are looking to address? Operational/infrastructure engineers have spent the past decade honing their approach to incident management and uptime commitments. How do those concepts map to the responsibilities and workflows of data teams? Tooling only plays a small part in SLAs and incident management. How does Synq help to support the cultural transformation that is necessary? What does an on-call rotation for a data engineer/data platform engineer look like as compared with an application-focused team? How does the focus on data assets/data products shift your approach to observability as compared to a table/pipeline centric approach? With the focus on sharing ownership beyond the boundaries on the data team there is a strong correlation with data governance principles. How do you see organizations incorporating Synq into their approach to data governance/compliance? Can you describe how Synq is designed/implemented? How have the scope and goals of the product changed since you first started working on it? For a team who is onboarding onto Synq, what are the steps required to get it integrated into their technology stack and workflows? What are the types of incidents/errors that you are able to identify and alert on? What does a typical incident/error resolution process look like with Synq? What are the most interesting, innovative, or unexpected ways that you have seen Synq used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Synq? When is Synq the wrong choice? What do you have planned for the future of Synq? 
Contact Info LinkedIn (https://www.linkedin.com/in/petr-janda/?originalSubdomain=dk) Substack (https://substack.com/@petrjanda) Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story. Links Synq (https://www.synq.io/) Incident Management (https://www.pagerduty.com/resources/learn/what-is-incident-management/) SLA == Service Level Agreement (https://en.wikipedia.org/wiki/Service-level_agreement) Data Governance (https://en.wikipedia.org/wiki/Data_governance) Podcast Episode (https://www.dataengineeringpodcast.com/nicola-askham-practical-data-governance-episode-428) PagerDuty (https://www.pagerduty.com/) OpsGenie (https://www.atlassian.com/software/opsgenie) Clickhouse (https://clickhouse.com/) Podcast Episode (https://www.dataengineeringpodcast.com/clickhouse-data-warehouse-episode-88/) dbt (https://www.getdbt.com/) Podcast Episode (https://www.dataengineeringpodcast.com/dbt-data-analytics-episode-81/) SQLMesh (https://sqlmesh.readthedocs.io/en/stable/) Podcast Episode (https://www.dataengineeringpodcast.com/sqlmesh-open-source-dataops-episode-380) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra 
(http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)

Stitching Together Enterprise Analytics With Microsoft Fabric

Data Engineering Podcast

Jun 23, 2024 53:22

Summary Data lakehouse architectures have been gaining significant adoption. To accelerate adoption in the enterprise Microsoft has created the Fabric platform, based on their OneLake architecture. In this episode Dipti Borkar shares her experiences working on the product team at Fabric and explains the various use cases for the Fabric service. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Your host is Tobias Macey and today I'm interviewing Dipti Borkar about her work on Microsoft Fabric and performing analytics on data withou Interview Introduction How did you get involved in the area of data management? Can you describe what Microsoft Fabric is and the story behind it? Data lakes in various forms have been gaining significant popularity as a unified interface to an organization's analytics. What are the motivating factors that you see for that trend? Microsoft has been investing heavily in open source in recent years, and the Fabric platform relies on several open components. What are the benefits of layering on top of existing technologies rather than building a fully custom solution? What are the elements of Fabric that were engineered specifically for the service? What are the most interesting/complicated integration challenges? 
How has your prior experience with Ahana and Presto informed your current work at Microsoft? AI plays a substantial role in the product. What are the benefits of embedding Copilot into the data engine? What are the challenges in terms of safety and reliability? What are the most interesting, innovative, or unexpected ways that you have seen the Fabric platform used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on data lakes generally, and Fabric specifically? When is Fabric the wrong choice? What do you have planned for the future of data lake analytics? Contact Info LinkedIn (https://www.linkedin.com/in/diptiborkar/) Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story. 
Links Microsoft Fabric (https://www.microsoft.com/microsoft-fabric) Ahana episode (https://www.dataengineeringpodcast.com/ahana-presto-cloud-data-lake-episode-217) DB2 Distributed (https://www.ibm.com/docs/en/db2/11.5?topic=managers-designing-distributed-databases) Spark (https://spark.apache.org/) Presto (https://prestodb.io/) Azure Data (https://azure.microsoft.com/en-us/products#analytics) MAD Landscape (https://mattturck.com/mad2024/) Podcast Episode (https://www.dataengineeringpodcast.com/mad-landscape-2023-data-infrastructure-episode-369) ML Podcast Episode (https://www.themachinelearningpodcast.com/mad-landscape-2023-ml-ai-episode-21) Tableau (https://www.tableau.com/) dbt (https://www.getdbt.com/) Medallion Architecture (https://dataengineering.wiki/Concepts/Medallion+Architecture) Microsoft Onelake (https://learn.microsoft.com/fabric/onelake/onelake-overview) ORC (https://orc.apache.org/) Parquet (https://parquet.incubator.apache.org) Avro (https://avro.apache.org/) Delta Lake (https://delta.io/) Iceberg (https://iceberg.apache.org/) Podcast Episode (https://www.dataengineeringpodcast.com/iceberg-with-ryan-blue-episode-52/) Hudi (https://hudi.apache.org/) Podcast Episode (https://www.dataengineeringpodcast.com/hudi-streaming-data-lake-episode-209) Hadoop (https://hadoop.apache.org/) PowerBI (https://www.microsoft.com/power-platform/products/power-bi) Podcast Episode (https://www.dataengineeringpodcast.com/power-bi-business-intelligence-episode-154) Velox (https://velox-lib.io/) Gluten (https://gluten.apache.org/) Apache XTable (https://xtable.apache.org/) GraphQL (https://graphql.org/) Formula 1 (https://www.formula1.com/) McLaren (https://www.mclaren.com/) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA 
(http://creativecommons.org/licenses/by-sa/3.0/)

Being Data Driven At Stripe With Trino And Iceberg

Data Engineering Podcast

Jun 16, 2024 53:19

Summary Stripe is a company that relies on data to power their products and business. To support that functionality they have invested in Trino and Iceberg for their analytical workloads. In this episode Kevin Liu shares some of the interesting features that they have built by combining those technologies, as well as the challenges that they face in supporting the myriad workloads that are thrown at this layer of their data platform. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Your host is Tobias Macey and today I'm interviewing Kevin Liu about his use of Trino and Iceberg for Stripe's data lakehouse Interview Introduction How did you get involved in the area of data management? Can you describe what role Trino and Iceberg play in Stripe's data architecture? What are the ways in which your job responsibilities intersect with Stripe's lakehouse infrastructure? What were the requirements and selection criteria that led to the selection of that combination of technologies? What are the other systems that feed into and rely on the Trino/Iceberg service? What kinds of questions are you answering with table metadata? What use case/team does that support? What is the comparative utility of the Iceberg REST catalog? What are the shortcomings of Trino and Iceberg? 
What are the most interesting, innovative, or unexpected ways that you have seen Iceberg/Trino used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Stripe's data infrastructure? When is a lakehouse on Trino/Iceberg the wrong choice? What do you have planned for the future of Trino and Iceberg at Stripe? Contact Info Substack (https://kevinjqliu.substack.com) LinkedIn (https://www.linkedin.com/in/kevinjqliu) Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story. 
Links Trino (https://trino.io/) Iceberg (https://iceberg.apache.org/) Stripe (https://stripe.com/) Spark (https://spark.apache.org/) Redshift (https://aws.amazon.com/redshift/) Hive Metastore (https://cwiki.apache.org/confluence/display/hive/design#Design-Metastore) Python Iceberg (https://py.iceberg.apache.org/) Python Iceberg REST Catalog (https://github.com/kevinjqliu/iceberg-rest-catalog) Trino Metadata Table (https://trino.io/docs/current/connector/iceberg.html#metadata-tables) Flink (https://flink.apache.org/) Podcast Episode (https://www.dataengineeringpodcast.com/apache-flink-with-fabian-hueske-episode-57) Tabular (https://tabular.io/) Podcast Episode (https://www.dataengineeringpodcast.com/tabular-iceberg-lakehouse-tables-episode-363) Delta Table (https://delta.io/) Podcast Episode (https://www.dataengineeringpodcast.com/delta-lake-data-lake-episode-85/) Databricks Unity Catalog (https://www.databricks.com/product/unity-catalog) Starburst (https://www.starburst.io/) AWS Athena (https://aws.amazon.com/athena/) Kevin Trinofest Presentation (https://trino.io/blog/2023/07/19/trino-fest-2023-stripe.html) Alluxio (https://www.alluxio.io/) Podcast Episode (https://www.dataengineeringpodcast.com/alluxio-distributed-storage-episode-70) Parquet (https://parquet.incubator.apache.org/) Hudi (https://hudi.apache.org/) Trino Project Tardigrade (https://trino.io/blog/2022/05/05/tardigrade-launch.html) Trino On Ice (https://www.starburst.io/blog/iceberg-table-partitioning/) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)

spark trusted doordash python comcast hive data driven stripe iceberg hug starburst flink redshift parquet trino hudi tabular apache iceberg freak fandango orchestra alluxio

X-Ray Vision For Your Flink Stream Processing With Datorios

Data Engineering Podcast

Jun 9, 2024 42:22

Summary Streaming data processing enables new categories of data products and analytics. Unfortunately, reasoning about stream processing engines is complex and lacks sufficient tooling. To address this shortcoming Datorios created an observability platform for Flink that brings visibility to the internals of this popular stream processing system. In this episode Ronen Korman and Stav Elkayam discuss how the increased understanding provided by purpose built observability improves the usefulness of Flink. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management This episode is supported by Code Comments, an original podcast from Red Hat. As someone who listens to the Data Engineering Podcast, you know that the road from tool selection to production readiness is anything but smooth or straight. In Code Comments, host Jamie Parker, Red Hatter and experienced engineer, shares the journey of technologists from across the industry and their hard-won lessons in implementing new technologies. I listened to the recent episode "Transforming Your Database" and appreciated the valuable advice on how to approach the selection and integration of new databases in applications and the impact on team dynamics. There are 3 seasons of great episodes and new ones landing everywhere you listen to podcasts. Search for "Code Comments" in your podcast player or go to dataengineeringpodcast.com/codecomments (https://www.dataengineeringpodcast.com/codecomments) today to subscribe. My thanks to the team at Code Comments for their support. Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. 
Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Your host is Tobias Macey and today I'm interviewing Ronen Korman and Stav Elkayam about pulling back the curtain on your real-time data streams by bringing intuitive observability to Flink streams Interview Introduction How did you get involved in the area of data management? Can you describe what Datorios is and the story behind it? Data observability has been gaining adoption for a number of years now, with a large focus on data warehouses. What are some of the unique challenges posed by Flink? How much of the complexity is due to the nature of streaming data vs. the architectural realities of Flink? How has the lack of visibility into the flow of data in Flink impacted the ways that teams think about where/when/how to apply it? How have the requirements of generative AI shifted the demand for streaming data systems? What role does Flink play in the architecture of generative AI systems? Can you describe how Datorios is implemented? How has the design and goals of Datorios changed since you first started working on it? How much of the Datorios architecture and functionality is specific to Flink and how are you thinking about its potential application to other streaming platforms? Can you describe how Datorios is used in a day-to-day workflow for someone building streaming applications on Flink? What are the most interesting, innovative, or unexpected ways that you have seen Datorios used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Datorios? When is Datorios the wrong choice? What do you have planned for the future of Datorios? 
Contact Info Ronen LinkedIn (https://www.linkedin.com/in/ronen-korman/) Stav LinkedIn (https://www.linkedin.com/in/stav-elkayam-118a2795/?originalSubdomain=il) Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story. Links Datorios (https://datorios.com/) Apache Flink (https://flink.apache.org/) Podcast Episode (https://www.dataengineeringpodcast.com/apache-flink-with-fabian-hueske-episode-57) ChatGPT-4o (https://openai.com/index/hello-gpt-4o/) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)

Practical First Steps In Data Governance For Long Term Success

Data Engineering Podcast

Jun 2, 2024 60:40

Summary Modern businesses aspire to be data driven, and technologists enjoy working through the challenge of building data systems to support that goal. Data governance is the binding force between these two parts of the organization. Nicola Askham found her way into data governance by accident, and stayed because of the benefit that she was able to provide by serving as a bridge between the technology and business. In this episode she shares the practical steps to implementing a data governance practice in your organization, and the pitfalls to avoid. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. This episode is supported by Code Comments, an original podcast from Red Hat. As someone who listens to the Data Engineering Podcast, you know that the road from tool selection to production readiness is anything but smooth or straight. In Code Comments, host Jamie Parker, Red Hatter and experienced engineer, shares the journey of technologists from across the industry and their hard-won lessons in implementing new technologies. I listened to the recent episode "Transforming Your Database" and appreciated the valuable advice on how to approach the selection and integration of new databases in applications and the impact on team dynamics. 
There are 3 seasons of great episodes and new ones landing everywhere you listen to podcasts. Search for "Code Comments" in your podcast player or go to dataengineeringpodcast.com/codecomments (https://www.dataengineeringpodcast.com/codecomments) today to subscribe. My thanks to the team at Code Comments for their support. Your host is Tobias Macey and today I'm interviewing Nicola Askham about the practical steps of building out a data governance practice in your organization Interview Introduction How did you get involved in the area of data management? Can you start by giving an overview of the scope and boundaries of data governance in an organization? At what point does a lack of an explicit governance policy become a liability? What are some of the misconceptions that you encounter about data governance? What impact has the evolution of data technologies had on the implementation of governance practices? (e.g. number/scale of systems, types of data, AI) Data governance can often become an exercise in boiling the ocean. What are the concrete first steps that will increase the success rate of a governance practice? Once a data governance project is underway, what are some of the common roadblocks that might derail progress? What are the net benefits to the data team and the organization when a data governance practice is established, active, and healthy? What are the most interesting, innovative, or unexpected ways that you have seen data governance applied? What are the most interesting, unexpected, or challenging lessons that you have learned while working on data governance/training/coaching? What are some of the pitfalls in data governance? What are some of the future trends in data governance that you are excited by? Are there any trends that concern you? 
Contact Info Website (https://www.nicolaaskham.com/) LinkedIn (https://www.linkedin.com/in/nicolaaskham/) Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story. Links Website (https://www.nicolaaskham.com/) Master Data Management (https://en.wikipedia.org/wiki/Master_data_management) Cartesian Join (https://www.geeksforgeeks.org/cartesian-join/) DAMA == Data Management Community (https://www.dama.org/) DMBOK == Data Management Body of Knowledge (https://www.dama.org/cpages/body-of-knowledge) DAMA DMBOK Wheel (https://www.dama.org/cpages/dmbok-2-wheel-images) CDMP (Certified Data Management Professional) Exam (https://www.dama.org/cpages/cdmp-information) Data Mesh (https://www.datamesh-architecture.com/) Data Governance First Steps Checklist (https://www.nicolaaskham.com/free-data-governance-checklist) The Never Normal (https://www.linkedin.com/newsletters/the-never-normal-6862024032934477824/) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)

Dremio Tech Evangelist Andrew Madson

How AI Happens

May 30, 2024 29:19

Andrew shares how generative AI is used by academic institutions, why employers and educators need to curb their fear of AI, what we need to consider for using AI responsibly, and the ins and outs of Andrew's podcast, Insights x Design. Key Points From This Episode: Andrew Madson explains what a tech evangelist is and what his role at Dremio entails. The ins and outs of Dremio. Understanding the pain points that Andrew wanted to alleviate by joining Dremio. How Andrew became a tech evangelist, and why he values this role. Why all tech roles now require one to upskill and branch out into other areas of expertise. The problems that Andrew most commonly faces at work, and how he overcomes them. How Dremio uses generative AI, and how the technology is used in academia. Why employers and educators need to do more to encourage the use of AI. The provenance of training data, and other considerations for the responsible use of AI. Learning more about Andrew's new podcast, Insights x Design. Quotes: “Once I learned about lakehouses and Apache Iceberg and how you can just do all of your work on top of the data lake itself, it really made my life a lot easier with doing real-time analytics.” - @insightsxdesign [0:04:24] “Data analysts have always been expected to be technical, but now, given the rise of the amount of data that we're dealing with and the limitations of data engineering teams and their capacity, data analysts are expected to do a lot more data engineering.” - @insightsxdesign [0:07:49] “Keeping it simple and short is ideal when dealing with AI.” - @insightsxdesign [0:12:58] “The purpose of higher education isn't to get a piece of paper, it's to learn something and to gain new skills.” - @insightsxdesign [0:17:35] Links Mentioned in Today's Episode: Andrew Madson, Andrew Madson on LinkedIn, Andrew Madson on X, Andrew Madson on Instagram, Dremio, Insights x Design, Apache Iceberg, ChatGPT, Perplexity AI, Gemini, Anaconda, Peter Wang on LinkedIn, How AI Happens, Sama


Data Migration Strategies For Large Scale Systems

Data Engineering Podcast

May 27, 2024 60:00

Summary Any software system that survives long enough will require some form of migration or evolution. When that system is responsible for the data layer the process becomes more challenging. Sriram Panyam has been involved in several projects that required migration of large volumes of data in high traffic environments. In this episode he shares some of the valuable lessons that he learned about how to make those projects successful. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. This episode is supported by Code Comments, an original podcast from Red Hat. As someone who listens to the Data Engineering Podcast, you know that the road from tool selection to production readiness is anything but smooth or straight. In Code Comments, host Jamie Parker, Red Hatter and experienced engineer, shares the journey of technologists from across the industry and their hard-won lessons in implementing new technologies. I listened to the recent episode "Transforming Your Database" and appreciated the valuable advice on how to approach the selection and integration of new databases in applications and the impact on team dynamics. There are 3 seasons of great episodes and new ones landing everywhere you listen to podcasts. 
Search for "Code Comments" in your podcast player or go to dataengineeringpodcast.com/codecomments (https://www.dataengineeringpodcast.com/codecomments) today to subscribe. My thanks to the team at Code Comments for their support. Your host is Tobias Macey and today I'm interviewing Sriram Panyam about his experiences conducting large scale data migrations and the useful strategies that he learned in the process Interview Introduction How did you get involved in the area of data management? Can you start by sharing some of your experiences with data migration projects? As you have gone through successive migration projects, how has that influenced the ways that you think about architecting data systems? How would you categorize the different types and motivations of migrations? How does the motivation for a migration influence the ways that you plan for and execute that work? Can you talk us through one or two specific projects that you have taken part in? Part 1: The Triggers Section 1: Technical Limitations triggering Data Migration Scaling bottlenecks: Performance issues with databases, storage, or network infrastructure Legacy compatibility: Difficulties integrating with modern tools and cloud platforms System upgrades: The need to migrate data during major software changes (e.g., SQL Server version upgrade) Section 2: Types of Migrations for Infrastructure Focus Storage migration: Moving data between systems (HDD to SSD, SAN to NAS, etc.) Data center migration: Physical relocation or consolidation of data centers Virtualization migration: Moving from physical servers to virtual machines (or vice versa) Section 3: Technical Decisions Driving Data Migrations End-of-life support: Forced migration when older software or hardware is sunsetted Security and compliance: Adopting new platforms with better security postures Cost Optimization: Potential savings of cloud vs. 
on-premise data centers Part 2: Challenges (and Anxieties) Section 1: Technical Challenges Data transformation challenges: Schema changes, complex data mappings Network bandwidth and latency: Transferring large datasets efficiently Performance testing and load balancing: Ensuring new systems can handle the workload Live data consistency: Maintaining data integrity while updates occur in the source system Minimizing Lag: Techniques to reduce delays in replicating changes to the new system Change data capture: Identifying and tracking changes to the source system during migration Section 2: Operational Challenges Minimizing downtime: Strategies for service continuity during migration Change management and rollback plans: Dealing with unexpected issues Technical skills and resources: In-house expertise/data teams/external help Section 3: Security & Compliance Challenges Data encryption and protection: Methods for both in-transit and at-rest data Meeting audit requirements: Documenting data lineage & the chain of custody Managing access controls: Adjusting identity and role-based access to the new systems Part 3: Patterns Section 1: Infrastructure Migration Strategies Lift and shift: Migrating as-is vs. modernization and re-architecting during the move Phased vs. big bang approaches: Tradeoffs in risk vs. disruption Tools and automation: Using specialized software to streamline the process Dual writes: Managing updates to both old and new systems for a time Change data capture (CDC) methods: Log-based vs. 
trigger-based approaches for tracking changes Data validation & reconciliation: Ensuring consistency between source and target Section 2: Maintaining Performance and Reliability Disaster recovery planning: Failover mechanisms for the new environment Monitoring and alerting: Proactively identifying and addressing issues Capacity planning and forecasting growth to scale the new infrastructure Section 3: Data Consistency and Replication Replication tools - strategies and specialized tooling Data synchronization techniques, e.g. pros and cons of different methods (incremental vs. full) Testing/Verification Strategies for validating data correctness in a live environment Implication of large scale systems/environments Comparison of interesting strategies: DBLog, Debezium, Databus, GoldenGate etc. What are the most interesting, innovative, or unexpected approaches to data migrations that you have seen or participated in? What are the most interesting, unexpected, or challenging lessons that you have learned while working on data migrations? When is a migration the wrong choice? What are the characteristics or features of data technologies and the overall ecosystem that can reduce the burden of data migration in the future? Contact Info LinkedIn (https://www.linkedin.com/in/srirampanyam/) Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. 
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story. Links DagKnows (https://dagknows.com) Google Cloud Dataflow (https://cloud.google.com/dataflow) Seinfeld Risk Management (https://www.youtube.com/watch) ACL == Access Control List (https://en.wikipedia.org/wiki/Access-control_list) LinkedIn Databus - Change Data Capture (https://github.com/linkedin/databus) Espresso Storage (https://engineering.linkedin.com/data-replication/open-sourcing-databus-linkedins-low-latency-change-data-capture-system) HDFS (https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html) Kafka (https://kafka.apache.org/) Postgres Replication Slots (https://www.postgresql.org/docs/current/logical-replication.html) Queueing Theory (https://en.wikipedia.org/wiki/Queueing_theory) Apache Beam (https://beam.apache.org/) Debezium (https://debezium.io/) Airbyte (https://airbyte.com/) Fivetran (fivetran.com) Designing Data Intensive Applications (https://amzn.to/4aAztR1) by Martin Kleppmann (https://martin.kleppmann.com/) (affiliate link) Vector Databases (https://en.wikipedia.org/wiki/Vector_database) Pinecone (https://www.pinecone.io/) Weaviate (https://www.weveate.io/) LAMP Stack (https://en.wikipedia.org/wiki/LAMP_(software_bundle)) Netflix DBLog (https://arxiv.org/abs/2010.12597) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)

Zenlytic Is Building You A Better Coworker With AI Agents

Data Engineering Podcast

May 19, 2024 54:19

Summary The purpose of business intelligence systems is to allow anyone in the business to access and decode data to help them make informed decisions. Unfortunately this often turns into an exercise in frustration for everyone involved due to complex workflows and hard-to-understand dashboards. The team at Zenlytic have leaned on the promise of large language models to build an AI agent that lets you converse with your data. In this episode they share their journey through the fast-moving landscape of generative AI and unpack the difference between an AI chatbot and an AI agent. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management This episode is supported by Code Comments, an original podcast from Red Hat. As someone who listens to the Data Engineering Podcast, you know that the road from tool selection to production readiness is anything but smooth or straight. In Code Comments, host Jamie Parker, Red Hatter and experienced engineer, shares the journey of technologists from across the industry and their hard-won lessons in implementing new technologies. I listened to the recent episode "Transforming Your Database" and appreciated the valuable advice on how to approach the selection and integration of new databases in applications and the impact on team dynamics. There are 3 seasons of great episodes and new ones landing everywhere you listen to podcasts. Search for "Code Comments" in your podcast player or go to dataengineeringpodcast.com/codecomments (https://www.dataengineeringpodcast.com/codecomments) today to subscribe. My thanks to the team at Code Comments for their support. Data lakes are notoriously complex. 
For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Your host is Tobias Macey and today I'm interviewing Ryan Janssen and Paul Blankley about their experiences building AI powered agents for interacting with your data Interview Introduction How did you get involved in data? In AI? Can you describe what Zenlytic is and the role that AI is playing in your platform? What have been the key stages in your AI journey? What are some of the dead ends that you ran into along the path to where you are today? What are some of the persistent challenges that you are facing? So tell us more about data agents. Firstly, what are data agents and why do you think they're important? How are data agents different from chatbots? Are data agents harder to build? How do you make them work in production? What other technical architectures have you had to develop to support the use of AI in Zenlytic? How have you approached the work of customer education as you introduce this functionality? What are some of the most interesting or erroneous misconceptions that you have heard about what the AI can and can't do? How have you balanced accuracy/trustworthiness with user experience and flexibility in the conversational AI, given the potential for these models to create erroneous responses? What are the most interesting, innovative, or unexpected ways that you have seen your AI agent used? 
What are the most interesting, unexpected, or challenging lessons that you have learned while working on building an AI agent for business intelligence? When is an AI agent the wrong choice? What do you have planned for the future of AI in the Zenlytic product? Contact Info Ryan LinkedIn (https://www.linkedin.com/in/janssenryan) Paul LinkedIn (https://www.linkedin.com/in/paulblankley/) Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story. 
Links Zenlytic (https://www.zenlytic.com/) Podcast Episode (https://www.dataengineeringpodcast.com/zenlytic-self-serve-business-intelligence-episode-371) Attention is all you need (https://arxiv.org/abs/1706.03762) Transformers (https://en.wikipedia.org/wiki/Transformer_(deep_learning_architecture)) BERT (https://en.wikipedia.org/wiki/BERT_(language_model)) The Bitter Lesson (http://www.incompleteideas.net/IncIdeas/BitterLesson.html) Richard Sutton PID Loops (https://en.wikipedia.org/wiki/Proportional%E2%80%93integral%E2%80%93derivative_controller) AutoGPT (https://github.com/Significant-Gravitas/AutoGPT) Devin.ai (https://www.cognition.ai/introducing-devin) Google Gemini (https://gemini.google.com/) Anthropic Claude (https://www.anthropic.com/claude) OpenAI Code Interpreter (https://platform.openai.com/docs/assistants/tools/code-interpreter) Edward Tufte (https://www.edwardtufte.com/tufte/books_vdqi) Looker ActionHub (https://developers.looker.com/actions/overview/) OAuth (https://oauth.net/2/) GitHub Copilot (https://github.com/features/copilot) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)

Release Management For Data Platform Services And Logic

Data Engineering Podcast

May 12, 2024 20:08

Summary Building a data platform is a substantial engineering endeavor. Once it is running, the next challenge is figuring out how to address release management for all of the different component parts. The services and systems need to be kept up to date, but so does the code that controls their behavior. In this episode your host Tobias Macey reflects on his current challenges in this area and some of the factors that contribute to the complexity of the problem. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management This episode is supported by Code Comments, an original podcast from Red Hat. As someone who listens to the Data Engineering Podcast, you know that the road from tool selection to production readiness is anything but smooth or straight. In Code Comments, host Jamie Parker, Red Hatter and experienced engineer, shares the journey of technologists from across the industry and their hard-won lessons in implementing new technologies. I listened to the recent episode "Transforming Your Database" and appreciated the valuable advice on how to approach the selection and integration of new databases in applications and the impact on team dynamics. There are 3 seasons of great episodes and new ones landing everywhere you listen to podcasts. Search for "Code Comments" in your podcast player or go to dataengineeringpodcast.com/codecomments (https://www.dataengineeringpodcast.com/codecomments) today to subscribe. My thanks to the team at Code Comments for their support. Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? 
Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Your host is Tobias Macey and today I want to talk about my experiences managing the QA and release management process of my data platform Interview Introduction As a team, our overall goal is to ensure that the production environment for our data platform is highly stable and reliable. This is the foundational element of establishing and maintaining trust with the consumers of our data. In order to support this effort, we need to ensure that only changes that have been tested and verified are promoted to production. Our current challenge is one that plagues all data teams. We want to have an environment that mirrors our production environment that is available for testing, but it's not feasible to maintain a complete duplicate of all of the production data. Compounding that challenge is the fact that each of the components of our data platform interacts with data in slightly different ways and needs different processes for ensuring that changes are being promoted safely. Contact Info LinkedIn () Website (https://www.dataengineeringpodcast.com) Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story. 
Links Data Platforms and Leaky Abstractions Episode (https://www.dataengineeringpodcast.com/abstractions-and-technical-debt-episode-374) Building A Data Platform From Scratch (https://www.dataengineeringpodcast.com/designing-a-lakehouse-from-scratch-episode-354) Airbyte (https://airbyte.com/) Podcast Episode (https://www.dataengineeringpodcast.com/airbyte-open-source-data-integration-episode-173/) Trino (https://trino.io/) dbt (https://www.getdbt.com/) Starburst Galaxy (https://www.starburst.io/platform/starburst-galaxy/) Superset (https://superset.apache.org/) Dagster (https://dagster.io/) LakeFS (https://lakefs.io/) Podcast Episode (https://www.dataengineeringpodcast.com/lakefs-data-lake-versioning-episode-157) Nessie (https://projectnessie.org/) Podcast Episode (https://www.dataengineeringpodcast.com/nessie-data-lakehouse-data-versioning-episode-416) Iceberg (https://iceberg.apache.org/) Snowflake (https://www.snowflake.com/en/) LocalStack (https://www.localstack.cloud/) DSL == Domain Specific Language (https://en.wikipedia.org/wiki/Domain-specific_language) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)

Barking Up The Wrong GPTree: Building Better AI With A Cognitive Approach

Data Engineering Podcast

May 5, 2024 54:16

Summary Artificial intelligence has dominated the headlines for several months due to the successes of large language models. This has prompted numerous debates about the possibility of, and timeline for, artificial general intelligence (AGI). Peter Voss has dedicated decades of his life to the pursuit of truly intelligent software through the approach of cognitive AI. In this episode he explains his approach to building AI in a more human-like fashion and the emphasis on learning rather than statistical prediction. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster (https://www.dataengineeringpodcast.com/dagster) today to get started. Your first 30 days are free! Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? 
Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Your host is Tobias Macey and today I'm interviewing Peter Voss about what is involved in making your AI applications more "human" Interview Introduction How did you get involved in machine learning? Can you start by unpacking the idea of "human-like" AI? How does that contrast with the conception of "AGI"? The applications and limitations of GPT/LLM models have been dominating the popular conversation around AI. How do you see that impacting the overall ecosystem of ML/AI applications and investment? The fundamental/foundational challenge of every AI use case is sourcing appropriate data. What are the strategies that you have found useful to acquire, evaluate, and prepare data at an appropriate scale to build high quality models? What are the opportunities and limitations of causal modeling techniques for generalized AI models? As AI systems gain more sophistication there is a challenge with establishing and maintaining trust. What are the risks involved in deploying more human-level AI systems and monitoring their reliability? What are the practical/architectural methods necessary to build more cognitive AI systems? How would you characterize the ecosystem of tools/frameworks available for creating, evolving, and maintaining these applications? What are the most interesting, innovative, or unexpected ways that you have seen cognitive AI applied? What are the most interesting, unexpected, or challenging lessons that you have learned while working on designing/developing cognitive AI systems? When is cognitive AI the wrong choice? What do you have planned for the future of cognitive AI applications at Aigo? 
Contact Info LinkedIn (https://www.linkedin.com/in/vosspeter/) Website (http://optimal.org/voss.html) Parting Question From your perspective, what is the biggest barrier to adoption of machine learning today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story. Links Aigo.ai (https://aigo.ai/) Artificial General Intelligence (https://aigo.ai/what-is-real-agi/) Cognitive AI (https://aigo.ai/cognitive-ai/) Knowledge Graph (https://en.wikipedia.org/wiki/Knowledge_graph) Causal Modeling (https://en.wikipedia.org/wiki/Causal_model) Bayesian Statistics (https://en.wikipedia.org/wiki/Bayesian_statistics) Thinking Fast & Slow (https://amzn.to/3UJKsmK) by Daniel Kahneman (affiliate link) Agent-Based Modeling (https://en.wikipedia.org/wiki/Agent-based_model) Reinforcement Learning (https://en.wikipedia.org/wiki/Reinforcement_learning) DARPA 3 Waves of AI (https://www.darpa.mil/about-us/darpa-perspective-on-ai) presentation Why Don't We Have AGI Yet? (https://arxiv.org/abs/2308.03598) whitepaper Concepts Is All You Need (https://arxiv.org/abs/2309.01622) Whitepaper Helen Keller (https://en.wikipedia.org/wiki/Helen_Keller) Stephen Hawking (https://en.wikipedia.org/wiki/Stephen_Hawking) The intro and outro music is from Hitman's Lovesong feat. 
Paola Graziano (https://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Tales_Of_A_Dead_Fish/Hitmans_Lovesong/) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/)/CC BY-SA 3.0 (https://creativecommons.org/licenses/by-sa/3.0/)

Build Your Second Brain One Piece At A Time

Data Engineering Podcast

Apr 28, 2024 50:10

Summary Generative AI promises to accelerate the productivity of human collaborators. Currently the primary way of working with these tools is through a conversational prompt, which is often cumbersome and unwieldy. In order to simplify the integration of AI capabilities into developer workflows Tsavo Knott helped create Pieces, a powerful collection of tools that complements the tools that developers already use. In this episode he explains the data collection and preparation process, the collection of model types and sizes that work together to power the experience, and how to incorporate it into your workflow to act as a second brain. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster (https://www.dataengineeringpodcast.com/dagster) today to get started. Your first 30 days are free! Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. 
And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Your host is Tobias Macey and today I'm interviewing Tsavo Knott about Pieces, a personal AI toolkit to improve the efficiency of developers Interview Introduction How did you get involved in machine learning? Can you describe what Pieces is and the story behind it? The past few months have seen an endless series of personalized AI tools launched. What are the features and focus of Pieces that might encourage someone to use it over the alternatives? model selections architecture of Pieces application local vs. hybrid vs. online models model update/delivery process data preparation/serving for models in context of Pieces app application of AI to developer workflows types of workflows that people are building with pieces What are the most interesting, innovative, or unexpected ways that you have seen Pieces used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Pieces? When is Pieces the wrong choice? What do you have planned for the future of Pieces? Contact Info LinkedIn (https://www.linkedin.com/in/tsavoknott/) Parting Question From your perspective, what is the biggest barrier to adoption of machine learning today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. 
Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story. Links Pieces (https://pieces.app/) NPU == Neural Processing Unit (https://en.wikipedia.org/wiki/AI_accelerator) Tensor Chip (https://en.wikipedia.org/wiki/Google_Tensor) LoRA == Low Rank Adaptation (https://github.com/microsoft/LoRA) Generative Adversarial Networks (https://en.wikipedia.org/wiki/Generative_adversarial_network) Mistral (https://mistral.ai/) Emacs (https://www.gnu.org/software/emacs/) Vim (https://www.vim.org/) NeoVim (https://neovim.io/) Dart (https://dart.dev/) Flutter (https://flutter.dev/) TypeScript (https://www.typescriptlang.org/) Lua (https://www.lua.org/) Retrieval Augmented Generation (https://github.blog/2024-04-04-what-is-retrieval-augmented-generation-and-what-does-it-do-for-generative-ai/) ONNX (https://onnx.ai/) LSTM == Long Short-Term Memory (https://en.wikipedia.org/wiki/Long_short-term_memory) Llama 2 (https://llama.meta.com/llama2/) GitHub Copilot (https://github.com/features/copilot) Tabnine (https://www.tabnine.com/) Podcast Episode (https://www.themachinelearningpodcast.com/tabnine-generative-ai-developer-assistant-episode-24) The intro and outro music is from Hitman's Lovesong feat. Paola Graziano (https://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Tales_Of_A_Dead_Fish/Hitmans_Lovesong/) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/)/CC BY-SA 3.0 (https://creativecommons.org/licenses/by-sa/3.0/)

Making Email Better With AI At Shortwave

Data Engineering Podcast

Apr 21, 2024 53:43

Summary Generative AI has rapidly transformed everything in the technology sector. When Andrew Lee started work on Shortwave he was focused on making email more productive. When AI started gaining adoption he realized that he had even more potential for a transformative experience. In this episode he shares the technical challenges that he and his team have overcome in integrating AI into their product, as well as the benefits and features that it provides to their customers. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster (https://www.dataengineeringpodcast.com/dagster) today to get started. Your first 30 days are free! Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? 
Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Your host is Tobias Macey and today I'm interviewing Andrew Lee about his work on Shortwave, an AI powered email client Interview Introduction How did you get involved in the area of data management? Can you describe what Shortwave is and the story behind it? What is the core problem that you are addressing with Shortwave? Email has been a central part of communication and business productivity for decades now. What are the overall themes that continue to be problematic? What are the strengths that email maintains as a protocol and ecosystem? From a product perspective, what are the data challenges that are posed by email? Can you describe how you have architected the Shortwave platform? How have the design and goals of the product changed since you started it? What are the ways that the advent and evolution of language models have influenced your product roadmap? How do you manage the personalization of the AI functionality in your system for each user/team? For users and teams who are using Shortwave, how does it change their workflow and communication patterns? Can you describe how I would use Shortwave for managing the workflow of evaluating, planning, and promoting my podcast episodes? What are the most interesting, innovative, or unexpected ways that you have seen Shortwave used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Shortwave? When is Shortwave the wrong choice? What do you have planned for the future of Shortwave? Contact Info LinkedIn (https://www.linkedin.com/in/startupandrew/) Blog (https://startupandrew.com/) Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! 
Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story. Links Shortwave (https://www.shortwave.com/) Firebase (https://firebase.google.com/) Google Inbox (https://en.wikipedia.org/wiki/Inbox_by_Gmail) Hey (https://www.hey.com/) Ezra Klein Hey Article (https://www.nytimes.com/2024/04/07/opinion/gmail-email-digital-shame.html) Superhuman (https://superhuman.com/) Pinecone (https://www.pinecone.io/) Podcast Episode (https://www.dataengineeringpodcast.com/pinecone-vector-database-similarity-search-episode-189/) Elastic (https://www.elastic.co/) Hybrid Search (https://weaviate.io/blog/hybrid-search-explained) Semantic Search (https://en.wikipedia.org/wiki/Semantic_search) Mistral (https://mistral.ai/) GPT 3.5 (https://platform.openai.com/docs/models/gpt-3-5-turbo) IMAP (https://en.wikipedia.org/wiki/Internet_Message_Access_Protocol) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)

Designing A Non-Relational Database Engine

Data Engineering Podcast

Apr 14, 2024 76:01

Summary Databases come in a variety of formats for different use cases. The default association with the term "database" is relational engines, but non-relational engines are also used quite widely. In this episode Oren Eini, CEO and creator of RavenDB, explores the nuances of relational vs. non-relational engines, and the strategies for designing a non-relational database. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management This episode is brought to you by Datafold – a testing automation platform for data engineers that prevents data quality issues from entering every part of your data workflow, from migration to dbt deployment. Datafold has recently launched data replication testing, providing ongoing validation for source-to-target replication. Leverage Datafold's fast cross-database data diffing and Monitoring to test your replication pipelines automatically and continuously. Validate consistency between source and target at any scale, and receive alerts about any discrepancies. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold (https://www.dataengineeringpodcast.com/datafold). Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster (https://www.dataengineeringpodcast.com/dagster) today to get started. Your first 30 days are free! Data lakes are notoriously complex. 
For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Your host is Tobias Macey and today I'm interviewing Oren Eini about the work of designing and building a NoSQL database engine Interview Introduction How did you get involved in the area of data management? Can you describe what constitutes a NoSQL database? How have the requirements and applications of NoSQL engines changed since they first became popular ~15 years ago? What are the factors that convince teams to use a NoSQL vs. SQL database? NoSQL is a generalized term that encompasses a number of different data models. How does the underlying representation (e.g. document, K/V, graph) change that calculus? How have the evolution in data formats (e.g. N-dimensional vectors, point clouds, etc.) changed the landscape for NoSQL engines? When designing and building a database, what are the initial set of questions that need to be answered? How many "core capabilities" can you reasonably design around before they conflict with each other? How have you approached the evolution of RavenDB as you add new capabilities and mature the project? 
What are some of the early decisions that had to be unwound to enable new capabilities? If you were to start from scratch today, what database would you build? What are the most interesting, innovative, or unexpected ways that you have seen RavenDB/NoSQL databases used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on RavenDB? When is a NoSQL database/RavenDB the wrong choice? What do you have planned for the future of RavenDB? Contact Info Blog (https://ayende.com/blog/) LinkedIn (https://www.linkedin.com/in/ravendb/?originalSubdomain=il) Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story. 
Links RavenDB (https://ravendb.net/) RSS (https://en.wikipedia.org/wiki/RSS) Object Relational Mapper (ORM) (https://en.wikipedia.org/wiki/Object%E2%80%93relational_mapping) Relational Database (https://en.wikipedia.org/wiki/Relational_database) NoSQL (https://en.wikipedia.org/wiki/NoSQL) CouchDB (https://couchdb.apache.org/) Navigational Database (https://en.wikipedia.org/wiki/Navigational_database) MongoDB (https://www.mongodb.com/) Redis (https://redis.io/) Neo4J (https://neo4j.com/) Cassandra (https://cassandra.apache.org/_/index.html) Column-Family (https://en.wikipedia.org/wiki/Column_family) SQLite (https://www.sqlite.org/) LevelDB (https://github.com/google/leveldb) Firebird DB (https://firebirdsql.org/) fsync (https://man7.org/linux/man-pages/man2/fsync.2.html) Esent DB? (https://learn.microsoft.com/en-us/windows/win32/extensible-storage-engine/extensible-storage-engine-managed-reference) KNN == K-Nearest Neighbors (https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm) RocksDB (https://rocksdb.org/) C# Language (https://en.wikipedia.org/wiki/C_Sharp_(programming_language)) ASP.NET (https://en.wikipedia.org/wiki/ASP.NET) QUIC (https://en.wikipedia.org/wiki/QUIC) Dynamo Paper (https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf) Database Internals (https://amzn.to/49A5wjF) book (affiliate link) Designing Data Intensive Applications (https://amzn.to/3JgCZFh) book (affiliate link) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)

Establish A Single Source Of Truth For Your Data Consumers With A Semantic Layer

Data Engineering Podcast

Apr 7, 2024 56:23

Summary Maintaining a single source of truth for your data is the biggest challenge in data engineering. Different roles and tasks in the business need their own ways to access and analyze the data in the organization. In order to enable this use case, while maintaining a single point of access, the semantic layer has evolved as a technological solution to the problem. In this episode Artyom Keydunov, creator of Cube, discusses the evolution and applications of the semantic layer as a component of your data platform, and how Cube provides speed and cost optimization for your data consumers. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management This episode is brought to you by Datafold – a testing automation platform for data engineers that prevents data quality issues from entering every part of your data workflow, from migration to dbt deployment. Datafold has recently launched data replication testing, providing ongoing validation for source-to-target replication. Leverage Datafold's fast cross-database data diffing and Monitoring to test your replication pipelines automatically and continuously. Validate consistency between source and target at any scale, and receive alerts about any discrepancies. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold (https://www.dataengineeringpodcast.com/datafold). Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. 
Go to dataengineeringpodcast.com/dagster (https://www.dataengineeringpodcast.com/dagster) today to get started. Your first 30 days are free! Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Your host is Tobias Macey and today I'm interviewing Artyom Keydunov about the role of the semantic layer in your data platform Interview Introduction How did you get involved in the area of data management? Can you start by outlining the technical elements of what it means to have a "semantic layer"? In the past couple of years there was a rapid hype cycle around the "metrics layer" and "headless BI", which has largely faded. Can you give your assessment of the current state of the industry around the adoption/implementation of these concepts? What are the benefits of having a discrete service that offers the business metrics/semantic mappings as opposed to implementing those concepts as part of a more general system? (e.g. dbt, BI, warehouse marts, etc.) At what point does it become necessary/beneficial for a team to adopt such a service? 
What are the challenges involved in retrofitting a semantic layer into a production data system? evolution of requirements/usage patterns technical complexities/performance and cost optimization What are the most interesting, innovative, or unexpected ways that you have seen Cube used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Cube? When is Cube/a semantic layer the wrong choice? What do you have planned for the future of Cube? Contact Info LinkedIn (https://www.linkedin.com/in/keydunov/) keydunov (https://github.com/keydunov) on GitHub Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story. 
Links Cube (https://cube.dev/) Semantic Layer (https://en.wikipedia.org/wiki/Semantic_layer) Business Objects (https://en.wikipedia.org/wiki/BusinessObjects) Tableau (https://www.tableau.com/) Looker (https://cloud.google.com/looker/?hl=en) Podcast Episode (https://www.dataengineeringpodcast.com/looker-with-daniel-mintz-episode-55/) Mode (https://mode.com/) Thoughtspot (https://www.thoughtspot.com/) LightDash (https://www.lightdash.com/) Podcast Episode (https://www.dataengineeringpodcast.com/lightdash-exploratory-business-intelligence-episode-232/) Embedded Analytics (https://en.wikipedia.org/wiki/Embedded_analytics) Dimensional Modeling (https://en.wikipedia.org/wiki/Dimensional_modeling) Clickhouse (https://clickhouse.com/) Podcast Episode (https://www.dataengineeringpodcast.com/clickhouse-data-warehouse-episode-88/) Druid (https://druid.apache.org/) BigQuery (https://cloud.google.com/bigquery?hl=en) Starburst (https://www.starburst.io/) Pinot (https://pinot.apache.org/) Snowflake (https://www.snowflake.com/en/) Podcast Episode (https://www.dataengineeringpodcast.com/snowflakedb-cloud-data-warehouse-episode-110/) Arrow Datafusion (https://arrow.apache.org/datafusion/) Metabase (https://www.metabase.com/) Podcast Episode (https://www.dataengineeringpodcast.com/metabase-with-sameer-al-sakran-episode-29) Superset (https://superset.apache.org/) Alation (https://www.alation.com/) Collibra (https://www.collibra.com/) Podcast Episode (https://www.dataengineeringpodcast.com/collibra-enterprise-data-governance-episode-188) Atlan (https://atlan.com/) Podcast Episode (https://www.dataengineeringpodcast.com/atlan-data-team-collaboration-episode-179) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)

Adding Anomaly Detection And Observability To Your dbt Projects Is Elementary

Data Engineering Podcast

Mar 31, 2024 50:44

Summary Working with data is a complicated process, with numerous chances for something to go wrong. Identifying and accounting for those errors is a critical piece of building trust in the organization that your data is accurate and up to date. While there are numerous products available to provide that visibility, they all have different technologies and workflows that they focus on. To bring observability to dbt projects the team at Elementary embedded themselves into the workflow. In this episode Maayan Salom explores the approach that she has taken to bring observability, enhanced testing capabilities, and anomaly detection into every step of the dbt developer experience. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. 
Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster (https://www.dataengineeringpodcast.com/dagster) today to get started. Your first 30 days are free! This episode is brought to you by Datafold – a testing automation platform for data engineers that prevents data quality issues from entering every part of your data workflow, from migration to dbt deployment. Datafold has recently launched data replication testing, providing ongoing validation for source-to-target replication. Leverage Datafold's fast cross-database data diffing and Monitoring to test your replication pipelines automatically and continuously. Validate consistency between source and target at any scale, and receive alerts about any discrepancies. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold (https://www.dataengineeringpodcast.com/datafold). Your host is Tobias Macey and today I'm interviewing Maayan Salom about how to incorporate observability into a dbt-oriented workflow and how Elementary can help Interview Introduction How did you get involved in the area of data management? Can you start by outlining what elements of observability are most relevant for dbt projects? What are some of the common ad-hoc/DIY methods that teams develop to acquire those insights? What are the challenges/shortcomings associated with those approaches? Over the past ~3 years there were numerous data observability systems/products created. What are some of the ways that the specifics of dbt workflows are not covered by those generalized tools? What are the insights that can be more easily generated by embedding into the dbt toolchain and development cycle? Can you describe what Elementary is and how it is designed to enhance the development and maintenance work in dbt projects? 
How is Elementary designed/implemented? How have the scope and goals of the project changed since you started working on it? What are the engineering challenges/frustrations that you have dealt with in the creation and evolution of Elementary? Can you talk us through the setup and workflow for teams adopting Elementary in their dbt projects? How does the incorporation of Elementary change the development habits of the teams who are using it? What are the most interesting, innovative, or unexpected ways that you have seen Elementary used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Elementary? When is Elementary the wrong choice? What do you have planned for the future of Elementary? Contact Info LinkedIn (https://www.linkedin.com/in/maayansa/?originalSubdomain=il) Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story. 
Links Elementary (https://www.elementary-data.com/) Data Observability (https://www.montecarlodata.com/blog-what-is-data-observability/) dbt (https://www.getdbt.com/) Datadog (https://www.datadoghq.com/) pre-commit (https://pre-commit.com/) dbt packages (https://docs.getdbt.com/docs/build/packages) SQLMesh (https://sqlmesh.readthedocs.io/en/latest/) Malloy (https://www.malloydata.dev/) SDF (https://www.sdf.com/) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)

Iceberg at Netflix and Beyond with Ryan Blue

Software Engineering Daily

Play Episode Listen Later Mar 7, 2024 47:37

Apache Iceberg is an open source, high-performance format for huge data tables. Iceberg enables the use of SQL tables for big data, while making it possible for engines like Spark and Hive to safely work with the same tables at the same time. Iceberg was started at Netflix by Ryan Blue and Dan Weeks.
