Is a Data Lakehouse the key to unlocking AI success? I had a great discussion on how Data Lakehouses are shaping the future of AI with Edward Calvesbert, VP, Product Management - watsonx, IBM, on The Ravit Show! We explored the evolving relationship between data quality and AI success, the biggest misconceptions companies have about using their data for AI, and how data architects and engineers need to adapt. We also tackled the challenge of data silos, the explosion of unstructured data, and the crucial balance between security, governance, and scaling AI across the enterprise. AI is only as good as the data behind it, so how do we ensure businesses are using the right data? #watsonx #sponsored #ibm #theravitshow
In this episode, I speak with Vinoth Chandar about Onehouse, a Universal Data Lakehouse and the only data platform instantly accessible from any engine, from BI to AI.
Try Raycast
Want to improve your productivity on macOS with a shortcut to everything? Try Raycast, and get 10% off with the link: go.chrischinchilla.com/raycast.
For show notes and an interactive transcript, visit chrischinchilla.com/podcast/
To reach out and say hello, visit chrischinchilla.com/contact/
To support the show for ad-free listening and extra content, visit chrischinchilla.com/support/
In this episode, I sit down with Varun Madan, Head of Engineering at Onehouse, to discuss how startups must operate with slimmer margins, both in decision-making and execution. We dive into the high-stakes hiring process, balancing efficiency with impact, managing context switching, and transitioning between IC and leadership roles.
Key Takeaways:
✅ Hiring at a startup requires extreme precision. Every hire matters, and balancing speed vs. fit is key to avoiding costly mistakes.
✅ Prioritization is everything. Engineering teams need to measure their impact weekly, ensuring they drive value rather than just delivering effort.
✅ A structured hiring pipeline saves time. Using data-driven hiring matrices can prevent wasted engineering hours spent on interviews that won't convert.
✅ Context switching is unavoidable, but it can be managed. Effective leaders block time on their calendars to focus on key areas without distraction.
✅ Blameless cultures drive improvement. Transparent postmortems and shared learning from mistakes help teams get stronger rather than fearful.
✅ Moving between IC and leadership roles can be a strategic advantage. Engineers who step back into IC roles often return as better leaders with deeper domain expertise.
Timestamped Highlights:
Microsoft Fabric offers two enterprise-scale, open-standard format workloads for data storage: Warehouse and Lakehouse. Which service should you choose? In this episode, we dive into the technical components of OneLake, along with some of the decisions you'll be asked to make as you start to build out your data infrastructure. These are two good articles we mention in the podcast that could help inform your decision on the services to implement in your OneLake. Microsoft Fabric Decision Guide: Choose between Warehouse and Lakehouse - Microsoft Fabric | Microsoft Learn Lakehouse vs Data Warehouse vs Real-Time Analytics/KQL Database: Deep Dive into Use Cases, Differences, and Architecture Designs | Microsoft Fabric Blog | Microsoft Fabric We hope you enjoyed this conversation on the nuances of data storage within Microsoft OneLake! If you have questions or comments, please send them our way. We would love to answer your questions on a future episode. Leave us a comment and some love ❤️ on LinkedIn, X, Facebook, or Instagram. The show notes for today's episode can be found at Episode 283: Data Lakehouse vs Data Warehouse vs My House. Have fun on the SQL Trail!
Pop quiz: What is an AI factory?
Unstructured data is the next frontier for AI: think video, audio, and more. David Nicholson is joined by Dell Technologies' Vice President of Product Management for Artificial Intelligence and Data Management Chad Dunn for a conversation on the strategic importance of high-quality data and the dynamic capabilities of the Dell Data Lakehouse in facilitating effective AI workloads. Highlights include ⤵️
Data quality is paramount: "Garbage in, expensive garbage out" applies more than ever in the age of generative AI
Dell's Data Lakehouse: This intelligent platform helps organizations extract, prepare, and analyze data for AI workloads, including both structured and unstructured data, with tools like Apache Spark and Trino
Customer experiences: The evolving landscape of data challenges in large enterprises
Pushing the boundaries: Dell's approach to managing unstructured data and integrating AI Factory visions into Lakehouse functionalities
The Datanation Podcast - Podcast for Data Engineers, Analysts and Scientists
Alex Merced discusses what he looks forward to in 2025 in the Data Lakehouse space.
Alex Merced Event Listings: https://lu.ma/Lakehouselinkups
Alex on Bluesky: https://bsky.app/profile/alextalksdatalakehouses.fyi
Alex on Twitter: https://x.com/AMdatalakehouse
Alex on LinkedIn: https://www.linkedin.com/in/alexmerced/
The Datanation Podcast - Podcast for Data Engineers, Analysts and Scientists
Register for the Catalog Course: https://drmevn.fyi/catalogcourse1024
Watch the Iceberg Crash Course: https://drmevn.fyi/icebergcourse1024
London Meetup: https://lu.ma/Lakehouselinkups
Paris Meetup: https://drmevn.fyi/1120-france-meetup
My Calendar of Events: https://lu.ma/Lakehouselinkups
Alex Merced (@AMdatalakehouse, Senior Tech Evangelist, @dremio) talks about everything data, and we dig deep into Apache Iceberg and data lakehouses.
SHOW: 865
Want to go to All Things Open in Raleigh for FREE? (Oct 27th-29th) We are offering 5 free passes, first come, first served, for the Cloudcast community -> Registration Link
Instructions: Click the reg link, click "Get Tickets", choose a ticket option, and proceed with registration (the discount will automatically be applied; cost will be $0)
SHOW TRANSCRIPT: The Cloudcast #865 Transcript
SHOW VIDEO: https://youtube.com/@TheCloudcastNET
CLOUD NEWS OF THE WEEK: http://bit.ly/cloudcast-cnotw
NEW TO CLOUD? CHECK OUT OUR OTHER PODCAST: "CLOUDCAST BASICS"
SHOW NOTES: Dremio (homepage), Hands-on with Apache Iceberg Tutorial, Apache Iceberg Crash Course, Data Lakehouses and Apache Hudi (Cloudcast Eps. 694), Apache Iceberg, the Definitive Guide (eBook), Apache Iceberg (homepage), Iceberg + Nessie Catalog (homepage), Iceberg + Polaris Catalog (homepage), AlexMerced.com, DataLakehouseHub.com
Topic 1 - Welcome to the show. Tell us a little bit about your background.
Topic 2 - It's been a little while since we talked about data lakehouses. Can you give us a little background on this space, and what are the most recent dynamics around these technologies?
Topic 3 - What are the typical integrations with a data lakehouse? How are users/developers typically interacting with data lakehouse technologies? [The marketplace for Iceberg catalogs like Nessie and Polaris]
Topic 4 - How does an open data format like Apache Iceberg fit into the bigger picture of data lakehouses, or large-scale stores of data?
Topic 5 - How does Dremio enable Iceberg? How does Dremio sit at the intersection of the data lakehouse, data mesh, and data virtualization trends, all of which come from the same fundamental problem: the growing scale of data use cases?
Topic 6 - We've seen companies start to rethink their data-in-the-cloud strategies. Are you seeing on-premises making a comeback for large data applications?
FEEDBACK?
Email: show at the cloudcast dot net
Twitter: @cloudcastpod
Instagram: @cloudcastpod
TikTok: @cloudcastpod
What makes data lakehouses a game changer in modern data management? In this episode, Bill sits down with Alex Merced, Senior Tech Evangelist at Dremio, to explore the evolution of data lakehouses and their role in bridging the gap between data lakes and data warehouses. Alex breaks down the components of data lakehouses and dives into the rise of Apache Iceberg.
---------
Key Quotes:
"I love to just get really deep into technology, really see what it does, and then scream from the rooftops how cool it is. And basically that was my charter. And [Apache] Iceberg, the more I learned about it, the more I realized this is really interesting."
"Interoperability and data. Basically, a lot of the things that kept data in silos is now breaking apart."
"So here we're talking about something that's going to be a standard. And that's where I think the highest levels of openness matter, because if it's something that a whole industry is going to build on, it should be something that the whole industry has a say in its evolution… And that's the beauty of openness, that it does create these nice sort of places where we can collaborate and compete together."
---------
Timestamps:
(01:32) How Alex got started in his career
(03:54) Breaking down data lakehouses
(07:08) The idea behind an open data lakehouse
(10:10) Alex's involvement with Apache Iceberg
(15:13) Key components of a data lakehouse
(23:41) The growth of Apache Iceberg
(32:07) Dremio's Apache Iceberg crash course
(38:43) Explaining self-service analytics
---------
Sponsor:
Over the Edge is brought to you by Dell Technologies to unlock the potential of your infrastructure with edge solutions. From hardware and software to data and operations, across your entire multi-cloud environment, we're here to help you simplify your edge so you can generate more value. Learn more by visiting dell.com/edge for more information or click on the link in the show notes.
---------
Credits:
Over the Edge is hosted by Bill Pfeifer, and was created by Matt Trifiro and Ian Faison. Executive producers are Matt Trifiro, Ian Faison, Jon Libbey and Kyle Rusca. The show producer is Erin Stenhouse. The audio engineer is Brian Thomas. Additional production support from Elisabeth Plutko.
---------
Links:
Follow Bill on LinkedIn
Follow Alex on LinkedIn
Ori Rafael, CEO and co-founder of Upsolver explores the future of data management through data lakehouses. He explains the evolution of the lakehouse, a revolutionary architecture that combines the best of data lakes and warehouses. You will gain insights into key technologies like Apache Iceberg, how lakehouses enable advanced use cases such as AI, and how they help businesses reduce costs.
In this episode, we sit down with Ori Rafael, CEO and Co-founder of Upsolver, to explore the rise of the lakehouse architecture and its significance in modern data management. Ori breaks down the origins of the lakehouse and how it leverages S3 to provide scalable and cost-effective storage. We discuss the critical role of open table formats like Apache Iceberg in unifying data lakes and warehouses, and how ETL processes differ between these environments. Ori also shares his vision for the future, highlighting how Upsolver is positioned to empower organizations as they navigate the rapidly evolving data landscape.
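To make the open-table-format idea from this episode concrete, below is a minimal sketch, assuming a recent PyIceberg with PyArrow installed, of an Iceberg table created on plain local storage through a SQLite-backed catalog. The catalog settings, paths, and table names are illustrative assumptions, not anything shown in the episode.

```python
# Minimal sketch: an Iceberg table on plain file/object storage, with
# no warehouse engine involved. Assumes a recent PyIceberg
# (`pip install "pyiceberg[sql-sqlite]" pyarrow`); all names and
# paths are illustrative.
import os
import pyarrow as pa
from pyiceberg.catalog.sql import SqlCatalog

os.makedirs("/tmp/iceberg/warehouse", exist_ok=True)
catalog = SqlCatalog(
    "local",
    uri="sqlite:////tmp/iceberg/catalog.db",    # where table pointers live
    warehouse="file:///tmp/iceberg/warehouse",  # where data and manifests live
)
catalog.create_namespace("demo")

events = pa.table({
    "event_id": pa.array([1, 2, 3], type=pa.int64()),
    "action": pa.array(["view", "click", "buy"]),
})

# Iceberg tracks schema, snapshots, and stats in metadata files next
# to the data, so any Iceberg-aware engine can read what we write here.
table = catalog.create_table("demo.events", schema=events.schema)
table.append(events)  # commits a new snapshot
print(table.scan().to_arrow().to_pydict())
```

The point of the format is that this metadata travels with the table, which is how a warehouse engine and a Python process can safely share the same files on low-cost storage.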
Dell Technologies has announced significant performance and connectivity enhancements to its Dell Data Lakehouse platform. These enhancements are designed to accelerate AI initiatives and streamline data access, providing businesses with fast query speeds, expanded data sources, simplified management and powerful analytics. The key features of Dell Data Lakehouse v1.1 include enhanced performance, improved connectivity, simplified management and expanded accessibility.
Turbocharged performance: New Warp Speed technology and high-performance SSDs boost query performance by 3x to 5x by automatically learning query patterns and optimising indexes and caches, allowing businesses to extract insights from data faster than ever before.
Improved connectivity: Dell Technologies has enhanced connectivity options by securely connecting to an existing Hive Metastore via Kerberos for seamless metadata operations and improved data governance. The new Neo4j graph database connector is now in public preview, and the Snowflake connector has been optimised for efficient querying. Additionally, upgraded connectors for Iceberg, Delta Lake, Hive, and other popular data sources ensure faster and more capable operations.
Simplified management: Dell has streamlined operations with new features to ensure system robustness and security. Dell support teams can now easily assess cluster health before or after installation or upgrades, ensuring zero downtime. The system also sends critical hardware failure alerts directly to Dell Support for proactive handling. Additionally, optional end-to-end encryption for internal components is available to secure the Lakehouse.
Expanded accessibility: Dell now offers a new 5-year software subscription option, complementing the existing 1- and 3-year subscriptions, to align hardware and software support terms. To meet growing demand, the Dell Data Lakehouse is now available in more countries across Europe, Africa, and Asia. Additionally, customers can now access the Dell Data Lakehouse in the Dell Demo Center, and soon in the Customer Solution Center, for interactive exploration and validation.
Speaking about the new updates in Dell's modern data lakehouse, Vrashank Jain, Product Manager - Data Management at Dell Technologies, said, "Dell Data Lakehouse with Warp Speed sets a new benchmark in data lake analytics, empowering organisations to derive insights from their data more quickly and efficiently than ever before. Warp Speed unlocks the full potential of the Dell Data Lakehouse, paving the way for accelerated and budget-friendly innovation and growth in the AI era."
To get a full, hands-on experience, visit the Dell Demo Center to interactively explore the Dell Data Lakehouse with labs developed by Dell Technologies' experts. Businesses and organisations can also contact their Dell account executive to explore the Dell Data Lakehouse for their data needs.
More about Irish Tech News
Irish Tech News is Ireland's No. 1 online tech publication and often Ireland's No. 1 tech podcast too. You can find hundreds of fantastic previous episodes and subscribe using whatever platform you like via our Anchor.fm page here: https://anchor.fm/irish-tech-news
If you'd like to be featured in an upcoming podcast, email us at Simon@IrishTechNews.ie now to discuss. Irish Tech News have a range of services available to help promote your business. Why not drop us a line at Info@IrishTechNews.ie now to find out more about how we can help you reach our audience.
You can also find and follow us on Twitter, LinkedIn, Facebook, Instagram, TikTok and Snapchat.
Justin Borgman, Co-Founder and CEO of Starburst, explores the cutting-edge world of data management and analytics. Justin shares insights into Starburst's innovative use of Trino and Apache Iceberg, revolutionizing data warehousing and analytics. Learn about the company's journey, the evolution of data lakes, and the role of data science in modern enterprises. Episode Overview: In Episode 86 of Great Things with Great Tech, Anthony Spiteri chats with Justin Borgman, Co-Founder and CEO of Starburst. This episode dives into the transformative world of data management and analytics, exploring how Starburst leverages cutting-edge technologies like Trino and Apache Iceberg to revolutionize data warehousing. Justin shares his journey from founding Hadapt to leading Starburst, the evolution of data lakes, and the critical role of data science in today's tech landscape. Key Topics Discussed: Starburst's Origins and Vision: Justin discusses the founding of Starburst and the vision to democratize data access and eliminate data silos. Trino and Iceberg: The importance of Trino as a SQL query engine and Iceberg as an open table format in modern data management. Data Democratization: How Starburst enables organizations to perform high-performance analytics on data stored anywhere, avoiding vendor lock-in. Data Science Evolution: Insights into what it takes to become a data scientist today, emphasizing continuous learning and adaptability. Future of Data Management: The shift towards data and AI operating systems, and Starburst's role in shaping this future. Technology and Technology Partners Mentioned: Starburst, Trino, Apache Iceberg, Teradata, Hadoop, SQL, S3, Azure, Google Cloud Storage, Kafka, Dell, Data Lakehouse, AI, Machine Learning, Big Data, Data Governance, Data Ingestion, Data Management, Capacity Management, Data Security, Compliance, Open Source ☑️ Web: https://www.starburst.io ☑️ Support the Channel: https://ko-fi.com/gtwgt ☑️ Be on #GTwGT: Contact via Twitter @GTwGTPodcast or visit https://www.gtwgt.com ☑️ Subscribe to YouTube: https://www.youtube.com/@GTwGTPodcast?sub_confirmation=1 Check out the full episode on our platforms: YouTube: https://youtu.be/kmB_pjGb5Js Spotify: https://open.spotify.com/episode/2l9aZpvwhWcdmL0lErpUHC?si=x3YOQw_4Sp-vtdjyroMk3Q Apple Podcasts: https://podcasts.apple.com/us/podcast/darknet-diaries-with-jack-rhysider-episode-83/id1519439787?i=1000654665731 Follow Us: Website: https://gtwgt.com Twitter: https://twitter.com/GTwGTPodcast Instagram: https://instagram.com/GTwGTPodcast ☑️ Music: https://www.bensound.com
On this episode of Marketing Art and Science, Host and CMO Advisor Lisa Martin is joined by SingleStore CMO Madhukar Kumar. Tune into this 30 min discussion, as they explore the symbiotic value of a product-led growth (PLG) strategy on sales and marketing, the dynamic blend of creativity and analytics in marketing at SingleStore, and the influence of AI and emerging technologies on the customer experience. Their discussion covers: Madhukar Kumar's career journey, transitioning from a journalist to a developer to a tech CMO. PLG strategies that tightly align sales and marketing and facilitate conversion of customers from freemium to enterprise models. The evolving landscape of MarTech and its impact on modern marketing strategies, including the integration of creativity and analytics. Insights into the use of AI, gen AI, and emerging technologies in marketing and their influence on the customer experience and key performance metrics.
The Datanation Podcast - Podcast for Data Engineers, Analysts and Scientists
Alex Merced discusses the benefits of Apache Iceberg’s open data ecosystem!
Build a Data Lakehouse on Your Laptop
Deploy into Production
Summary Data lakehouse architectures are gaining popularity due to the flexibility and cost effectiveness that they offer. The link that bridges the gap between data lake and warehouse capabilities is the catalog. The primary purpose of the catalog is to inform the query engine of what data exists and where, but the Nessie project aims to go beyond that simple utility. In this episode Alex Merced explains how the branching and merging functionality in Nessie allows you to use the same versioning semantics for your data lakehouse that you are used to from Git. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster (https://www.dataengineeringpodcast.com/dagster) today to get started. Your first 30 days are free! Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Join us at the top event for the global data community, Data Council Austin. From March 26-28th 2024, we'll play host to hundreds of attendees, 100 top speakers and dozens of startups that are advancing data science, engineering and AI. Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working together to build the future of data and sharing their insights and learnings through deeply technical talks. As a listener to the Data Engineering Podcast you can get a special discount off regular priced and late bird tickets by using the promo code dataengpod20. Don't miss out on our only event this year! Visit dataengineeringpodcast.com/data-council (https://www.dataengineeringpodcast.com/data-council) and use code dataengpod20 to register today! Your host is Tobias Macey and today I'm interviewing Alex Merced, developer advocate at Dremio and co-author of the upcoming book from O'Reilly, "Apache Iceberg: The Definitive Guide", about Nessie, a Git-like versioned catalog for data lakes using Apache Iceberg Interview Introduction How did you get involved in the area of data management? Can you describe what Nessie is and the story behind it? What are the core problems/complexities that Nessie is designed to solve? 
The closest analogue to Nessie that I've seen in the ecosystem is LakeFS. What are the features that would lead someone to choose one or the other for a given use case? Why would someone choose Nessie over native table-level branching in the Apache Iceberg spec? How do the versioning capabilities compare to/augment the data versioning in Iceberg? What are some of the sources of, and challenges in resolving, merge conflicts between table branches? Can you describe the architecture of Nessie? How have the design and goals of the project changed since it was first created? What is involved in integrating Nessie into a given data stack? For cases where a given query/compute engine doesn't natively support Nessie, what are the options for using it effectively? How does the inclusion of Nessie in a data lake influence the overall workflow of developing/deploying/evolving processing flows? What are the most interesting, innovative, or unexpected ways that you have seen Nessie used? What are the most interesting, unexpected, or challenging lessons that you have learned while working with Nessie? When is Nessie the wrong choice? What have you heard is planned for the future of Nessie? Contact Info LinkedIn (https://www.linkedin.com/in/alexmerced) Twitter (https://www.twitter.com/amdatalakehouse) Alex's Article on Dremio's Blog (https://www.dremio.com/authors/alex-merced/) Alex's Substack (https://amdatalakehouse.substack.com/) Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com)) with your story. Links Project Nessie (https://projectnessie.org/) Article: What is Nessie, Catalog Versioning and Git-for-Data? 
(https://www.dremio.com/blog/what-is-nessie-catalog-versioning-and-git-for-data/) Article: What is Lakehouse Management?: Git-for-Data, Automated Apache Iceberg Table Maintenance and more (https://www.dremio.com/blog/what-is-lakehouse-management-git-for-data-automated-apache-iceberg-table-maintenance-and-more/) Free Early Release Copy of "Apache Iceberg: The Definitive Guide" (https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html) Iceberg (https://iceberg.apache.org/) Podcast Episode (https://www.dataengineeringpodcast.com/iceberg-with-ryan-blue-episode-52/) Arrow (https://arrow.apache.org/) Podcast Episode (https://www.dataengineeringpodcast.com/voltron-data-apache-arrow-episode-346/) Data Lakehouse (https://www.forbes.com/sites/bernardmarr/2022/01/18/what-is-a-data-lakehouse-a-super-simple-explanation-for-anyone/?sh=6cc46c8c6088) LakeFS (https://lakefs.io/) Podcast Episode (https://www.dataengineeringpodcast.com/lakefs-data-lake-versioning-episode-157) AWS Glue (https://aws.amazon.com/glue/) Tabular (https://tabular.io/) Podcast Episode (https://www.dataengineeringpodcast.com/tabular-iceberg-lakehouse-tables-episode-363) Trino (https://trino.io/) Presto (https://prestodb.io/) Dremio (https://www.dremio.com/) Podcast Episode (https://www.dataengineeringpodcast.com/dremio-with-tomer-shiran-episode-58) RocksDB (https://rocksdb.org/) Delta Lake (https://delta.io/) Podcast Episode (https://www.dataengineeringpodcast.com/delta-lake-data-lake-episode-85/) Hive Metastore (https://cwiki.apache.org/confluence/display/hive/design#Design-Metastore) PyIceberg (https://py.iceberg.apache.org/) Optimistic Concurrency Control (https://en.wikipedia.org/wiki/Optimistic_concurrency_control) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
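As a rough illustration of the Git-like semantics discussed in this episode, here is a minimal sketch using Nessie's Spark SQL extensions from PySpark. It assumes a Nessie server on localhost:19120 and the matching iceberg-spark-runtime and nessie-spark-extensions jars on the Spark classpath; the branch, namespace, and warehouse path are illustrative assumptions, not anything from the episode.

```python
# A rough sketch of the Git-for-data workflow: branch, write in
# isolation, then merge back to main. Assumes a local Nessie server
# and the Iceberg/Nessie Spark jars; all names are illustrative.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("nessie-branching-sketch")
    .config(
        "spark.sql.extensions",
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,"
        "org.projectnessie.spark.extensions.NessieSparkSessionExtensions",
    )
    .config("spark.sql.catalog.nessie", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.nessie.catalog-impl",
            "org.apache.iceberg.nessie.NessieCatalog")
    .config("spark.sql.catalog.nessie.uri", "http://localhost:19120/api/v1")
    .config("spark.sql.catalog.nessie.ref", "main")
    .config("spark.sql.catalog.nessie.warehouse", "file:///tmp/nessie-warehouse")
    .getOrCreate()
)

# The same propose/review/publish loop Git gives you for code,
# applied to tables in the catalog.
spark.sql("CREATE NAMESPACE IF NOT EXISTS nessie.demo")
spark.sql("CREATE BRANCH IF NOT EXISTS etl IN nessie FROM main")
spark.sql("USE REFERENCE etl IN nessie")
spark.sql("CREATE TABLE IF NOT EXISTS nessie.demo.events "
          "(id BIGINT, action STRING) USING iceberg")
spark.sql("INSERT INTO nessie.demo.events VALUES (1, 'loaded-on-branch')")
spark.sql("MERGE BRANCH etl INTO main IN nessie")  # publish atomically
```

Until the final MERGE BRANCH, readers on main never see the in-flight writes, which is the catalog-level versioning the episode contrasts with Iceberg's table-level branching.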
Ravit Jain had a chat with Alex Merced, Developer Advocate at Dremio, during the Chill Data Summit! They discussed Apache Iceberg, the future of Iceberg, the data lakehouse, and much more! #chilldatasummit #theravitshow
Summary A data lakehouse is intended to combine the benefits of data lakes (cost effective, scalable storage and compute) and data warehouses (user friendly SQL interface). Multiple open source projects and vendors have been working together to make this vision a reality. In this episode Dain Sundstrom, CTO of Starburst, explains how the combination of the Trino query engine and the Iceberg table format offer the ease of use and execution speed of data warehouses with the infinite storage and scalability of data lakes. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster (https://www.dataengineeringpodcast.com/dagster) today to get started. Your first 30 days are free! Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Join in at the event for the global data community, Data Council Austin. From March 26th-28th 2024, they'll play host to hundreds of attendees, 100 top speakers, and dozens of startups that are advancing data science, engineering and AI. Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working together to build the future of data. As a listener to the Data Engineering Podcast you can get a special discount of 20% off your ticket by using the promo code dataengpod20. Don't miss out on their only event this year! Visit: dataengineeringpodcast.com/data-council (https://www.dataengineeringpodcast.com/data-council) today. Your host is Tobias Macey and today I'm interviewing Dain Sundstrom about building a data lakehouse with Trino and Iceberg Interview Introduction How did you get involved in the area of data management? To start, can you share your definition of what constitutes a "Data Lakehouse"? What are the technical/architectural/UX challenges that have hindered the progression of lakehouses? What are the notable advancements in recent months/years that make them a more viable platform choice? There are multiple tools and vendors that have adopted the "data lakehouse" terminology. 
What are the benefits offered by the combination of Trino and Iceberg? What are the key points of comparison for that combination in relation to other possible selections? What are the pain points that are still prevalent in lakehouse architectures as compared to warehouse or vertically integrated systems? What progress is being made (within or across the ecosystem) to address those sharp edges? For someone who is interested in building a data lakehouse with Trino and Iceberg, how does that influence their selection of other platform elements? What are the differences in terms of pipeline design/access and usage patterns when using a Trino/Iceberg lakehouse as compared to other popular warehouse/lakehouse structures? What are the most interesting, innovative, or unexpected ways that you have seen Trino lakehouses used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on the data lakehouse ecosystem? When is a lakehouse the wrong choice? What do you have planned for the future of Trino/Starburst? Contact Info LinkedIn (https://www.linkedin.com/in/dainsundstrom/) dain (https://github.com/dain) on GitHub Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com)) with your story. 
Links Trino (https://trino.io/) Starburst (https://www.starburst.io/) Presto (https://prestodb.io/) JBoss (https://en.wikipedia.org/wiki/JBoss_Enterprise_Application_Platform) Java EE (https://www.oracle.com/java/technologies/java-ee-glance.html) HDFS (https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html) S3 (https://aws.amazon.com/s3/) GCS == Google Cloud Storage (https://cloud.google.com/storage?hl=en) Hive (https://hive.apache.org/) Hive ACID (https://cwiki.apache.org/confluence/display/hive/hive+transactions) Apache Ranger (https://ranger.apache.org/) OPA == Open Policy Agent (https://www.openpolicyagent.org/) Oso (https://www.osohq.com/) AWS Lakeformation (https://aws.amazon.com/lake-formation/) Tabular (https://tabular.io/) Iceberg (https://iceberg.apache.org/) Podcast Episode (https://www.dataengineeringpodcast.com/iceberg-with-ryan-blue-episode-52/) Delta Lake (https://delta.io/) Podcast Episode (https://www.dataengineeringpodcast.com/delta-lake-data-lake-episode-85/) Debezium (https://debezium.io/) Podcast Episode (https://www.dataengineeringpodcast.com/debezium-change-data-capture-episode-114) Materialized View (https://en.wikipedia.org/wiki/Materialized_view) Clickhouse (https://clickhouse.com/) Druid (https://druid.apache.org/) Hudi (https://hudi.apache.org/) Podcast Episode (https://www.dataengineeringpodcast.com/hudi-streaming-data-lake-episode-209) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
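To ground the Trino-plus-Iceberg combination this episode describes, here is a minimal sketch using the trino Python client against a cluster that has an Iceberg catalog mounted as `iceberg`. The host, schema, and table names are illustrative assumptions, not anything from the episode.

```python
# A rough sketch of warehouse-style SQL over files in a data lake.
# Assumes `pip install trino` and a Trino cluster with an Iceberg
# catalog named `iceberg`; connection details are illustrative.
import trino

conn = trino.dbapi.connect(
    host="localhost",
    port=8080,
    user="analyst",
    catalog="iceberg",
    schema="analytics",
)
cur = conn.cursor()
cur.execute("CREATE SCHEMA IF NOT EXISTS iceberg.analytics")

# Iceberg supplies the warehouse niceties (transactions, schema
# evolution, hidden partitioning) on top of plain object storage.
cur.execute("""
    CREATE TABLE IF NOT EXISTS events (
        event_ts TIMESTAMP(6),
        user_id  BIGINT,
        action   VARCHAR
    ) WITH (partitioning = ARRAY['day(event_ts)'])
""")
cur.execute("INSERT INTO events VALUES (TIMESTAMP '2024-03-01 10:00:00', 42, 'login')")
cur.execute("SELECT action, count(*) FROM events GROUP BY action")
print(cur.fetchall())
```

The hidden partitioning (`day(event_ts)`) is part of what makes this feel like a warehouse table: queries simply filter on the timestamp and partition pruning happens without a synthetic partition column leaking into the schema.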
Our guest today, Benjamin Aunkofer, is active in the following roles: - Chief Data Scientist & Founder of DATANOMIQ (innovative Data & AI services for all companies) - Founder & Co-CEO of AUDAVIS (AI-powered automated auditing cloud platform) - Trainer for data science and AI - Interim Head of BI / Process Mining / Data Science - Runs the blog www.data-science-blog.com
-----------------------------------
Among other things, we talk about these topics: 1. Data scientists are often poorly qualified and often have "either-or" skills. 2. Companies still need more data engineers than data scientists. 3. Business intelligence, process mining, and data science are often viewed too separately. And data gaps are not obstacles, but findings on the way to data and process transparency. 4. Data lakehousing and data mesh serve up data for BI, process mining, and data science all at once. Companies lose money on duplicated data provisioning and storage. 5. AI is not only automating media, financial reports, and auditing; it is also replacing data engineers and data scientists.
-----------------------------------
0:00 - Being a freelancer isn't easy, and the road to DATANOMIQ 7:00 - A look at data science degree programs 17:28 - Many data scientists are not data scientists 25:00 - Data warehouse, data lakehouse, and machine learning 36:10 - What do Benjamin and DATANOMIQ actually do? 40:30 - Where companies lose the most money when it comes to data 43:50 - Building a data department 47:00 - Good continuing education is the key for data scientists 52:15 - €150,000+ salaries for top data scientists 54:00 - Founding AUDAVIS 59:00 - Hybrid AI 1:11:38 - Outlook on the future
-----------------------------------
Further information: ► Benjamin on LinkedIn: https://www.linkedin.com/in/benjamin-aunkofer-98710714/ ► Bernard on LinkedIn: https://www.linkedin.com/in/bernardsonnenschein/ ► Use the code "Friends20" at https://www.eventbrite.de/e/dataunplugged-tickets-686542897287 to get a 20% discount on your ticket. ► We also thank our partner, the Public Cloud Group (PCG): https://hubs.li/Q02cH6qN0
The Datanation Podcast - Podcast for Data Engineers, Analysts and Scientists
Find all my data resources below: https://bio.alexmerced.com/data
Listen to the State of the Data Lakehouse Podcast here: https://em360tech.com/podcast/dremio-state-data-lakehouse?utm_source=podcasts&utm_medium=podcast&utm_content=content&utm_campaign=alexmercedcontent&utm_term=iceberg+lakehouse+nessie
The data lakehouse has been quickly gaining popularity within the data management and analytics space. Combining elements of data lakes and data warehouses, the data lakehouse aims to address the challenges associated with both in a way which helps companies reach their data and business goals. In this episode of the EM360 Podcast, Christina Stathopoulos speaks to Read Maloney, CMO at Dremio, as they discuss:The state of the data lakehouseHow an effective lakehouse strategy can be the key to digital transformationHow data lakehouses empower business
Big Data, Data Lakes, and Data Ponds. In today's episode of Entre Chaves, our hosts discuss data technologies with our guest Rafaela Milagres, a data engineer at dti. Come learn with us about the best practices and ways to process and store your important data, and find out more about the new rules the LGPD has introduced into these processes. Listen now and learn more about these technologies and the challenges of each one. Hit play now! Hosts: Fernanda Vieira and Lucas Campregher. Guest: Rafaela Milagres
Brian Suk, Associate CTO at SADA hosts episode 166 of Cloud N Clear to discuss all things data in a hybrid cloud world. He is joined by Adrian Estala, VP Field CDO at Starburst – a data-driven company offering a full-featured data lake analytics platform, built on open source Trino. Learn more about simplifying customer pipelines to make data more useful, insights on Data Lakehouse tools, and how to get your data ‘right' and get it right fast. Join us in this engaging episode, and don't forget to LIKE, SHARE, & SUBSCRIBE for more enlightening content! ✅
In today's episode, Luan Moreno, Mateus Oliveira, and Orlando Marley interview Bill Inmon, creator of the Data Warehouse concept and author of several books on data-related topics. The Data Warehouse is the concept of centralizing an organization's analytical data in order to build a 360° view of the business. In this episode, you will learn: the differences between OLTP and OLAP; the history of using data for decision-making; and how to create a resilient process for understanding the facts in your data. In this conversation we also cover: Bill Inmon's story; the pillars of analytical systems; and the new generation of analytical data platforms. Learn more about data analysis, and how to use technology to make your analytical environment reliable and resilient, in the words of the father of the Data Warehouse. Bill Inmon = LinkedIn Luan Moreno = https://www.linkedin.com/in/luanmoreno/
React fast to changes in data with an automated system of detection and action using Data Activator. Monitor and track changes at a granular level as they happen, instead of at an aggregate level, where important insights may be left in the detail and have already become a problem. As a domain expert, this provides a no code way to take data, whether real-time streaming from your IoT devices, or batch data collected from your business systems, and dynamically monitor patterns by establishing conditions. When these conditions are met, Data Activator automatically triggers specific actions, such as notifying dedicated teams or initiating system-level remediations. Join Will Thompson, Group Product Manager for Data Activator, as he shares how to monitor granular high volume of operational data and translate it into specific actions. ► QUICK LINKS: 00:00 - Monitor and track operational data in real-time 00:53 - Demo: Logistics company use case 02:49 - Add a condition 04:04 - Test actions 04:36 - Batch data 06:21 - Trigger an automated workflow 07:12 - How it works 08:12 - Wrap up ► Link References Get started at https://aka.ms/dataActivatorPreview Check out the Data Activator announcement blog at https://aka.ms/dataActivatorBlog ► Unfamiliar with Microsoft Mechanics? As Microsoft's official video series for IT, you can watch and share valuable content and demos of current and upcoming tech from the people who build it at Microsoft. • Subscribe to our YouTube: https://www.youtube.com/c/MicrosoftMechanicsSeries • Talk with other IT Pros, join us on the Microsoft Tech Community: https://techcommunity.microsoft.com/t5/microsoft-mechanics-blog/bg-p/MicrosoftMechanicsBlog • Watch or listen from anywhere, subscribe to our podcast: https://microsoftmechanics.libsyn.com/podcast ► Keep getting this insider knowledge, join us on social: • Follow us on Twitter: https://twitter.com/MSFTMechanics • Share knowledge on LinkedIn: https://www.linkedin.com/company/microsoft-mechanics/ • Enjoy us on Instagram: https://www.instagram.com/msftmechanics/ • Loosen up with us on TikTok: https://www.tiktok.com/@msftmechanics
Welcome to another episode of Category Visionaries — the show that explores GTM stories from tech's most innovative B2B founders. In today's episode, we're speaking with Vinoth Chandar, founder of Onehouse, a cloud-native data lakehouse solution that has raised $33 million in funding. Here are the most interesting points from our conversation: Vinoth's Rich Background: Vinoth has a deep technical background in database and distributed systems engineering, with significant stints at Oracle, LinkedIn, Uber, and Confluent before founding Onehouse. Innovating at Uber: Vinoth was part of the team at Uber that built the world's first data lakehouse in 2016, a precursor to the modern data lakehouse concept, showcasing his pioneering work in data management. Lessons from Uber's Turbulent Times: Vinoth's time at Uber during its most tumultuous periods taught him resilience and the importance of a positive product impact, underscoring the critical role of culture and team dynamics in navigating challenges. Admiration for Linus Torvalds: Vinoth admires Linus Torvalds for demonstrating how open-source software can create immense value over time, reflecting on the transformative power of collaborative software development. Influential Reads: Books like "The Journey is the Reward," a biography of Steve Jobs, and "Crossing the Chasm" have profoundly impacted Vinoth, offering insights into market innovation and the importance of category creation. Onehouse's Mission: Onehouse aims to revolutionize data management with its faster, better, and cheaper cloud data infrastructure, leveraging Apache Hudi to enable efficient data processing and management. Navigating Product Market Fit: Identifying and refining Onehouse's customer profile and value proposition in a crowded and noisy market has been a significant challenge, with open-source community engagement playing a crucial role in building credibility and trust. Future Vision: Vinoth envisions Onehouse completing its product story to bring an interoperable cloud data plane to life, focusing on managing and transforming data effectively for businesses.
Sports analytics draws on video, written scouting reports, streaming data, and the numerical results of games. How can these types of analytics be accomplished? Where does generative AI fit? Enter the data lakehouse. Ari Kaplan, the real Moneyball guy, will share his experience with Tim and Juan.
The Datanation Podcast - Podcast for Data Engineers, Analysts and Scientists
In just a few commands, you can have everything you need to practice ingestion and querying with popular data software. Just install Docker and then run the commands in the image. You can also follow the directions in this blog: https://lnkd.in/eDiC8fc6
Also try out this video series: https://lnkd.in/gp843ErM
With each passing day, more and more data sources are sending greater volumes of data across the globe. For any organization, this combination of structured and unstructured data continues to be a challenge. Data lakehouses link, correlate, and analyze these varied outputs into a single manageable system. In the final episode of the season, hosts Lois Houston and Nikita Abraham, along with Greg Genovese, discuss Oracle Data Lakehouse, the premier solution for leveraging data to make better, more profitable business decisions. Oracle MyLearn: https://mylearn.oracle.com/ Oracle University Learning Community: https://education.oracle.com/ou-community LinkedIn: https://www.linkedin.com/showcase/oracle-university/ Twitter: https://twitter.com/Oracle_Edu Special thanks to Arijit Ghosh, David Wright, Ranbir Singh, and the OU Studio Team for helping us create this episode. -------------------------------------------------------- Episode Transcript: 00;00;00;00 - 00;00;39;03 Welcome to the Oracle University Podcast, the first stop on your cloud journey. During this series of informative podcasts, we'll bring you foundational training on the most popular Oracle technologies. Let's get started! Hello and welcome to the Oracle University Podcast. I'm Nikita Abraham, Principal Technical Editor with Oracle University, and with me is Lois Houston, Director of Product Innovation and Go to Market Programs. 00;00;39;06 - 00;01;17;11 Hi there! Last week, we spoke about managing Oracle Database with REST APIs and also looked at ADB built-in tools. Today's episode is the last one of the season, and we're going to be joined by Oracle Database Specialist Greg Genovese, who will talk with us about Oracle Data Lakehouse. Hi, Greg. I've heard about data lakes and data warehouses, but what's a lakehouse? Traditionally, when deciding how to best increase their data productivity and liquidity, companies often find themselves having to make a choice between leveraging a data lake or a data warehouse, each of them having their own benefits and drawbacks. 00;01;17;13 - 00;01;43;20 Now, companies no longer need to make that choice. Instead, they can look to a broader strategy that offers highly accurate machine learning capabilities, the flexibility of using open-source services, and the superior data and analytics capabilities of the best-in-class Oracle Database and Data Warehouse. These capabilities are integrated with common identity, data integration, orchestration, and catalog into a unified architecture - the Oracle Lakehouse. 00;01;43;24 - 00;02;12;26 What are the benefits of Oracle Lakehouse? Oracle Lakehouse facilitates ease of data reuse and recombination, maximizing insights from your data and generating several other benefits, including pure cost savings, as well as improving the agility of your current data warehouse by easily extending with new metrics, details, or attributes, which help you better understand your customers, your processes, or your risks, all while using your existing applications. 00;02;13;01 - 00;02;33;09 Is this only for companies that are already using Oracle Cloud Infrastructure? For those of you companies who haven't yet adopted Oracle Cloud Infrastructure, but instead have existing data lakes on AWS or Azure, if you still want to make that data available to the Oracle Autonomous Database, you can reach out to these data lakes using Oracle SQL. 
00;02;33;10 - 00;02;57;17 Here at OCI, we feel your experience would not be a productive one if you weren't allowed to use your choice of tools and applications, such as Analytics Cloud, Tableau, Looker, Notebook, REST, Python, and more. Can you tell us more about how Oracle Data Lakehouse works? It combines current data warehouse and data lake components with capabilities to also include external or third-party data. 00;02;57;20 - 00;03;29;04 This effectively eliminates data silos or having to manually move data between data warehouses and data lakes if you leverage both currently. The five key elements of the Oracle Lakehouse are the data warehouse, the data lake for raw data normally used for loading and staging data, managed open-source services to support Spark, Hadoop, and Redis, data integration, moving data depending on use case, and data catalog, which maintains a complete view of the available data for discovery and governance. 00;03;29;07 - 00;03;49;29 With these elements, you can write the data once with any engine and analyze or even build machine learning modules from any of your current data. How did the idea for data lakehouse come about? What was the need for it? Using all data to innovate, this is the challenge, to include all of your data and use it to drive better, more profitable business decisions. 00;03;50;02 - 00;04;14;07 Some data is easy to access, but accessing all of your data and then correlating that data in a way that helps make decisions and drive better outcomes isn't easy. So, the opportunity we've identified here is harnessing the power of all that data and creating a competitive advantage from it. But how do we do that? How do we run and maintain what we've got today efficiently, quickly, and securely? 00;04;14;08 - 00;04;42;02 We have functions that move data from sources to outcomes. The process is taking the source, going through integrations, and connecting the different data. Once we've done this, traditionally, we looked at persistence, processing the data and storing it somewhere to pass along for analysis. This has connected and curated the data for outcomes. The Oracle Lakehouse is a solution leveraging multiple tools and products to get the desired outcomes from this process. 00;04;42;04 - 00;05;05;17 You can use existing data warehouses to start, and the data warehouse, especially the Converged Autonomous Database, allows for storing all types of data. This is for the relational structured data to store in an Oracle autonomous database or warehouse. The Autonomous Data Warehouse is self-managed with better performance and efficiencies to help focus on the analysis and the outcomes of the data. 00;05;05;20 - 00;05;23;12 The unstructured or raw data can be persisted in any data type in its current format within object storage. This can be within an existing data lake, for example. Object storage is an efficient manner to land data where it's needed. 00;05;23;15 - 00;05;52;12 Are you attending Oracle CloudWorld 2023? Learn from experts, network with peers, and find out about the latest innovations when Oracle CloudWorld returns to Las Vegas from September 18 through 21. CloudWorld is the best place to learn about Oracle solutions from the people who build and use them. In addition to your attendance at CloudWorld, your ticket gives you access to Oracle MyLearn and all of the cloud learning subscription content, as well as three free certification exam credits. 
00;05;52;18 - 00;06;28;13 This is valid from the week you register through 60 days after the conference. So, what are you waiting for? Register today. Learn more about Oracle CloudWorld at www.oracle.com/cloudworld. Welcome back. Okay, so Greg, you spoke about the start of data lakehouse. Tell us about data integration and analysis. Lakehouse provides for an all-encompassing orchestration of integration and allows your choice of tools to keep your source of truth and compliance for your data. 00;06;28;16 - 00;07;03;00 Whether you decide to deploy Oracle GoldenGate, the premier data integration tool, Oracle Data Integration, helping you move data within the lake, or even an open-source or third-party tool, Lakehouse is by design flexible and meant to fit your specific needs. Oracle Analytics Cloud is used to perform predictive analytics, and other third-party tools can read into the data from the database APIs or using SQL. Oracle AI Service has machine learning models that will continue to work with the transactional systems and bring in other data types as well. 00;07;03;03 - 00;07;35;14 OCI Data Science can harness all of the data for better business outcomes and fills in the tools for integration and analysis for the Oracle Data Lakehouse. Within the Autonomous Data Warehouse, we have transactional and dimensional query capabilities, but in our Lakehouse story, we're also very lucky to have products like MySQL HeatWave, the blazing-fast in-memory query accelerator, which increases MySQL performance by orders of magnitude for analytics and mixed workloads, all without any changes to your existing applications. 00;07;35;16 - 00;08;00;20 Really, no other cloud provider is going to give you that much choice in the data warehouse bucket and managed open-source components. So, from what I understand, Lakehouse has options for all types of data, but what about understanding and managing the metadata of data sources? The OCI Data Catalog captures whether you're building a schema, building a query from ADW, or building a table that you want to query from a Spark job. 00;08;00;23 - 00;08;26;13 And all that data definition goes into the OCI data catalog. So, wherever this data goes, you'll be able to access it. The data catalog is the source of truth for object store metadata and can regularly harvest the information from the data sources. It also manages the business glossary, providing consistent terms and tags for your data. Discovery of data is a powerful search feature to discover new data sets entirely. 00;08;26;15 - 00;08;50;08 Even with all these capabilities, there are still more being added or enhanced over time. For example, now with OCI Data Flow, you have a serverless Spark service. You can build a Spark job that makes sense from some unstructured data and include it as a part of the Oracle Lakehouse. Enterprises are moving to data flow because you can write, decode, and execute code, and focus on the application, because the challenging part of where this is running is handled through the service components of the Oracle Lakehouse. 00;08;50;11 - 00;09;13;06 I think what we all want, Greg, is faster insights on our data, right? As you put everything together into this architecture. The key thing is that you want to be able to write data once and then combine it with other previously written data, move it around, combine it here and there, and analyze. 00;09;13;09 - 00;09;36;26 So, we have a way to store both structured and unstructured data. 
You have the object store for unstructured data and write your structured data to a relational database, perhaps MySQL or Oracle database, and you can then leverage the Oracle Data Catalog to have a single way to understand and tag your data. Oracle Data Lakehouse is an open and collaborative approach. 00;09;36;28 - 00;10;04;22 It stores all data in an order that's easy to understand and analyze through a variety of services, as well as AI tools. OCI can accelerate your solution development for your most common Data Lakehouse workloads. You can easily get started from where you are today, and often without writing any new code whatsoever. Within each path, we can work with you at Oracle to highlight the investments we've made that will help accelerate your own Lakehouse transformation. 00;10;04;24 - 00;10;39;11 The Oracle Data Lakehouse is the premier solution for transforming data into better, more profitable business decisions. Remember, it's not just your architecture that's powerful. With Oracle Lakehouse, you can help combine the architecture, data sets, services, and tools across your entire technical landscape into something more valuable than just the sum of its parts. Thank you so much, Greg, for sharing your expertise with us. To learn more about Oracle Data Lakehouse, please visit mylearn.oracle.com and take a look at our Oracle Cloud Data Management Foundations Workshop. 00;10;39;18 - 00;11;04;14 That brings us to the end of this season. Thank you for being with us on this journey. We're very excited about our upcoming season, which will be dedicated to Cloud Applications Business Process training. Until next time, this is Lois Houston and Nikita Abraham signing off. That's all for this episode of the Oracle University Podcast. If you enjoyed listening, please click Subscribe to get all the latest episodes. 00;11;04;16 - 00;13;38;07 We'd also love it if you would take a moment to rate and review us on your podcast app. See you again on the next episode of the Oracle University Podcast.
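For a concrete feel of the "query the lake in place" pattern Greg describes, here is a minimal sketch using the python-oracledb driver and the DBMS_CLOUD package in Autonomous Database. It assumes an object-store credential has already been created; the connection details, credential name, bucket URL, and columns are illustrative assumptions, not anything from the episode.

```python
# A rough sketch: expose Parquet files in object storage to the
# warehouse as an external table via DBMS_CLOUD. Assumes
# `pip install oracledb` plus a configured wallet/DSN; the names
# below are illustrative placeholders.
import oracledb

conn = oracledb.connect(user="admin", password="<password>", dsn="mydb_high")
cur = conn.cursor()

# Map the lake files into the database; the data stays in object
# storage and is read on demand, so there is no duplicate copy.
cur.execute("""
    BEGIN
      DBMS_CLOUD.CREATE_EXTERNAL_TABLE(
        table_name      => 'SALES_LAKE',
        credential_name => 'OBJ_STORE_CRED',
        file_uri_list   => 'https://objectstorage.region.oraclecloud.com/n/ns/b/bucket/o/sales/*.parquet',
        format          => json_object('type' value 'parquet'),
        column_list     => 'SALE_ID NUMBER, AMOUNT NUMBER, SOLD_AT TIMESTAMP'
      );
    END;
""")

# From here, lake data joins warehouse tables with ordinary SQL.
cur.execute("SELECT COUNT(*) FROM sales_lake")
print(cur.fetchone())
```

This is the lakehouse glue the episode keeps returning to: structured data in the Autonomous Database, raw data in object storage, and one SQL surface across both.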
Summary Data pipelines are the core of every data product, ML model, and business intelligence dashboard. If you're not careful you will end up spending all of your time on maintenance and fire-fighting. The folks at Rivery distilled the seven principles of modern data pipelines that will help you stay out of trouble and be productive with your data. In this episode Ariel Pohoryles explains what they are and how they work together to increase your chances of success. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack (https://www.dataengineeringpodcast.com/rudderstack) This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold (https://www.dataengineeringpodcast.com/datafold) Your host is Tobias Macey and today I'm interviewing Ariel Pohoryles about the seven principles of modern data pipelines Interview Introduction How did you get involved in the area of data management? Can you start by defining what you mean by a "modern" data pipeline? At Rivery you published a white paper identifying seven principles of modern data pipelines: Zero infrastructure management ELT-first mindset Speaks SQL and Python Dynamic multi-storage layers Reverse ETL & operational analytics Full transparency Faster time to value What are the applications of data that you focused on while identifying these principles? How do the application of these principles influence the ability of organizations and their data teams to encourage and keep pace with the use of data in the business? What are the technical components of a pipeline infrastructure that are necessary to support a "modern" workflow? How do the technologies involved impact the organizational involvement with how data is applied throughout the business? When using managed services, what are the ways that the pricing model acts to encourage/discourage experimentation/exploration with data? What are the most interesting, innovative, or unexpected ways that you have seen these seven principles implemented/applied? What are the most interesting, unexpected, or challenging lessons that you have learned while working with customers to adapt to these principles? What are the cases where some/all of these principles are undesirable/impractical to implement? What are the opportunities for further advancement/sophistication in the ways that teams work with and gain value from data? 
Contact Info LinkedIn (https://www.linkedin.com/in/ariel-pohoryles-88695622/) Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story. To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers Links Rivery (https://rivery.io/) 7 Principles Of The Modern Data Pipeline (https://rivery.io/downloads/7-principles-modern-data-pipeline-lp/) ELT (https://en.wikipedia.org/wiki/Extract,_load,_transform) Reverse ETL (https://rivery.io/blog/what-is-reverse-etl-guide-for-data-teams/) Martech Landscape (https://chiefmartec.com/2023/05/2023-marketing-technology-landscape-supergraphic-11038-solutions-searchable-on-martechmap-com/) Data Lakehouse (https://www.forbes.com/sites/bernardmarr/2022/01/18/what-is-a-data-lakehouse-a-super-simple-explanation-for-anyone/?sh=54d5c4916088) Databricks (https://www.databricks.com/) Snowflake (https://www.snowflake.com/en/) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
This week we had a very special guest on the podcast: Matthew Lynley, one of the founding hosts of Equity and a former TechCruncher. Since his Equity days, Lynley went off and started his very own AI-focused publication called Supervised. We brought him back on the show to ask him questions in a format where we can all learn together. Here's what we got into: From Transformers to GPT-4: How attention became so critical inside of neural networks, and how transformers set the path for modern AI services. Recent acquisitions in the AI space, and what it means for the "LLM stack": With Databricks buying MosaicML and Snowflake already busy with its own checkbook, a lot of folks are working to build out a full-stack LLM data extravaganza. We talked about what that means. Where startups sit in the current AI race: While it's great to think about the majors, we also need to know what the startup angle is. The answer? It's a little early to say, but what is clear is that startups are taking some big swings at the industry and are hell-bent on snagging a piece of the pie. Thanks to everyone for hanging out with us. Equity is back on Friday for our weekly news roundup! For episode transcripts and more, head to Equity's Simplecast website. Equity drops at 7 a.m. PT every Monday, Wednesday and Friday, so subscribe to us on Apple Podcasts, Overcast, Spotify and all the casts. TechCrunch also has a great show on crypto, a show that interviews founders, one that details how our stories come together and more!
The Datanation Podcast - Podcast for Data Engineers, Analysts and Scientists
Alex Merced discusses some of the fallout from Databricks' UniForm (Universal Format) announcement, and the innovation the industry needs to unlock the data lakehouse. Follow me on Twitter @amdatalakehouse
Orchestrate all the Things podcast: Connecting the Dots with George Anadiotis
In an era of dried-up funding and Data Lakehouse vendor supremacy, Redpanda is going against the grain. The company just secured a $100 million Series C funding round to execute on an unconventional strategy. Redpanda Founder and CEO Alex Gallego explains how things work for the company. Article published on Orchestrate all the Things
This episode of Over the Edge features an interview between Matt Trifiro and Jay Limbasiya, Global AI & Data Science Business Development Lead at Dell Technologies. With a background in the United States Intelligence Community, Jay specializes in AI, data science, analytics, and data management. Matt and Jay discuss why we should care about the edge, the complexities of deciding where to process data, and the last mile problem. They also talk about the benefits of data lakehouses and the current debate around the dangers of AI. ---------Key Quotes: "You can use this data maybe ten years down the road for something else, right? You can use it for a completely different type of use case you never even thought of. And that's why the data is so valuable. It's not so much the algorithm itself, it's can you capture this data and even save this data for maybe technology that hasn't even come out yet?" "It definitely doesn't make sense to me to completely say, Hey, the data's too large at the edge, so we're just not gonna consume it… that obviously doesn't make sense. I think there's an efficiency play out here as well. And that efficiency play is, obviously it costs money to store data, so let's make sure we store the right data." "What's the value of the data that you're collecting and how can these pieces of data enhance the algorithm at the edge? Because even though you deploy something on the edge in order to refine it for future trends, maybe four years down the road, or three years down the road, you still need data to be able to kind of do that R and D work." "The core is actually where you identify, is there data I'm missing? Is there something I haven't even collected?" ---------Show Timestamps: (02:30) Jay's start in technology (07:01) Working in U.S. intelligence (09:00) Why should we care about the edge? (13:26) How to make decisions about managing and treating data (18:45) The Last Mile Problem (26:00) Data cleansing (29:00) Data analysis at the edge (30:00) Training the models and pushing them towards the edge (35:14) Data lakehouses (44:42) Jay's thoughts on the dangers of AI --------Sponsor: Over the Edge is brought to you by Dell Technologies to unlock the potential of your infrastructure with edge solutions. From hardware and software to data and operations, across your entire multi-cloud environment, we're here to help you simplify your edge so you can generate more value. Visit DellTechnologies.com/SimplifyYourEdge for more information, or click on the link in the show notes. --------Links: Follow Matt Trifiro on Twitter: https://twitter.com/mtrifiro Follow Matt Trifiro on LinkedIn: http://linkedin.com/in/mtrifiro Connect with Jay Limbasiya on LinkedIn: https://www.linkedin.com/in/jaylimbasiya/ www.CaspianStudios.com
Today's guest is the author of a popular Medium blog where he has recently been dissecting generative AI for technologists. I read his introduction to the transformer architecture and immediately realized our audience needs to meet him. A bit like great recent guest Ken Wenger, Pradeep makes complicated technology accessible. By day, Pradeep Menon is a CTO at Microsoft's digital natives division in APAC. He has had one of the best ground-floor views of generative AI since Microsoft first invested in OpenAI in 2019 and then again in March of this year. Pradeep was previously in similar roles at Alibaba and IBM. He speaks frequently on topics related to emerging tech, data, and AI to global audiences and is a published author. Listen and learn... What surprises Pradeep most about the capabilities of LLMs What most people don't understand about how LLMs like GPT are trained The difference between prompting and fine-tuning Why ChatGPT performs so well as a coding co-pilot How RLHF works How Bing uses grounding to mitigate the impact of LLM hallucinations How Pradeep uses ChatGPT to improve his own productivity How we should regulate AI What new careers AI is creating References in this episode... Ken Wenger on AI and the Future of Work Pradeep's book Data Lakehouse in Action D-ID speaking avatars
In today's episode, Luan Moreno, Mateus Oliveira, and Antony Lucas interview Dipankar Mazumdar, currently a Data Advocate at Dremio. Dremio is one of the best-known self-service SQL analytics technologies on the market, unifying the view of your data using the lingua franca of data: SQL. Aligned with Apache Iceberg, Dremio positions itself as an Open Data Lakehouse. With Apache Iceberg, you get the following benefits: data compaction; time travel; ACID transactions; hidden partitioning; built for multiple platforms. In this conversation we also cover the following topics: data engineering; Apache Iceberg; Dremio. Learn how Dremio and Iceberg together can provide yet another Data Lakehouse option, especially for cases where we work with different platforms for processing and exploring data. Dipankar Mazumdar = LinkedIn https://www.dremio.com/ https://iceberg.apache.org/ Luan Moreno = https://www.linkedin.com/in/luanmoreno/
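As a rough illustration of two of the Iceberg benefits listed above - hidden partitioning and time travel - here is a minimal PySpark sketch. The catalog name demo, the table name, and the timestamp are hypothetical, and a Spark session already configured with the Iceberg runtime and SQL extensions is assumed.

from pyspark.sql import SparkSession

# Assumes Spark was launched with the Iceberg runtime and a catalog
# named "demo" configured; both are illustrative assumptions.
spark = SparkSession.builder.appName("iceberg-sketch").getOrCreate()

# Hidden partitioning: the table is laid out by day(ts), but queries
# simply filter on ts and still get partition pruning.
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo.db.events (
        id BIGINT,
        ts TIMESTAMP,
        payload STRING
    ) USING iceberg
    PARTITIONED BY (days(ts))
""")

# Time travel: read the table as of an earlier point in time.
spark.sql("""
    SELECT * FROM demo.db.events
    TIMESTAMP AS OF '2023-01-01 00:00:00'
""").show()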
Sagar Lad is a Technical Solution Architect with a leading multinational software company and has deep expertise in implementing Data & Analytics solutions for large enterprises using Cloud and Artificial Intelligence. He is an experienced Azure Platform evangelist with 9+ years of IT experience and a strong focus on driving cloud adoption for enterprise organizations using Microsoft Cloud Solutions & Offerings. He loves blogging and is an active blogger on Medium, LinkedIn, and the C# Corner developer community. He was awarded the C# Corner MVP in September 2021 for his contributions to the developer community. He's also the author of three books: Mastering Databricks Lakehouse Platform, Azure Security for Critical Workloads, and Hands-On Azure Data Platform. Topics of Discussion: [2:57] Sagar talks about the critical points in his career that led him to technology. [6:01] What turned Sagar on to a love of data? [8:39] With so much technical jargon out there, how do you simplify? [12:40] What is a Data Lakehouse? [13:25] What are some common scenarios where a Data Lakehouse can be really valuable? [18:53] What does unit testing mean in the Databricks world? [22:10] How long does it take to run the tests in Azure? [25:42] What's the most expensive Databricks environment that Sagar has seen on a monthly basis? [27:54] What are some of the things that are being missed around the industry? [31:42] Sagar says that when we talk about security, there are seven layers. Mentioned in this Episode: Clear Measure Way Architect Forum Software Engineer Forum Programming with Palermo — New Video Podcast! Email us programming@palermo.network Clear Measure, Inc. (Sponsor) .NET DevOps for Azure: A Developer's Guide to DevOps Architecture the Right Way, by Jeffrey Palermo — Available on Amazon! Jeffrey Palermo's Twitter — Follow to stay informed about future events! Architect Tips — Video podcast! Azure DevOps .NET Clear Measure Architect Forum Sagar Lad books on Amazon Certifications: Sagar Lad on Credly LinkedIn: Sagar Lad on LinkedIn Twitter: @AzureSagar (Twitter: Sagar Lad) Medium: Sagar Lad on Medium Want to Learn More? Visit AzureDevOps.Show for show notes and additional episodes.
Today's successful enterprises know that data products are crucial. And the data lakehouse is increasingly viewed as the best of both worlds: combining the speed and accuracy of a data warehouse with the size and scope of a data lake. But there has been one missing ingredient: a low-code solution for building platform-agnostic pipelines. Check out this episode of DM Radio to learn from renowned analyst Merv Adrian, who will deliver a presentation on the evolution of data. He'll be joined by Databricks guru Simon Whiteley of Advancing Analytics, and Raj Bains, CEO of Prophecy.io. They'll discuss the importance of enabling business users of all skill levels to visualize and implement highly performant data pipelines with auto-generated open source code, thus avoiding vendor lock-in.
In today's episode, Luan Moreno and Mateus Oliveira interview Denny Lee & Mathew Powers, currently Developer Advocates at Databricks. Delta Lake is an open-source product that lets us implement the famous Data Lakehouse {Data Lake + Data Warehouse}, developed by the company founded by the creators of Apache Spark. Delta Lake addresses Apache Spark's storage problem, handling data in the Data Lake in an optimized way. With Delta Lake, you get the following benefits: a file format that behaves like a table; time travel; ACID transactions; unified batch and streaming. In this conversation we also cover the following topics: the state of the art in data; Delta Lake. Learn more about Delta Lake and how to use it as a Data Lakehouse technology, together with the Databricks team that does the most to drive the community forward with content, releases, and events supporting this open-source product. Denny Lee - LinkedIn Mathew Powers - LinkedIn https://delta.io/ Luan Moreno = https://www.linkedin.com/in/luanmoreno/
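A minimal sketch of the "unified batch and streaming" benefit mentioned above: the same Delta table is written as a batch and then consumed as a stream. The local paths are placeholders, and a Spark installation with the delta-spark package is assumed.

from pyspark.sql import SparkSession

# These two settings enable Delta Lake in the session (the delta-spark
# package is assumed to be installed).
spark = (SparkSession.builder
         .appName("delta-sketch")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

# Batch: append rows to a Delta table (path is a stand-in).
spark.range(0, 5).write.format("delta").mode("append").save("/tmp/events_delta")

# Streaming: read the very same table as a stream; each new commit
# arrives as a micro-batch, with no separate ingestion system.
query = (spark.readStream.format("delta").load("/tmp/events_delta")
         .writeStream.format("console")
         .option("checkpointLocation", "/tmp/events_ckpt")
         .start())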
Summary All of the advancements in our technology are based around the principles of abstraction. These are valuable until they break down, which is an inevitable occurrence. In this episode the host Tobias Macey shares his reflections on recent experiences where the abstractions leaked, and some observations on how to deal with that situation in a data platform architecture. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enables you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudderstack (https://www.dataengineeringpodcast.com/rudderstack) Your host is Tobias Macey and today I'm sharing some thoughts and observations about abstractions and impedance mismatches from my experience building a data lakehouse with an ELT workflow Interview Introduction Topics covered: the impact of community tech debt; the Hive metastore (new work being done but not widely adopted); tensions between automation and correctness; data type mapping (integer types, complex types; a short illustrative sketch follows after this episode's links); naming things (keys/column names from APIs to databases); disaggregated databases - pros and cons (flexibility and cost control, but not as much tooling investment vs. Snowflake/BigQuery/Redshift); data modeling (dimensional modeling vs. answering today's questions) What are the most interesting, unexpected, or challenging lessons that you have learned while working on your data platform? When is ELT the wrong choice? What do you have planned for the future of your data platform? Contact Info LinkedIn (https://www.linkedin.com/in/tmacey/) Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story.
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers Links dbt (https://www.getdbt.com/) Airbyte (https://airbyte.com/) Podcast Episode (https://www.dataengineeringpodcast.com/airbyte-open-source-data-integration-episode-173/) Dagster (https://dagster.io/) Podcast Episode (https://www.dataengineeringpodcast.com/dagster-software-defined-assets-data-orchestration-episode-309/) Trino (https://trino.io/) Podcast Episode (https://www.dataengineeringpodcast.com/presto-distributed-sql-episode-149/) ELT (https://en.wikipedia.org/wiki/Extract,_load,_transform) Data Lakehouse (https://www.forbes.com/sites/bernardmarr/2022/01/18/what-is-a-data-lakehouse-a-super-simple-explanation-for-anyone/?sh=5c0e333f6088) Snowflake (https://www.snowflake.com/en/) BigQuery (https://cloud.google.com/bigquery) Redshift (https://aws.amazon.com/redshift/) Technical Debt (https://en.wikipedia.org/wiki/Technical_debt) Hive Metastore (https://cwiki.apache.org/confluence/display/Hive/AdminManual+Metastore+Administration) AWS Glue (https://aws.amazon.com/glue/) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
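As flagged in the episode outline above, here is a small, purely illustrative sketch of one integer-type mapping pitfall; the values and schema are invented for the example, not taken from the episode. An upstream source produces 64-bit integers while a downstream schema was declared as 32-bit, and an explicit cast surfaces the mismatch instead of silently truncating.

import pyarrow as pa

# Upstream rows: the third id exceeds the int32 range.
table = pa.table({"id": [1, 2, 2**40]})
print(table.schema)  # id is inferred as int64

# Casting to a 32-bit schema fails loudly, which is exactly what you
# want at a type-mapping boundary between an API and a database.
try:
    table.cast(pa.schema([("id", pa.int32())]))
except pa.ArrowInvalid as err:
    print("cast failed:", err)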
The Datanation Podcast - Podcast for Data Engineers, Analysts and Scientists
Alex Merced helps explain how stats are collected and used when working with Parquet files and Apache Iceberg tables. Follow Alex on Twitter @amdatalakehouse
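For a feel of what those stats look like, here is a small pyarrow sketch that prints the per-column min/max/null-count statistics Parquet stores in each row group; engines (and Iceberg's manifest-level stats layered on top) use these to skip data at scan time. The file path is a placeholder.

import pyarrow.parquet as pq

# Any local Parquet file will do; the path is a placeholder.
pf = pq.ParquetFile("/tmp/events.parquet")
meta = pf.metadata

# Parquet keeps min/max/null-count per column chunk in every row group.
for rg in range(meta.num_row_groups):
    for col in range(meta.num_columns):
        chunk = meta.row_group(rg).column(col)
        stats = chunk.statistics
        if stats is not None:
            print(rg, chunk.path_in_schema, stats.min, stats.max, stats.null_count)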
This is episode 12 of the ThinkData podcast in partnership with Dataworks. In this episode, we welcomed Ari Kaplan from Databricks. Ari is the Head Evangelist responsible for increasing awareness of the Databricks product and has a fascinating background in sports analytics, where he spent several years with the Baltimore Orioles and Chicago Cubs. His career has seen him then work for the likes of Nielsen and DataRobot, where he has been at the forefront of Data and AI for over 15 years. We covered a lot, including - What does a week in the life of an evangelist look like, and what IS an evangelist? What took him to a career in sports analytics and then on to work for some of the most recognized data vendors in the world? Who are Databricks and what makes them such a force to be reckoned with? In a market with a new tool and vendor being released each day, what key considerations should organizations make before deciding what tools and tech stack to go with? What is behind the seemingly rapid rise and adoption of AI tools and techniques into the mainstream? What are Ari's predictions for the next 12/18 months in the world of Data & AI? His advice for anyone looking to break into the more commercial side of AI. Thank you Ari for a cracking discussion, and for anyone interested in finding out more about him and Dataworks, make sure you give him a follow on here.
MLOps Coffee Sessions #155 with Matei Zaharia, The Birth and Growth of Spark: An Open Source Success Story, co-hosted by Vishnu Rachakonda. // Abstract We dive deep into the creation of Spark, with the creator himself - Matei Zaharia, Chief Technologist at Databricks. This episode also explores the development of Databricks' other open source home run, MLflow, and the concept of "lakehouse ML". As a special treat, Matei talked to us about the details of the "DSP" (Demonstrate Search Predict) project, which aims to enable building applications by combining LLMs and other text-returning systems. // About the guest: Matei has the unique advantage of being able to see different perspectives, having worked in both academia and the industry. He listens carefully to people's challenges and excitement about ML and uses this to come up with new ideas. As a member of Databricks, Matei also has the advantage of applying ML to Databricks' own internal practices. He is constantly asking the question "What's a better way to do this?" // Bio Matei Zaharia is an Associate Professor of Computer Science at Stanford and Chief Technologist at Databricks. He started the Apache Spark project during his Ph.D. at UC Berkeley, and co-developed other widely used open-source projects, including MLflow and Delta Lake, at Databricks. At Stanford, he works on distributed systems, NLP, and information retrieval, building programming models that can combine language models and external services to perform complex tasks. Matei's research work was recognized through the 2014 ACM Doctoral Dissertation Award for the best Ph.D. dissertation in computer science, an NSF CAREER Award, and the US Presidential Early Career Award for Scientists and Engineers (PECASE). // MLOps Jobs board https://mlops.pallet.xyz/jobs // MLOps Swag/Merch https://mlops-community.myshopify.com/ // Related Links https://cs.stanford.edu/~matei/ https://spark.apache.org/ --------------- ✌️Connect With Us ✌️ ------------- Join our slack community: https://go.mlops.community/slack Follow us on Twitter: @mlopscommunity Sign up for the next meetup: https://go.mlops.community/register Catch all episodes, blogs, newsletters, and more: https://mlops.community/ Connect with Demetrios on LinkedIn: https://www.linkedin.com/in/dpbrinkm/ Connect with Vishnu on LinkedIn: https://www.linkedin.com/in/vrachakonda/ Connect with Matei on LinkedIn: https://www.linkedin.com/in/mateizaharia/ Timestamps: [00:00] Matei's preferred coffee [01:45] Takeaways [05:50] Please subscribe to our newsletters, join our Slack, and subscribe to our podcast channels! [06:52] Getting to know Matei as a person [09:10] Spark [14:18] Open and freewheeling cross-pollination [16:35] Actual formation of Spark [20:05] Spark and MLflow Similarities and Differences [24:24] Concepts in MLflow [27:34] DJ Khalid of the ML world [30:58] Data Lakehouse [33:35] Stanford's unique culture of the Computer Science Department [36:06] Starting a company [39:30] Unique advice to grad students [41:51] Open source project [44:35] LLMs in the New Revolution [47:57] Type of company to start with [49:56] Emergence of Corporate Research Labs [53:50] LLMs size context [54:44] Companies to respect [57:28] Wrap up
More and more organizations are working (or want to work) with their data. Extracting value from data produces new insights, helps optimize products, and helps innovate and automate business processes. To work with data, organizations often use a data warehouse or a data lake. For some time now, the data lakehouse has joined them. When should you choose a lakehouse, and when is a data lake or data warehouse enough? In this episode of Techzine Talks, we are joined by Ivo Everts. He is a strategic architect at Databricks, the company that coined the term and definition of the lakehouse and brought it to market. That makes Everts the ideal person to explain what exactly a lakehouse is and how it differs from a data lake and a data warehouse. This episode covers structured data, unstructured data, how to extract insights from them, and how to apply machine learning. We also ask which audiences these solutions are intended for: is the data analyst the target group, or should a business professional in sales or marketing also be able to work with them? Listen to this episode of Techzine Talks now.
In the market for 33 years and with more than 2 terabytes of customer information, Seguros Unimed has invested in tools to power analytics and ensure the best experience for its customers. The most recent step in this journey was a partnership with AWS to process and serve data from a data lakehouse, used by business areas to innovate on several fronts. Handling data across five business areas - Health, Life, Dental, Pension (open and closed), and Property & Casualty (including property insurance and medical liability) - Seguros Unimed identified the need to improve the management of its resources and to comply with Brazil's General Data Protection Law (LGPD), using state-of-the-art technology combined with open source tools, known for their more attractive costs. With the project, the company began using data in a more centralized and standardized way, reducing costs - one of the project's main goals - while capturing the other gains possible and remaining compliant with the LGPD, a significant improvement in governance. "The project opened doors for us to offer, for example, smarter, tailor-made pricing for the beneficiary, so the customer feels they have the right insurance for them," explains Wilson Leal, executive director of market and technology at Seguros Unimed. In this episode of the 'E agora, TI?' podcast, the first of the 2023 season, Leal reveals details of the project, along with cost-reduction figures and improvements for customers.
It would appear Microsoft has run out of words to use for new features/products, so they are just going to recycle existing words for a bit of fun at our expense. In this episode we explore the Azure Data Lakehouse, which is not a data lake nor a traditional data warehouse; Books Online (BOL) refers to it as Databricks Lakehouse. Luke Moloney walks us through how combining the flexibility of data lake storage with ACID transactions and data governance gives organizations looking to analyze their data a new option. The hope is that building, storing, and analyzing data will be cheaper and more approachable for organizations who don't want to go with the traditional data warehouse model. This still feels like an Enterprise feature to me, but let me know if your organization would be interested in this approach. As always, special thanks to Luke and the folks at Microsoft for making themselves available to us. The show notes and video for today's episode can be found at https://sqldatapartners.com/2023/03/01/episode-261-the-data-lakehouse. Have fun on the SQL Trail!
Highlights from this week's conversation include: Alex's background in the data space (2:41) Comics and pop culture blending with finance training (5:20) What is a data lakehouse? (7:36) What is Dremio solving for users? (11:21) Essential components of a data lakehouse (16:35) Difference between on-prem and cloud experiences (33:53) What does it mean to be a developer advocate? (41:31) Final thoughts and takeaways (49:02) The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data. RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Summary Cloud data warehouses have unlocked a massive amount of innovation and investment in data applications, but they are still inherently limiting. Because of their complete ownership of your data, they constrain the possibilities of what data you can store and how it can be used. Projects like Apache Iceberg provide a viable alternative in the form of data lakehouses that provide the scalability and flexibility of data lakes, combined with the ease of use and performance of data warehouses. Ryan Blue helped create the Iceberg project, and in this episode he rejoins the show to discuss how it has evolved and what he is doing in his new business Tabular to make it even easier to implement and maintain. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management Hey there podcast listener, are you tired of dealing with the headache that is the 'Modern Data Stack'? We feel your pain. It's supposed to make building smarter, faster, and more flexible data infrastructures a breeze. It ends up being anything but that. Setting it up, integrating it, maintaining it—it's all kind of a nightmare. And let's not even get started on all the extra tools you have to buy to get it to do its thing. But don't worry, there is a better way. TimeXtender takes a holistic approach to data integration that focuses on agility rather than fragmentation. By bringing all the layers of the data stack together, TimeXtender helps you build data solutions up to 10 times faster and saves you 70-80% on costs. If you're fed up with the 'Modern Data Stack', give TimeXtender a try. Head over to timextender.com/dataengineering where you can do two things: watch us build a data estate in 15 minutes and start for free today. Your host is Tobias Macey and today I'm interviewing Ryan Blue about the evolution and applications of the Iceberg table format and how he is making it more accessible at Tabular Interview Introduction How did you get involved in the area of data management? Can you describe what Iceberg is and its position in the data lake/lakehouse ecosystem? Since it is fundamentally a specification, how do you manage compatibility and consistency across implementations? What are the notable changes in the Iceberg project and its role in the ecosystem since our last conversation in October of 2018? Around the time that Iceberg was first created at Netflix, a number of alternative table formats were also being developed. What are the characteristics of Iceberg that lead teams to adopt it for their lakehouse projects? Given the constant evolution of the various table formats, it can be difficult to determine an up-to-date comparison of their features, particularly earlier in their development. What are the aspects of this problem space that make it so challenging to establish unbiased and comprehensive comparisons? For someone who wants to manage their data in Iceberg tables, what does the implementation look like? How does that change based on the type of query/processing engine being used? Once a table has been created, what are the capabilities of Iceberg that help to support ongoing use and maintenance? (see the short maintenance sketch after this episode's links) What are the most interesting, innovative, or unexpected ways that you have seen Iceberg used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Iceberg/Tabular? When is Iceberg/Tabular the wrong choice? What do you have planned for the future of Iceberg/Tabular?
Contact Info LinkedIn (https://www.linkedin.com/in/rdblue/) rdblue (https://github.com/rdblue) on GitHub Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story. To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers Links Iceberg (https://iceberg.apache.org/) Podcast Episode (https://www.dataengineeringpodcast.com/iceberg-with-ryan-blue-episode-52/) Hadoop (https://hadoop.apache.org/) Data Lakehouse (https://www.forbes.com/sites/bernardmarr/2022/01/18/what-is-a-data-lakehouse-a-super-simple-explanation-for-anyone/) ACID == Atomic, Consistent, Isolated, Durable (https://en.wikipedia.org/wiki/ACID) Apache Hive (https://hive.apache.org/) Apache Impala (https://impala.apache.org/) Bodo (https://www.bodo.ai/) Podcast Episode (https://www.dataengineeringpodcast.com/bodo-parallel-data-processing-python-episode-223/) StarRocks (https://www.starrocks.io/) Dremio (https://www.dremio.com/) Podcast Episode (https://www.dataengineeringpodcast.com/dremio-open-data-lakehouse-episode-333/) DDL == Data Definition Language (https://en.wikipedia.org/wiki/Data_definition_language) Trino (https://trino.io/) PrestoDB (https://prestodb.io/) Apache Hudi (https://hudi.apache.org/) Podcast Episode (https://www.dataengineeringpodcast.com/hudi-streaming-data-lake-episode-209/) dbt (https://www.getdbt.com/) Apache Flink (https://flink.apache.org/) TileDB (https://tiledb.com/) Podcast Episode (https://www.dataengineeringpodcast.com/tiledb-universal-data-engine-episode-146/) CDC == Change Data Capture (https://en.wikipedia.org/wiki/Change_data_capture) Substrait (https://substrait.io/) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
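As referenced in the interview outline above, here is a hedged sketch of the kind of ongoing Iceberg table maintenance the episode touches on, using two of Iceberg's built-in Spark procedures: snapshot expiration and small-file compaction. The catalog and table names are invented, and the session is assumed to have the Iceberg SQL extensions enabled.

from pyspark.sql import SparkSession

# Assumes the Iceberg runtime, SQL extensions, and a catalog named
# "demo" are configured; all names here are illustrative.
spark = SparkSession.builder.appName("iceberg-maintenance").getOrCreate()

# Expire old snapshots to reclaim storage and bound metadata growth.
spark.sql("""
    CALL demo.system.expire_snapshots(
        table => 'db.events',
        older_than => TIMESTAMP '2023-01-01 00:00:00'
    )
""")

# Compact small files so scans read fewer, larger files.
spark.sql("CALL demo.system.rewrite_data_files(table => 'db.events')")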
Summary Five years of hosting the Data Engineering Podcast has provided Tobias Macey with a wealth of insight into the work of building and operating data systems at a variety of scales and for myriad purposes. In order to condense that acquired knowledge into a format that is useful to everyone Scott Hirleman turns the tables in this episode and asks Tobias about the tactical and strategic aspects of his experiences applying those lessons to the work of building a data platform from scratch. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode (https://www.dataengineeringpodcast.com/linode) today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show! Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan's active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan (https://www.dataengineeringpodcast.com/atlan) today to learn more about how Atlan's active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos. Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you're not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo (http://www.dataengineeringpodcast.com/montecarlo) to learn more. Your host is Tobias Macey and today I'm being interviewed by Scott Hirleman about my work on the podcasts and my experience building a data platform Interview Introduction How did you get involved in the area of data management? 
Data platform building journey: why are you building, and who are the users/use cases; how to focus on doing what matters over cool tools; how to build a good UX; anything surprising, or did you discover anything you didn't expect at the start; how to build so it's modular and can be improved in the future; general build vs. buy and the vendor selection process; obviously have a good BS detector - how can others build theirs; so many tools, where do you start - capability need, vendor suite offering, etc.; anything surprising in doing much of this at once; how do you think about TCO in build versus buy; any advice. Guest call out: be brave, believe you are good enough to be on the show; look at past episodes and don't pitch the same as what's been on recently; and vendors, be smart, work with your customers to come up with a good pitch for them as guests... Tobias' advice and learnings from building out a data platform: Advice: when considering a tool, start from what you are actually trying to do. Yes, everyone has tools they want to use because they are cool (or some resume-driven development). Once you have a potential tool, is the capability you want to use an unloved feature or a main part of the product? If it's a feature, will they give it the care and attention it needs? Advice: lean heavily on open source. You can fix things yourself and better direct the community's work than just filing a ticket and hoping with a vendor. Learning: there are likely going to be some painful pieces missing, especially around metadata, as you build out your platform. Advice: build in a modular way and think of "what is my escape hatch?" Yes, you have to lock yourself in a bit, but build with the possibility of a vendor or a tool going away - whether that is your choice (e.g. too expensive) or it literally disappears (anyone remember FoundationDB?). Learning: be prepared for tools to connect with each other but the connection to not be as robust as you want. Again, be prepared to have metadata challenges especially. Advice: build your foundation to be strong. This will limit pain as things evolve and change. You can't build a large building on a bad foundation - or at least it's a BAD idea... Advice: spend the time to work with your data consumers to figure out what questions they want to answer. Then abstract that to build for general challenges instead of point solutions. Learning: it's easy to put data in S3, but it can be painfully difficult to query it. There's a missing piece as to how to store it for easy querying, not just the metadata issues. Advice: it's okay to pay a vendor to lessen pain. But becoming wholly reliant on them can put you in a bad spot. Advice: look to create paved-path / easy-path approaches. If someone wants to follow the preset path, it's easy for them. If they want to go their own way, more power to them, but it's not the data platform team's problem if it isn't working well. Learning: there will be places you didn't expect to bend - again, that metadata layer for Tobias - to get things done sooner. It's okay to not have the end platform built at launch; move forward and get something going. Advice: "one of the perennial problems in technology is the bias towards speed and action without necessarily understanding the destination." Really consider the path and whether you are creating a scalable and maintainable solution instead of pushing for speed to deliver something. Advice: consider building a buffer layer between upstream sources so if there are changes, it doesn't automatically break things downstream.
Tobias' data platform components: data lakehouse paradigm, Airbyte for data integration (chosen over Meltano), Trino/Starburst Galaxy for distributed querying, AWS S3 for the storage layer, AWS Glue for very basic metadata cataloguing, Dagster as the crucial orchestration layer, dbt (a small illustrative Dagster sketch follows after the links below) Contact Info LinkedIn (https://www.linkedin.com/in/scotthirleman/) Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story. To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers Links Data Mesh Community (https://datameshlearning.com/community/) Podcast (https://www.linkedin.com/company/80887002/admin/) OSI Model (https://en.wikipedia.org/wiki/OSI_model) Schemata (https://schemata.app/) Podcast Episode (https://www.dataengineeringpodcast.com/schemata-schema-compatibility-utility-episode-324/) Atlan (https://atlan.com/) Podcast Episode (https://www.dataengineeringpodcast.com/atlan-data-team-collaboration-episode-179/) OpenMetadata (https://open-metadata.org/) Podcast Episode (https://www.dataengineeringpodcast.com/openmetadata-universal-metadata-layer-episode-237/) Chris Riccomini (https://daappod.com/data-mesh-radio/devops-for-data-mesh-chris-riccomini/) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
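As promised above, here is a rough Dagster sketch of how an orchestration layer can tie an extract-and-load step to a downstream transformation. The asset names and logic are hypothetical stand-ins for illustration, not Tobias' actual pipeline.

from dagster import Definitions, asset

@asset
def raw_events():
    # Stand-in for an Airbyte-style extract-and-load step.
    return [{"id": 1, "source": "api"}, {"id": 2, "source": "api"}]

@asset
def cleaned_events(raw_events):
    # Stand-in for a downstream dbt/Trino transformation; Dagster wires
    # the dependency from the parameter name.
    return [e for e in raw_events if e["id"] is not None]

defs = Definitions(assets=[raw_events, cleaned_events])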
What is a Data Lakehouse? Combining the advantages of a data warehouse with those of a data lake = Data Lakehouse
What is the data lakehouse? What are its benefits and considerations? Should every organization invest in one? Sanjeev Mohan is the Principal at SanjMo and a former Gartner Research VP for Big Data and Advanced Analytics.
In this episode we talk about Dremio, the open-source project that describes itself as The Data Lake Engine: a tool that lets you integrate data from the most varied data sources. The project offers an architecture that integrates with relational databases, columnar stores, indexers, and other source types. Today we welcome Alex Merced, Developer Advocate at Dremio and Data Lakehouse evangelist, who shared his deep knowledge of the subject with us. Dremio = The Easy and Open Data Lakehouse Luan Moreno = https://www.linkedin.com/in/luanmoreno/
The Six Five "On The Road" at Cloudera #EvolveNYC. Hosts Daniel Newman and Patrick Moorhead are joined by Wim Stoop, Sr. Director, Hybrid Data Platform & David Dichmann, Sr. Director, Data Warehouse & Lakehouse, Cloudera. They discuss the advancements & advantages of a portable, Hybrid Data Lakehouse. Disclaimer: The Six Five Webcast is for information and entertainment purposes only. Over the course of this webcast, we may talk about companies that are publicly traded and we may even reference that fact and their equity share price, but please do not take anything that we say as a recommendation about what you should do with your investment dollars. We are not investment advisors and we do not ask that you treat us as such.
The Six Five is "On The Road" at Cloudera #EvolveNYC. Hosts Daniel Newman and Patrick Moorhead are joined by Ram Venkatesh, CTO & Bill Zhang, Sr. Director, Product Management, Iceberg at Cloudera. They discuss Apache Iceberg and its benefits in a hybrid cloud-native environment, specifically with #analytics. Disclaimer: The Six Five Webcast is for information and entertainment purposes only. Over the course of this webcast, we may talk about companies that are publicly traded and we may even reference that fact and their equity share price, but please do not take anything that we say as a recommendation about what you should do with your investment dollars. We are not investment advisors and we do not ask that you treat us as such.
The "lakehouse" architecture balances the scalability and flexibility of data lakes with the ease of use and transaction support of data warehouses. Dremio is one of the companies leading the development of products and services that support the open lakehouse. In this episode Jason Hughes explains what it means for a lakehouse to be "open" and describes the different components that the Dremio team build and contribute to.
The core of any data platform is the centralized storage and processing layer. For many that is a data warehouse, but in order to support a diverse and constantly changing set of uses and technologies the data lakehouse is a paradigm that offers a useful balance of scale and cost, with performance and ease of use. In order to make the data lakehouse available to a wider audience the team at Iomete built an all-in-one service that handles management and integration of the various technologies so that you can worry about answering important business questions. In this episode Vusal Dadalov explains how the platform is implemented, the motivation for a truly open architecture, and how they have invested in integrating with the broader ecosystem to make it easy for you to get started.
Remember when Hadoop was predicted to replace the data warehouse? How'd that work out for Hadoop? Data Warehousing is doing just fine, and has evolved in a variety of customer-friendly ways in the last few years. It can also play nice with data science and data lakehouses, as well as modern data pipelines and other interesting analytics architectures. There are even hyperscale data warehouses these days! Find out more on this episode of DM Radio, as Host Eric Kavanagh interviews veteran analyst Philip Russom, along with Tyler Owen of Teradata, and Chris Gladwin of Ocient.
Highlights from this week's conversation include: Kyle's background and career journey (2:38) Unique challenges in building data engineering products (9:33) The problem set Databricks resolves (13:46) About Onehouse (17:15) From Microsoft to Onehouse (20:59) Why there's so much distance between data powers (24:45) Why the data lake is not enough (30:15) Who should have a lakehouse (39:03) Why we have all three data platforms (43:53) How to step into the data lakehouse world (49:48) The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data. RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
The Datanation Podcast - Podcast for Data Engineers, Analysts and Scientists
Alex Merced discusses the many angles from which a data engineer should think about engineering data for better performance. Read more at dremio.com/subsurface
Join Alex Merced and Bob Haffner for a discussion about the Open Data Lakehouse concept #data #dataengineering #datalake #datalakehouse Connect with Alex Twitter - @amdatalakehouse Connect with Bob Twitter - @bobhaffner LinkedIn - linkedin.com/in/bobhaffner Show notes The DataNation Podcast Available on iTunes/Spotify/Stitcher The Subsurface Data Lakehouse Community dremio.com/subsurface Dremio dremio.com Follow the podcast on Twitter @EngSideOfData
Highlights from this week's conversation include: Vinoth's background and career journey (3:08) Defining "data lakehouse" (5:10) Databricks versus lakehouses (13:37) The services a lakehouse needs (17:37) How to communicate technical details (26:55) Onehouse's product vision (31:41) Lakehouse performance versus BigQuery solutions (36:44) How to deliver customer experience equally (40:17) How to start building a lakehouse (44:00) Big tech's effect on smaller lakehouses (55:33) Skipping the data warehouse (1:04:39) The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data. RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Data Lakehouse - what is it, do you need it, and what do you need to know about it? Andreas and Carsten also discuss the M&A news and the latest BARC studies, what has been going on in the BI market over the last two months, and which important events are still coming up this year! ⪧ Studies • The Planning Survey 22 • Driving Innovation with AI. Getting Ahead with DataOps and MLOps • BARC Score Enterprise BI & Analytics Platforms • BARC Score Analytics for Business Users • CFO study (New Value for the CFO): consolidation makes room for integrated group financial reporting ⪧ Events • BI or DIE Level Up - Part II • BI or DIE on tour • Data Festival • Big Data & AI World • BARC Future of SAP Data & Analytics • DATA Festival #online
In this episode of Q.pod, learn how to improve your business performance and gain a competitive advantage by modernizing your data management and analytics with a Data Lakehouse.
In this episode we bring in specialist Pedro Toledo to talk about his experience with the most widely used Big Data technology in the world. We discuss the following topics: the importance of Apache Spark and its use cases; the learning curve; programming languages; common problems; dbt vs. Apache Spark and the modern data stack; Delta Lake and the Data Lakehouse; tips for beginners. The main goal is to show a data engineer how powerful an analytics tool Apache Spark is and how it can be used to solve problems in the Big Data space. On YouTube we have a Data Engineering channel covering the most important topics in the field, with live streams every Wednesday. https://www.youtube.com/channel/UCnErAicaumKqIo4sanLo7vQ Want to stay on top of this field with weekly posts and updates? Then follow along on LinkedIn so you don't miss any news. https://www.linkedin.com/in/luanmoreno/ Available on Spotify and Apple Podcasts https://open.spotify.com/show/5n9mOmAcjra9KbhKYpOMqY https://podcasts.apple.com/br/podcast/engenharia-de-dados-cast/ Pedro Toledo's LinkedIn = https://www.linkedin.com/in/pedro-toledo/ Luan Moreno = https://www.linkedin.com/in/luanmoreno/
In this episode we bring in specialist Lucas Magalhães to talk about Big Data and Analytics projects on Google GCP. We discuss projects that can be easily implemented, as well as the best approaches and technologies for handling massive data processing. On YouTube we have a Data Engineering channel covering the most important topics in the field, with live streams every Wednesday. https://www.youtube.com/channel/UCnErAicaumKqIo4sanLo7vQ Want to stay on top of this field with weekly posts and updates? Then follow along on LinkedIn so you don't miss any news. https://www.linkedin.com/in/luanmoreno/ Available on Spotify and Apple Podcasts https://open.spotify.com/show/5n9mOmAcjra9KbhKYpOMqY https://podcasts.apple.com/br/podcast/engenharia-de-dados-cast/ Luan Moreno = https://www.linkedin.com/in/luanmoreno/
In this episode we bring in specialist Carlos Barbosa to talk about Big Data and Analytics projects on Amazon AWS: his key considerations and recommendations for building batch and streaming pipelines, and how to optimize resources and increase value by using the products more effectively. We also talk about implementation best practices, use cases, and the day-to-day of a data engineer working on the biggest cloud on the market today. On YouTube we have a Data Engineering channel covering the most important topics in the field, with live streams every Wednesday. https://www.youtube.com/channel/UCnErAicaumKqIo4sanLo7vQ Want to stay on top of this field with weekly posts and updates? Then follow along on LinkedIn so you don't miss any news. https://www.linkedin.com/in/luanmoreno/ Luan Moreno = https://www.linkedin.com/in/luanmoreno/
This episode features an interview with Tomer Shiran, Founder and Chief Product Officer at Dremio. Dremio is a high-performance SQL lakehouse platform that helps companies get more from their data in the fastest way possible. Prior to Dremio, Tomer served as VP of Product at MapR and also held product management and engineering roles at Microsoft and IBM Research. He also has a master's degree from Carnegie Mellon University as well as a bachelor's from Technion - Israel Institute of Technology. In this episode, Tomer and Sam dive into the economics of storing data, how to build an open architecture, and what exactly a data lakehouse is. ------------------- "I think in the world of data lakes and lakehouses, the model has shifted upside down. Now, instead of bringing the data into the engines, you're actually bringing the engines to the data. So you have this open data tier built on open source technology. The data is represented in open source formats and stored in the company's S3 account or Azure storage account. And then you can use a variety of engines. We at Dremio, we take pride in building the best SQL engine to use on the data. There are different streaming engines, like Spark and Flink. There are different batch processing and machine learning engines. Spark is an example of that as well that companies can use on that same data. And I think that's one of the really important things from a cost standpoint, too, is that this really lowers your overall costs, both today and also in the future as you scale." – Tomer Shiran ------------------- Episode Timestamps: (02:04): What open source data means to Tomer (03:14): Tomer's motivation behind Apache Arrow (06:42): How Tomer solved data accessibility (08:43): The unit economics of storing data (14:31): Tomer's motivations for Iceberg and how it relates to Project Nessie (17:06): What is a data lakehouse? (18:31): What gives Dremio its magic? (23:39): What cloud data architecture will look like in 5 years (27:19): Advice for building an open data architecture ------------------- Links: LinkedIn - Connect with Tomer LinkedIn - Connect with Dremio Twitter - Follow Tomer Twitter - Follow Dremio Visit Dremio Get started with Dremio
On The Cloud Pod this week, the team wishes for time-traveling data. Also, GCP announces Data Lakehouse, Azure hosts Ignite 2021, and Microsoft is out for the metaverse. A big thanks to this week's sponsors: Foghorn Consulting, which provides full-stack cloud solutions with a focus on strategy, planning and execution for enterprises seeking to take advantage of the transformative capabilities of AWS, Google Cloud and Azure. JumpCloud, which offers a complete platform for identity, access, and device management — no matter where your users and devices are located. This week's highlights
What is a Data Lakehouse? It sounds like just another fad, but it isn't: it's a new way of building a platform that eases and democratizes access to data from the moment it is created. Cool, right? This and many other discussions ran through episode 44, with Grupo Boticário's data engineering experts. We brought in GB's leading references in data engineering and architecture to school us: Robson Mendonça (Senior Data Engineering Manager), Edson Junior (Data Engineering Manager), and Marcus Bittencourt (Data Architecture and Platform Manager). See the episode links in our Medium post: https://medium.com/data-hackers/data-lakehouse-e-muito-mais-no-grupo-botic%C3%A1rio-data-hackers-podcast-44-481148fd5775
Delta Lake is a storage engine optimized for building Big Data and Analytics projects, designed especially for Apache Spark. The engine was created to store large amounts of data (data lake) and also to organize data in the form of tables (data warehouse), so queries against this file format can be indexed efficiently. On top of that, several features were added - ACID transactions, time travel, auditing, DML operations (insert, update, delete, and merge), and other capabilities that are valuable when operating on large masses of data. Luan Moreno = https://www.linkedin.com/in/luanmoreno/
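A minimal sketch of the DML and time-travel features described above, using the delta-spark Python API; the table path, the rows, and the version number are illustrative assumptions, and the Spark session is assumed to already be configured for Delta Lake.

from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # Delta-enabled session assumed

# DML: upsert a batch of changes into an existing Delta table (MERGE).
target = DeltaTable.forPath(spark, "/tmp/events_delta")
updates = spark.createDataFrame([(1, "updated"), (3, "new")], ["id", "payload"])

(target.alias("t")
 .merge(updates.alias("u"), "t.id = u.id")
 .whenMatchedUpdateAll()      # update rows whose id already exists
 .whenNotMatchedInsertAll()   # insert rows that are new
 .execute())

# Time travel: read the table as it was at an earlier version.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/events_delta")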
Data lakes have been gaining popularity alongside an increase in their sophistication and usability. Despite improvements in performance and data architecture, they still require significant knowledge and experience to deploy and manage. In this episode, Vikrant Dubey discusses his work on the Cuelake project, which allows data analysts to build a lakehouse with SQL queries. By building on top of Zeppelin, Spark, and Iceberg, he and his team at Cuebook have built an autoscaled, cloud-native system that abstracts away the underlying complexity.
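As a rough illustration of the SQL-only lakehouse workflow Cuelake enables, the sketch below creates and queries an Apache Iceberg table through Spark SQL alone, assuming the iceberg-spark-runtime jar is on the classpath; the catalog name, warehouse path, and schema are made up for the example and are not Cuelake's actual configuration.

```python
# A hedged sketch of a SQL-driven lakehouse on Spark + Iceberg. The "local"
# catalog, warehouse path, and table schema are hypothetical examples.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("iceberg-sql")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

# Everything below is plain SQL, the interface a tool like Cuelake exposes
# to analysts, while Spark and Iceberg handle storage and execution.
spark.sql("CREATE TABLE IF NOT EXISTS local.db.events "
          "(id BIGINT, kind STRING) USING iceberg")
spark.sql("INSERT INTO local.db.events VALUES (1, 'click'), (2, 'view')")
spark.sql("SELECT kind, COUNT(*) FROM local.db.events GROUP BY kind").show()
```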
Would you like to understand the difference between a Data Warehouse and a Data Lakehouse, and go further to grasp what actually happens inside the companies that adopt these solutions? Orlando Marley is one of the leading specialists in this area, and with him we share tips on how to better understand these two paradigms and how you can combine the two solutions to deliver remarkable Analytics for your company. The Data Lakehouse is a new concept that is rapidly gaining traction, and to stand out as a data engineer you need to learn about it. Luan Moreno = https://www.linkedin.com/in/luanmoreno/
Databricks is leading a new data paradigm, the "Data Lakehouse", and the proposal is to build a platform using the best features of Data Warehouses and Data Lakes. Interesting, isn't it? To understand how this data architecture works in practice, we invited Pedro Leite Gusmão and Matheus Araujo Gava, both Data Engineers at everis Brasil, and Vanessa de Sousa Oliveira, Data Architect at everis Brasil.
Is the data lakehouse more than just a buzzword? Why has the term sprung up and are there any use cases where a lakehouse might make sense? Join Helena Schwenk and Graham Sharpe, director of strategic solutions at Exasol, as they explore the term and provide practical advice for data pros and organizations trying to get to grips with what it really means. If you want to explore more of our podcasts and extra talking points and resources, check out the DataXpresso homepage.
In this episode, Abby Strong and Nick Heudecker discuss sprawling monitoring and observability environments, as many teams have chased the promised "single pane of glass" nirvana, which usually isn't what materializes in reality.
What You'll Learn:
What the most common database types are
The differences between a Data Lakehouse and a Data Warehouse
Why you'd want to use object stores
If you want to get every episode of the Stream Life podcast automatically, you can subscribe on Apple Podcasts, Spotify, Pocket Casts, Overcast, Castro, RSS, or wherever you get your podcasts.
One of the most common questions in big data environments and data pipeline construction is truly understanding the differences between the many types of storage we can connect to in order to process data. In this episode, we tackle every type the market offers, showing the pros and cons of each so that you, as a builder, understand how each of these storage systems behaves. We also talk about the importance of the mindset, both for the professional and for the company, of not just storing data but processing it efficiently, maturely, and quickly. Understand the evolution of the Big Data and Analytics market and get to know the newest terms and technologies used to build data pipelines. Luan Moreno = https://www.linkedin.com/in/luanmoreno/
In this episode, Felipe and Mike talk about the evolution of the Data Warehouse into what is called a Data Lakehouse. What is it? Should every organization adopt it? Why? What are the technologies involved? They also end up opening a Pandora's box for future episodes.
Orchestrate all the Things podcast: Connecting the Dots with George Anadiotis
Data warehouses alone don't cut it. Data lakes alone don't cut it either. So whether you call it a data lakehouse or by any other name, you need the best of both worlds, says Databricks. A new query engine and a visualization layer are the next pieces in Databricks' puzzle. We connected with Ali Ghodsi, co-founder and CEO of Databricks, to discuss their latest news: the announcement of a new query engine called Delta Engine, and the acquisition of Redash, an open source visualization product. Our discussion started with the background on data lakehouses, the term Databricks is advocating to signify the coalescing of data warehouses and data lakes. We talked about trends such as multi-cloud and machine learning that are leading to a new reality, how data warehouses and data lakes work, and what the data lakehouse brings to the table. We also talked about Delta Engine and, of course, Redash, and we wrapped up with an outlook on Databricks' business growth. ZDNet article published in June 2020
A data warehouse is a system for performing fast queries on large amounts of data. A data lake is a system for storing high volumes of data in a format that is slow to access. A typical workflow for a data engineer is to pull data sets from this slow data lake storage into the data warehouse for faster querying. Apache Spark is a system for fast processing of data across distributed datasets. Spark is not usually thought of as a data warehouse technology, but it can be used to fulfill some of the same responsibilities. Delta is an open source storage layer that sits on top of a data lake. Delta integrates closely with Spark, creating a system that Databricks refers to as a "data lakehouse." Michael Armbrust is an engineer at Databricks. He joins the show to talk about his experience building the company, his perspective on data engineering, and his work on Delta, the storage system built for the Spark ecosystem.
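As a small sketch of the shift Michael describes, the snippet below (under the same delta-spark assumption as the earlier Delta example) turns a directory of plain Parquet files into a Delta table in place and then queries it directly, skipping the pull-into-the-warehouse step; the path and column names are hypothetical.

```python
# A hedged sketch of querying lake storage directly instead of copying it
# into a warehouse. Assumes a SparkSession configured for Delta as in the
# earlier example; the path and schema are hypothetical.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Convert existing Parquet files into a Delta table in place: this adds a
# transaction log next to the files without rewriting the data itself.
DeltaTable.convertToDelta(spark, "parquet.`/tmp/lake/sales`")

# Query the lake directly, warehouse-style, with no extract-and-load step.
spark.read.format("delta").load("/tmp/lake/sales") \
    .createOrReplaceTempView("sales")
spark.sql("SELECT region, SUM(amount) AS total "
          "FROM sales GROUP BY region").show()
```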