The future of data infrastructure. We cover the explosion in compute demand, the petabytes of untapped enterprise data, energy-efficient GPUs, DeepSeek, the $500B Stargate project, and how AI is transforming data processing. Craig Dunham is CEO of Voltron Data, a company at the forefront of accelerating data processing for AI, analytics, and enterprise-scale workloads. Voltron provides the infrastructure necessary to handle enormous amounts of data, transforming bottlenecks into breakthroughs. By championing open-source frameworks like Apache Arrow, Voltron is building the connective tissue that allows businesses to process data with orders-of-magnitude gains in speed and efficiency, reshaping industries from finance to healthcare to national security, and partnering with the likes of Snowflake and Meta. Voltron has established itself as a key part of the AI infrastructure stack and has raised a total of $110M from the likes of Coatue, Lightspeed, Google Ventures, and BlackRock. Craig brings a deep background in scaling data infrastructure businesses. Before Voltron Data, he was CEO of Lumar, a leading SaaS technical SEO platform. Prior to that, he held significant roles including General Manager at Guild Education and at Seismic, where he led the integration of Seismic's acquisition of The Savo Group and drove go-to-market strategies in the financial services sector. Craig began his career in investment banking with Citi and Lehman Brothers before transitioning into technology. He holds an MBA from Northwestern University's Kellogg School of Management. Sign up for new podcasts and our newsletter, and email me at danieldarling@focal.vc. See omnystudio.com/listener for privacy information.
Guest Alison Hill Panelist Richard Littauer Show Notes We're kicking off the new year of Sustain with host Richard Littauer discussing sustaining open source software with guest, Alison Hill, VP of Product at Anaconda, and a cognitive scientist with a PhD in psychology. Alison shares her journey from academia to industry, emphasizing the importance of statistics and data science in her career. She explains her role at Anaconda, focusing on developing a secure and compatible distribution of Python packages and managing the community repository, Anaconda.org. The conversation covers the significance of product management in open source projects, particularly those with corporate backing, and how these roles can help in balancing user needs and business goals. In addition, Alison shares her thoughts on the challenges and strategies for maintaining open source projects without corporate support and touches on the ‘palmerpenguins' project. Click to download now! [00:01:13] Alison discusses her transition from academic research in cognitive science to industry and data science, emphasizing her passion for statistics and education. [00:02:41] Alison explains her work at Anaconda, focusing on product management and the Anaconda distribution, aiming to ease the use of Python and R packages in industry and academia. She also elaborates on other projects she oversees, including Anaconda.org and its role in supporting open source projects and enterprise needs. [00:05:17] We hear how Anaconda sustains itself financially through enterprise offerings and the balance of supporting open source while maintaining a business model. [00:07:14] Alison shares her previous experience as the first PM of data science communication at Posit (formerly RStudio) and her role in enhancing data science education and product development. 
[00:12:49] Richard and Alison explore the challenges of sustaining open source projects without corporate backing and strategies for maintaining personal and project health in the open source community. Alison discusses common mistakes companies make by confusing project management with product management in open source projects. [00:17:18] Richard asks about the skills needed for developers to adopt a product-oriented approach. Alison suggests that successful product-oriented developers often have high empathy for end-users and experience with the pain points at scale, which helps them anticipate and innovate solutions effectively. [00:20:49] Richard expresses concerns about the sustainability of smaller, community-led open source projects that lack corporate backing and the structured support that comes with it. Alison acknowledges her limited experience with non-corporate open source projects but highlights the difficulty in maintaining such projects without institutional support, and she shares her personal challenges with keeping up with open source project demands. [00:27:41] Alison stresses the importance of clear goals and clarity about the desired outcomes when joining larger ecosystems, and shares examples of successful and unsuccessful engagements in such settings. [00:29:52] She discusses alternative sustainability models, including paid support and subscriptions. [00:33:00] Alison brings up the example of Apache Arrow and the challenges it faced with corporate sponsorship. [00:34:23] We wrap up with Richard acknowledging that not all open source projects require significant funding or formal business models, and Alison explains the ‘palmerpenguins' project she did at the beginning of COVID. [00:37:07] Find out where you can follow Alison on the web. 
Quotes [00:22:18] “What is the minimum level of support you need to not feel like you're drowning?” Spotlight [00:38:14] Richard's spotlight is Bernard Cornwell. [00:38:39] Alison's spotlight is the book, Impossible Creatures. Links SustainOSS (https://sustainoss.org/) podcast@sustainoss.org (mailto:podcast@sustainoss.org) richard@sustainoss.org (mailto:richard@sustainoss.org) SustainOSS Discourse (https://discourse.sustainoss.org/) SustainOSS Mastodon (https://mastodon.social/tags/sustainoss) Open Collective-SustainOSS (Contribute) (https://opencollective.com/sustainoss) Richard Littauer Socials (https://www.burntfen.com/2023-05-30/socials) Alison Hill, PhD Website (https://www.apreshill.com/) Alison Presmanes Hill, PhD LinkedIn (https://www.linkedin.com/in/apreshill/) Alison Presmanes Hill GitHub (https://github.com/apreshill) Anaconda (https://www.anaconda.com/) Anaconda.org (https://anaconda.org/) The Third Bit-Dr. Greg Wilson (https://third-bit.com/about/) Sustain Podcast-Episode 64: Travis Oliphant and Russel Pekrul on NumPy, Anaconda, and giving back with FairOSS (https://podcast.sustainoss.org/guests/oliphant) Intercom on Product Management (https://www.intercom.com/resources/books/intercom-product-management) Sustain Podcast-Episode 135: Tracy Hinds on Node.js's CommComm and PMs in Open Source (https://podcast.sustainoss.org/135) Hadley Wickham (https://en.wikipedia.org/wiki/Hadley_Wickham) palmerpenguins-GitHub (https://allisonhorst.github.io/palmerpenguins/articles/intro.html) Bernard Cornwell (https://en.wikipedia.org/wiki/Bernard_Cornwell) Impossible Creatures by Katherine Rundell (https://www.penguinrandomhouse.com/books/743371/impossible-creatures-by-katherine-rundell-illustrated-by-ashley-mackenzie/) Credits Produced by Richard Littauer (https://www.burntfen.com/) Edited by Paul M. Bahr at Peachtree Sound (https://www.peachtreesound.com/) Show notes by DeAnn Bahr Peachtree Sound (https://www.peachtreesound.com/) Special Guest: Alison Hill.
Apache Arrow is a columnar format and multi-language toolbox for fast data interchange and in-memory analytics (project website). I spoke with Matt Topol at Community Over Code in Denver last month about the various subprojects, and how you can get …
I had the pleasure of interviewing Wes McKinney, Creator of Pandas, a name well-known in the data world through his work on the Pandas Project and his book, Python for Data Analysis. Wes is now at Posit PBC, and during our conversation at Small Data SF, we covered several key topics around the evolving data landscape! Wes shared his thoughts on the significance of Small Data, why it's a compelling topic right now, and what “Retooling for a Smaller Data Era” means for the industry. We also dove into the challenges and potential benefits of shifting from Big Data to Small Data, and discussed whether this trend represents the next big movement in data. Curious about Apache Arrow and what's next for Wes? Check out our interview where Wes gives some great insights into the future of data tooling. #data #ai #smalldatasf2024 #theravitshow
What does it take to go from leading Kafka development at Confluent to becoming a key figure in the PostgreSQL world? Join us as we talk with Gwen Shapira, co-founder and chief product officer at Nile, about her transition from cloud-native technologies to the vibrant PostgreSQL community. Gwen shares her journey, including the shift from conferences like O'Reilly Strata to PostgresConf and JavaScript events, and how the Postgres community is evolving with tools like Discord that keep it both grounded and dynamic. We dive into the latest developments in PostgreSQL, like hypothetical indexes that enable performance tuning without affecting live environments, and the growing importance of SSL for secure database connections in cloud settings. Plus, we explore the potential of integrating PostgreSQL with Apache Arrow and Parquet, signaling new possibilities for data processing and storage. At the intersection of AI and PostgreSQL, we examine how companies are using vector embeddings in Postgres to meet modern AI demands, balancing specialized vector stores with integrated solutions. Gwen also shares insights from her work at Nile, highlighting how PostgreSQL's flexibility supports SaaS applications across diverse customer needs, making it a top choice for enterprises of all sizes. Follow Gwen on: Nile Blog, X (Twitter), LinkedIn, Nile Discord. What's New In Data is a data thought leadership series hosted by John Kutay, who leads data and products at Striim. What's New In Data hosts industry practitioners to discuss the latest trends, common patterns for real-world data, and analytics success stories.
Allen Wyma talks with Andrew Lamb about InfluxDB's rewrite. InfluxDB is an open-source time series database. As a Staff Engineer at InfluxData, Andrew works on InfluxDB 3.0, a new time series database written in Rust, focusing on query processing and the Apache Arrow DataFusion and Apache Arrow ecosystems. In that capacity, he is a member and past chair of the Apache Arrow PMC and actively contributes to Apache Arrow DataFusion and the Rust implementation of Apache Arrow. Andrew was a professional C/C++ programmer for 10 years before switching to Rust. His experience ranges from startups to large multinational corporations and distributed open source projects, and he has paid his leadership dues as an architect and manager/VP. He holds an SB and MEng from MIT in Electrical Engineering and Computer Science. Contributing to Rustacean Station Rustacean Station is a community project; get in touch with us if you'd like to suggest an idea for an episode or offer your services as a host or audio editor! Twitter: @rustaceanfm Discord: Rustacean Station Github: @rustacean-station Email: hello@rustacean-station.org Timestamps [@0:52] - Meet Andrew Lamb, Staff Engineer at InfluxData, working on InfluxDB IOx [@2:57] - Transitioning from C++ to Rust: Andrew's story [@11:24] - InfluxDB rewrite and its use cases [@22:13] - Compatibility of InfluxDB [@26:58] - Downsides of using Rust and other languages [@32:40] - Plans for the 3.0 alpha/beta release and different versions [@34:54] - Unique use of the async runtime Tokio [@55:28] - Rust as a tool for recruitment [@58:16] - Closing discussion Other links Andrew's X Account Using Rustlang's Async Tokio Runtime for CPU-Bound Tasks Using the FDAP Architecture to build InfluxDB 3.0 RustASIA Conf 2025 Credits Intro Theme: Aerocity Audio Editing: Plangora Hosting Infrastructure: Jon Gjengset Show Notes: Plangora Hosts: Allen Wyma
Tomer Shiran is Founder of Dremio, the data lakehouse platform for self-service analytics and AI, based on the open source frameworks Apache Arrow, which the Dremio team created, and Apache Iceberg. Dremio has raised over $400M from investors including Norwest, Redpoint, Adams Street, Sapphire, Insight, and Lightspeed. They are currently valued at $2B. In this episode, we dig into Tomer's journey from MapR to Dremio, his initial vision for making the data stack more accessible, their first breakthrough with Apache Arrow and a columnar-format approach, focusing first on product-market fit before monetization, adding support for Apache Iceberg, how they're using AI to improve user experiences & more!
Join us at our first in-person conference on June 25 all about AI Quality: https://www.aiqualityconference.com/ Cody Peterson has diverse work experience in product management and engineering. He is currently a Technical Product Manager at Voltron Data, a role he started in May 2023. Previously, he was a Product Manager at dbt Labs from July 2022 to March 2023. MLOps podcast #234 with Cody Peterson, Senior Technical Product Manager at Voltron Data | Ibis project // Open Standards Make MLOps Easier and Silos Harder. Huge thank you to Weights & Biases for sponsoring this episode. WandB Free Courses: http://wandb.me/courses_mlops // Abstract MLOps is fundamentally a discipline of people working together on a system with data and machine learning models. These systems are already built on open standards we may not notice -- Linux, git, scikit-learn, etc. -- but are increasingly hitting walls with respect to the size and velocity of data. Pandas, for instance, is the tool of choice for many Python data scientists -- but its scalability is a known issue. Many tools assume data that fits in memory, but most organizations have data that will never fit on a laptop. What approaches can we take? One emerging approach, taken by the Ibis project (created by Wes McKinney, the creator of pandas), is to leverage existing "big" data systems to do the heavy lifting behind a lightweight Python data frame interface. Alongside other open source standards like Apache Arrow, this can allow data systems to communicate with each other and users of these systems to learn a single data frame API that works across any of them. Open standards like Apache Arrow, Ibis, and more in the MLOps tech stack enable freedom for composable data systems, where components can be swapped out, allowing engineers to use the right tool for the job. It also helps avoid vendor lock-in and keeps costs low. 
// Bio Cody is a Senior Technical Product Manager at Voltron Data, a next-generation data systems builder that recently launched an accelerator-native GPU query engine for petabyte-scale ETL called Theseus. While Theseus is proprietary, Voltron Data takes an open periphery approach -- it is built on and interfaces through open standards like Apache Arrow, Substrait, and Ibis. Cody focuses on the Ibis project, a portable Python dataframe library that aims to be the standard Python interface for any data system, including Theseus and over 20 other backends. Prior to Voltron Data, Cody was a product manager at dbt Labs focusing on the open source dbt Core and launching Python models (note: "models" is a confusing term here). Later, he led the Cloud Runtime team and drastically improved the efficiency of engineering execution and product outcomes. Cody started his career as a Product Manager at Microsoft working on Azure ML. He spent about 2 years on the dedicated MLOps product team, and 2 more years on various teams across the ML lifecycle including data, training, and inferencing. He is now passionate about using open source standards to break down the silos and challenges facing real-world engineering teams, where engineering increasingly involves data and machine learning. // MLOps Jobs board https://mlops.pallet.xyz/jobs // MLOps Swag/Merch https://mlops-community.myshopify.com/ // Related Links Ibis Project: https://ibis-project.org Apache Arrow and the “10 Things I Hate About pandas”: https://wesmckinney.com/blog/apache-arrow-pandas-internals/ --------------- ✌️Connect With Us ✌️ ------------- Join our slack community: https://go.mlops.community/slack Follow us on Twitter: @mlopscommunity Sign up for the next meetup: https://go.mlops.community/register Catch all episodes, blogs, newsletters, and more: https://mlops.community/ Connect with Demetrios on LinkedIn: https://www.linkedin.com/in/dpbrinkm/ Connect with Cody on LinkedIn: https://linkedin.com/in/codydkdc
Talk Python To Me - Python conversations for passionate developers
This episode dives into some of the most important data science libraries from the Python space with one of its pioneers: Wes McKinney. He's the creator or co-creator of pandas, Apache Arrow, and Ibis projects and an entrepreneur in this space. Episode sponsors Neo4j Mailtrap Talk Python Courses Links from the show Wes' Website: wesmckinney.com Pandas: pandas.pydata.org Apache Arrow: arrow.apache.org Ibis: ibis-project.org Python for Data Analysis - Groupby Summary: wesmckinney.com/book Polars: pola.rs Dask: dask.org Sqlglot: sqlglot.com Pandoc: pandoc.org Quarto: quarto.org Evidence framework: evidence.dev pyscript: pyscript.net duckdb: duckdb.org Jupyterlite: jupyter.org Djangonauts: djangonaut.space Watch this episode on YouTube: youtube.com Episode transcripts: talkpython.fm --- Stay in touch with us --- Subscribe to us on YouTube: youtube.com Follow Talk Python on Mastodon: talkpython Follow Michael on Mastodon: mkennedy
An aesthetically pleasing journey through the history of R, another demonstration of DuckDB's power with analytics, and how webR with shinylive brings new learning life to the Pharmaverse TLG gallery. Episode Links This week's curator: Sam Parmar - @parmsam@fosstodon.org (Mastodon) & @parmsam_ (X/Twitter) The Aesthetics Wiki - an R Addendum R Dplyr vs. DuckDB - How to Enhance Your Data Processing Pipelines with R DuckDB TLG Catalog
Do you have data that you pull from external sources or that is generated and appears at your digital doorstep? I bet that data needs to be processed, filtered, transformed, distributed, and much more. One of the biggest tools for creating these data pipelines with Python is Dagster. And we are fortunate to have Pedram Navid on the show this episode. Pedram is the Head of Data Engineering and DevRel at Dagster Labs. And we're talking data pipelines this week at Talk Python. Episode sponsors Talk Python Courses Posit Links from the show Rock Solid Python with Types Course: training.talkpython.fm Pedram on Twitter: twitter.com Pedram on LinkedIn: linkedin.com Ship data pipelines with extraordinary velocity: dagster.io dagster-open-platform: github.com The Dagster Master Plan: dagster.io data load tool (dlt): dlthub.com DataFrames for the new era: pola.rs Apache Arrow: arrow.apache.org DuckDB is a fast in-process analytical database: duckdb.org Ship trusted data products faster: www.getdbt.com Watch this episode on YouTube: youtube.com Episode transcripts: talkpython.fm --- Stay in touch with us --- Subscribe to us on YouTube: youtube.com Follow Talk Python on Mastodon: talkpython Follow Michael on Mastodon: mkennedy
The Datanation Podcast - Podcast for Data Engineers, Analysts and Scientists
Alex Merced discusses many of the open source projects aiming to reduce the frictions of the heavily fragmented data world. Follow me on Socials: https://bio.alexmerced.com/data
In this episode of the Digital Executive, hosted by Brian Thomas at Coruzant Technologies, we delve into the groundbreaking journey of Tomer Shiran, a pioneer in the big data analytics space and a key figure in the evolution of Dremio. Shiran, with a rich background as Dremio's founding CEO and a former VP of Product at MapR, discusses the transformative path of Dremio from its inception to becoming a major player in data analytics, serving large enterprise customers and embracing generative AI technologies to enhance user productivity and data accessibility. Shiran shares insights into Dremio's innovative features, like Text to SQL, which converts natural language queries into SQL code, democratizing data querying for users across varying degrees of data literacy. Additionally, he highlights the significant impact of emerging technologies, such as Apache Iceberg and Apache Arrow, on the data analytics and management sector, emphasizing gen AI's potential to revolutionize the field. This episode provides a deep dive into the challenges and opportunities of integrating AI into data analytics platforms, the importance of self-service data access, and the future trends that will shape the industry.
Summary Building a database engine requires a substantial amount of engineering effort and time investment. Over the decades of research and development into building these software systems there are a number of common components that are shared across implementations. When Paul Dix decided to re-write the InfluxDB engine he found the Apache Arrow ecosystem ready and waiting with useful building blocks to accelerate the process. In this episode he explains how he used the combination of Apache Arrow, Flight, Datafusion, and Parquet to lay the foundation of the newest version of his time-series database. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster (https://www.dataengineeringpodcast.com/dagster) today to get started. Your first 30 days are free! Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. 
And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Join us at the top event for the global data community, Data Council Austin. From March 26-28th 2024, we'll play host to hundreds of attendees, 100 top speakers and dozens of startups that are advancing data science, engineering and AI. Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working together to build the future of data and sharing their insights and learnings through deeply technical talks. As a listener to the Data Engineering Podcast you can get a special discount off regular priced and late bird tickets by using the promo code dataengpod20. Don't miss out on our only event this year! Visit dataengineeringpodcast.com/data-council (https://www.dataengineeringpodcast.com/data-council) and use code dataengpod20 to register today! Your host is Tobias Macey and today I'm interviewing Paul Dix about his investment in the Apache Arrow ecosystem and how it led him to create the latest PFAD in database design Interview Introduction How did you get involved in the area of data management? Can you start by describing the FDAP stack and how the components combine to provide a foundational architecture for database engines? This was the core of your recent re-write of the InfluxDB engine. What were the design goals and constraints that led you to this architecture? Each of the architectural components are well engineered for their particular scope. What is the engineering work that is involved in building a cohesive platform from those components? 
One of the major benefits of using open source components is the network effect of ecosystem integrations. That can also be a risk when the community vision for the project doesn't align with your own goals. How have you worked to mitigate that risk in your specific platform? Can you describe the operational/architectural aspects of building a full data engine on top of the FDAP stack? What are the elements of the overall product/user experience that you had to build to create a cohesive platform? What are some of the other tools/technologies that can benefit from some or all of the pieces of the FDAP stack? What are the pieces of the Arrow ecosystem that are still immature or need further investment from the community? What are the most interesting, innovative, or unexpected ways that you have seen parts or all of the FDAP stack used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on/with the FDAP stack? When is the FDAP stack the wrong choice? What do you have planned for the future of the InfluxDB IOx engine and the FDAP stack? Contact Info LinkedIn (https://www.linkedin.com/in/pauldix/) pauldix (https://github.com/pauldix) on GitHub Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story. Links FDAP Stack Blog Post (https://www.influxdata.com/blog/flight-datafusion-arrow-parquet-fdap-architecture-influxdb/) Apache Arrow (https://arrow.apache.org/) DataFusion (https://arrow.apache.org/datafusion/) Arrow Flight (https://arrow.apache.org/docs/format/Flight.html) Apache Parquet (https://parquet.apache.org/) InfluxDB (https://www.influxdata.com/products/influxdb/) Influx Data (https://www.influxdata.com/) Podcast Episode (https://www.dataengineeringpodcast.com/influxdb-timeseries-data-platform-episode-199) Rust Language (https://www.rust-lang.org/) DuckDB (https://duckdb.org/) ClickHouse (https://clickhouse.com/) Voltron Data (https://voltrondata.com/) Podcast Episode (https://www.dataengineeringpodcast.com/voltron-data-apache-arrow-episode-346/) Velox (https://github.com/facebookincubator/velox) Iceberg (https://iceberg.apache.org/) Podcast Episode (https://www.dataengineeringpodcast.com/iceberg-with-ryan-blue-episode-52/) Trino (https://trino.io/) ODBC == Open DataBase Connectivity (https://en.wikipedia.org/wiki/Open_Database_Connectivity) GeoParquet (https://github.com/opengeospatial/geoparquet) ORC == Optimized Row Columnar (https://orc.apache.org/) Avro (https://avro.apache.org/) Protocol Buffers (https://protobuf.dev/) gRPC (https://grpc.io/) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
How do you avoid the bottlenecks of data processing systems? Is it possible to build tools that decouple storage and computation? This week on the show, creator of the pandas library Wes McKinney is here to discuss Apache Arrow, composable data systems, and community collaboration.
Mehmet Ozan Kabak, Ph.D. joins us to introduce the idea of AI-native application development. Ozan applies his real world experience working on machine learning at Instagram Signals and various other roles in AI & ML. Embark on a transformative journey into the heart of AI infrastructure with Ozan Kabak, a beacon of knowledge in the realms of AI and machine learning. Our enlightening dialogue traverses the landscape of 'AI native' applications, where Ozan's insights bridge the gap between academic theory and industry practice. Through anecdotes from his Stanford days to tales of data infrastructure dilemmas, Ozan demystifies the often-overlooked development hurdles such as model monitoring and the balance between training and inference. This episode promises to illuminate the intricate dance behind the scenes of deploying AI solutions, sparing not a detail on the developer's labor and the pivotal moments that shape the backbone of AI applications. Gain an edge as we unpack the strategic foresight necessary for wielding AI in business; a cautious approach underscored by the significance of a robust data framework and the lurking risks of customer-facing AI systems. Ozan's expertise shines as we introduce Apache Arrow, the open-source project championing data format interoperability, heralding a new era of standardization and best practices. Be prepared to peer into the crystal ball of AI's future with us, where efficiency reigns supreme, and the compute landscape is primed for an overhaul. We grapple with the immense potential and existential considerations of large language models, examining how today's marvels could become tomorrow's masters. Tune in for a session packed with insights that will redefine your perspective on AI's current and future roles. What's New In Data is a data thought leadership series hosted by John Kutay who leads data and products at Striim. 
What's New In Data hosts industry practitioners to discuss the latest trends, common patterns for real-world data, and analytics success stories.
Fredrik has Matt Topol and Lars Wikman over for a deep and wide chat about Apache Arrow and many, many topics in the orbit of the language-independent columnar memory format for flat and hierarchical data. What does that even mean? What is the point? And why does Arrow only feel more and more interesting and useful the more you think about deeply integrating it into your systems? Feeding data to systems fast enough is a problem which receives much less focus than it ought to. With Arrow you can send data over the network, process it on the CPU - or GPU for that matter - and send it along to the database. All without parsing, transformation, or copies unless absolutely necessary. Thank you Cloudnet for sponsoring our VPS! Comments, questions or tips? We are @kodsnack, @tobiashieta, @oferlund and @bjoreman on Twitter, have a page on Facebook and can be emailed at info@kodsnack.se if you want to write longer. We read everything we receive. If you enjoy Kodsnack we would love a review in iTunes! You can also support the podcast by buying us a coffee (or two!) through Ko-fi. Links Lars Matt Øredev Matt's Øredev presentations: State of the Apache Arrow ecosystem: How your project can leverage Arrow! 
and Leveraging Apache Arrow for ML workflows Kallbadhuset Apache Arrow Lars talks about his Arrow rabbit hole in Regular programming SIMD/vectorization Spark Explorer - builds on Polars Null bitmap Zeromq Airbyte Arrow flight Dremio Arrow flight SQL Influxdb Arrow flight RPC Kafka Pulsar Opentelemetry Arrow IPC format - also known as Feather ADBC - Arrow database connectivity ODBC and JDBC Snowflake DBT - SQL to SQL Jinja Datafusion Ibis Substrait Meta's Velox engine Arrow's project management committee (PMC) Voltron data Matt's Arrow book - In-memory analytics with Apache Arrow Rapids and Cudf The Theseus engine - accelerator-native distributed compute engine using Arrow The composable codex The standards chapter Dremio Hugging face Apache Hop - orchestration data scheduling thing Directed acyclic graph UCX - libraries for finding fast routes for data Infiniband NUMA CUDA GRPC Foam bananas Turkish pepper - Tyrkisk peber Plopp Marianne Titles For me, it started during the speaker's dinner Old, dated, and Java A real nerd snipe Identical representation in memory Working on columns It's already laid out that way Pass the memory, as is Null plus null is null A wild perk Arrow into the thing So many curly brackets you need to store Arrow straight through Something data people like to do So many backends The SQL string is for people I'm rude, and he's polite Feed the data fast enough A depressing amount of JSON Arrow the whole way through These are the problems in data Reference the bytes as they are Boiling down to Arrow Data lakehouses Removing inefficiency
What are the new ways to describe your data in pandas 2.0? Will the addition of Apache Arrow to the data back end foster the growth of data interoperability? This week on the show, we talk with pandas core developer Marc Garcia about the release of pandas 2.0.
Today's show is all about the world of big data and open source projects, and we've got a real gem to share with you—Voltron Data! They're on a mission to revolutionize the data analytics industry through open standards. To unleash the untapped potential in data, Voltron Data uses cutting-edge tech and provides top-notch support services, with a special focus on Apache Arrow. This open-source framework lets you process data in both flat and hierarchical formats, all packed into a super-efficient columnar memory setup. And that's not all! Meet Ibis—an amazing framework that gives data analysts, scientists, and engineers the power to access their data with a user-friendly and engine-agnostic Python library. Excited to learn more? We've got Josh Patterson, the CEO of Voltron Data, here to give us all the details.
The MapScaping Podcast - GIS, Geospatial, Remote Sensing, earth observation and digital geography
So why would anyone want to put a lot of data into a browser? Well, for a lot of the same reasons that edge computing and distributed computing have become so popular: you get the data a lot closer to the user and you don't have to pay for the compute ;) … This sounds great, but as I found out during this conversation it's not as easy as it might seem! There are a lot of trade-offs that need to be evaluated when moving data and analytics to the client. Nick Rabinowitz, Senior Staff Software Engineer at Foursquare, has a ton of experience with this, so he volunteered his time to help us understand more about it. https://location.foursquare.com/ https://studio.foursquare.com/home If you are not familiar with the Arrow data format, it might be worth checking out. Apache Arrow defines a language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware like CPUs and GPUs. The Arrow memory format also supports zero-copy reads for lightning-fast data access without serialization overhead. Related podcast episodes that you might find interesting include: H3 grid system https://mapscaping.com/podcast/h3-geospatial-indexing-system/ The H3 geospatial indexing system is a discrete global grid system consisting of a multi-precision hexagonal tiling of the sphere with hierarchical indexes. H3 is a really interesting approach to tiling data that was developed by Uber and has been open-sourced. Hex Tiles https://mapscaping.com/podcast/hex-tiles/ If you have not heard of the H3 grid system before, listen to that episode first before listening to this one - it will add a lot of useful context! Spatial Knowledge Graphs https://mapscaping.com/podcast/spatial-knowledge-graphs/ Foursquare is moving away from spatial joins and focusing on building a knowledge graph. 
If you are not familiar with graphs this might be a good place to start; it's also interesting to hear the reasons for the move from spatial joins to another data structure. Distributing Geospatial Data https://mapscaping.com/podcast/distributing-geospatial-data/ This is interesting if you want to understand more about distributed databases and some of the strategies for doing this. It sounds complicated, but this episode is a really good introduction! Cloud Native Geospatial https://mapscaping.com/podcast/cloud-native-geospatial/ This episode gives a solid overview of what cloud-native means and some of the current geospatial cloud-native formats out there today. I am constantly thinking about how I can make this podcast better for you, so if you have any ideas or suggestions please let me know! Also, I am thinking of recording a behind-the-scenes episode - is that something you might be interested in? If so, what questions do you have?
Talk Python To Me - Python conversations for passionate developers
AI has taken the world by storm. It's gone from near zero to amazing in just a few years. We have ChatGPT, we have Stable Diffusion. But what about Jupyter Notebooks and pandas? In this episode, we meet Justin Waugh, the creator of Sketch. Sketch adds the ability to have conversational AI interactions about your pandas data frames (code and data). It's pretty powerful and I know you'll enjoy the conversation. Links from the show Sketch: github.com Lambdaprompt: github.com Python Bytes 320 - Coverage of Sketch: pythonbytes.fm ChatGPT: chat.openai.com Midjourney: midjourney.com Github Copilot: github.com GitHub Copilot Litigation site: githubcopilotlitigation.com Attention is All You Need paper: research.google.com Live Colab Demo: colab.research.google.com AI Panda from Midjourney: digitaloceanspaces.com Ray: pypi.org Apache Arrow: arrow.apache.org Python Web Apps that Fly with CDNs Course: talkpython.fm Watch this episode on YouTube: youtube.com Episode transcripts: talkpython.fm --- Stay in touch with us --- Subscribe to us on YouTube: youtube.com Follow Talk Python on Mastodon: talkpython Follow Michael on Mastodon: mkennedy Sponsors Brilliant 2023 Talk Python Training
A fresh take on open-source funding, Fedora's plan for better encryption out of the box, and our impressions of the latest Ubuntu Beta.
Pre-release announcement for Go 1.20.1 & 1.19.6 to fix private security issues
Pre-release announcement for golang.org/x/image/tiff & golang.org/x/image to fix private security issues
Transparent Telemetry
GitHub Discussion (now locked)
Blog post explaining the problem and proposed solution
GopherCon Israel
Apache Arrow 11.0 released
Matt Topol
GitHub profile
Voltron Data
Book: In-Memory Analytics with Apache Arrow
Presentation at SubSurface: Understanding Apache Arrow
Presentation at ApacheCon 2022: Apache Arrow and Go: A Match made in Data
Apache Arrow project web site
Apache Arrow Go library
Follow Matt on Twitter, LinkedIn or Mastodon
Matt will be speaking at the free, virtual conference Subsurface on March 1
Talk Python To Me - Python conversations for passionate developers
When you think about processing tabular data in Python, what library comes to mind? Pandas, I'd guess. But there are other libraries out there and Polars is one of the more exciting new ones. It's built in Rust, embraces parallelism, and can be 10-20x faster than Pandas out of the box. We have Polars' creator, Ritchie Vink here to give us a look at this exciting new data frame library. Links from the show Ritchie on Mastodon: @ritchie46@fosstodon.org Ritchie on Twitter: @RitchieVink Ritchie's website: ritchievink.com Polars: pola.rs Apache Arrow: arrow.apache.org Polars Benchmarks: pola.rs Coming from Pandas Guide: github.io Watch this episode on YouTube: youtube.com Episode transcripts: talkpython.fm --- Stay in touch with us --- Subscribe to us on YouTube: youtube.com Follow Talk Python on Mastodon: talkpython Follow Michael on Mastodon: mkennedy Sponsors Taipy User Interviews Talk Python Training
Josh Patterson (@datametrician, Co-Founder & CEO @VoltronData) talks about the concept of composable data analytics and how it benefits our industry. What is it, why should we be using it, and how do we get started?
SHOW: 694
CLOUD NEWS OF THE WEEK - http://bit.ly/cloudcast-cnotw
NEW TO CLOUD? CHECK OUT - "CLOUDCAST BASICS"
SHOW SPONSORS:
Solve your IAM mess with Strata's Identity Orchestration platform. Have an identity challenge you thought was too big, too complicated, or too expensive to fix? Let us solve it for you! Visit strata.io/cloudcast to share your toughest IAM challenge and receive a set of AirPods Pro.
How to Fix the Internet (a new podcast from the EFF)
Datadog Kubernetes Solution: Maximum Visibility into Container Environments. Start monitoring the health and performance of your container environment with a free 14-day Datadog trial. Listeners of The Cloudcast will also receive a free Datadog T-shirt.
SHOW NOTES:
Voltron Data (homepage)
Apache Arrow (homepage)
CRN 10 Hottest Big Data Startups of 2022 (CRN)
Voltron grabs 110M Series A (TechCrunch)
Topic 1 - Hello Josh and welcome to the show. You have a very diverse and interesting background. Can you give everyone a quick introduction? As a follow-up, tell everyone a little bit about your experience as a Presidential Innovation Fellow.
Topic 2 - Before we dig into Voltron Data, we need to tell everyone about Apache Arrow. Businesses and organizations tend to be overwhelmed by big data - everything from the volume, to the tools, to the lack of data scientists and practitioners. Can you give everyone an overview of Arrow, how it came to be, and what problem it solves?
Topic 3 - Arrow has companies like Snowflake, Netflix, Meta, Databricks, Google and Microsoft all adopting it. Our listeners will be more familiar with Snowflake & Databricks and their business models - what makes Voltron Data different? How are you building a company on top of OSS?
Topic 4 - Let's talk about communities and standards. I've seen various numbers on Arrow and monthly downloads, always in the tens of millions per month. Your focus appears to be providing services for Arrow and other Apache projects to simplify open source for those that don't have the skills or time, while also working towards the goal of community standards. Is that correct?
Topic 5 - How will open source standards for data help the data analytics industry move faster? Is this a process problem? A data set problem? A tools problem?
Topic 6 - Data analytics has a reputation for a high barrier to entry. If our listeners are interested, how can they get started?
FEEDBACK?
Email: show at the cloudcast dot net
Twitter: @thecloudcastnet
Summary Business intelligence has gone through many generational shifts, but each generation has largely maintained the same workflow. Data analysts create reports that are used by the business to understand and direct the business, but the process is very labor and time intensive. The team at Omni have taken a new approach by automatically building models based on the queries that are executed. In this episode Chris Merrick shares how they manage integration and automation around the modeling layer and how it improves the organizational experience of business intelligence. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management Truly leveraging and benefiting from streaming data is hard - the data stack is costly, difficult to use and still has limitations. Materialize breaks down those barriers with a true cloud-native streaming database - not simply a database that connects to streaming systems. With a PostgreSQL-compatible interface, you can now work with real-time data using ANSI SQL including the ability to perform multi-way complex joins, which support stream-to-stream, stream-to-table, table-to-table, and more, all in standard SQL. Go to dataengineeringpodcast.com/materialize (https://www.dataengineeringpodcast.com/materialize) today and sign up for early access to get started. If you like what you see and want to help make it better, they're hiring (https://materialize.com/careers/) across all functions! Your host is Tobias Macey and today I'm interviewing Chris Merrick about the Omni Analytics platform and how they are adding automatic data modeling to your business intelligence Interview Introduction How did you get involved in the area of data management? Can you describe what Omni Analytics is and the story behind it? What are the core goals that you are trying to achieve with building Omni? Business intelligence has gone through many evolutions. 
What are the unique capabilities that Omni Analytics offers over other players in the market? What are the technical and organizational anti-patterns that typically grow up around BI systems? What are the elements that contribute to BI being such a difficult product to use effectively in an organization? Can you describe how you have implemented the Omni platform? How have the design/scope/goals of the product changed since you first started working on it? What does the workflow for a team using Omni look like? What are some of the developments in the broader ecosystem that have made your work possible? What are some of the positive and negative inspirations that you have drawn from the experience that you and your team-mates have gained in previous businesses? What are the most interesting, innovative, or unexpected ways that you have seen Omni used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Omni? When is Omni the wrong choice? What do you have planned for the future of Omni? Contact Info LinkedIn (https://www.linkedin.com/in/merrickchristopher/) @cmerrick (https://twitter.com/cmerrick) on Twitter Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story. 
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers Links Omni Analytics (https://www.exploreomni.com/) Stitch (https://www.stitchdata.com/) RJ Metrics (https://en.wikipedia.org/wiki/RJMetrics) Looker (https://www.looker.com/) Podcast Episode (https://www.dataengineeringpodcast.com/looker-with-daniel-mintz-episode-55/) Singer (https://www.singer.io/) dbt (https://www.getdbt.com/) Podcast Episode (https://www.dataengineeringpodcast.com/dbt-data-analytics-episode-81/) Teradata (https://www.teradata.com/) Fivetran (https://www.fivetran.com/) Apache Arrow (https://arrow.apache.org/) Podcast Episode (https://www.dataengineeringpodcast.com/voltron-data-apache-arrow-episode-346/) DuckDB (https://duckdb.org/) Podcast Episode (https://www.dataengineeringpodcast.com/duckdb-in-process-olap-database-episode-270/) BigQuery (https://cloud.google.com/bigquery) Snowflake (https://www.snowflake.com/en/) Podcast Episode (https://www.dataengineeringpodcast.com/snowflakedb-cloud-data-warehouse-episode-110/) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
Wes McKinney is the creator of pandas, co-creator of Apache Arrow, and now Co-founder/CTO at Voltron Data. In this conversation with Tristan and Julia, Wes takes us on a tour of the underlying guts, from hardware to data formats, of the data ecosystem. What innovations, down to the hardware level, will stack to lead to significantly better performance for analytics workloads in the coming years? To dig deeper on the Apache Arrow ecosystem, check out replays from their recent conference at https://thedatathread.com. For full show notes and to read 7+ years of back issues of the podcast's companion newsletter, head to https://roundup.getdbt.com. The Analytics Engineering Podcast is sponsored by dbt Labs.
Wes McKinney, CTO & Co-Founder of Voltron Data, joins me for an in-depth conversation on how his quest to develop Python as an open-source programming language led him to create the pandas project and found four companies. In this episode, Wes and I dive into his unique background as the founder of the pandas project and he describes his perspective on the early days of Python, his journey into the world of open-source start-ups, and the risks and benefits of paying developers to work on open-source projects. Highlights: Wes introduces himself and describes his role (00:46) Wes' role in elevating Python to a mainstream programming language (02:15) How working with Python led Wes to co-founding his first two companies (09:01) Apache Arrow's critical role at Voltron Data and their focus on accelerating Arrow adoption (12:52) How did the team at Voltron Data decide on an open-source business model? (18:54) Wes speaks to the risk that can come from having developers work on an open-source project (22:31) Wes' perspective on the real-world applications and benefits of paying developers to work on open-source projects (27:44) Links: Wes LinkedIn: https://www.linkedin.com/in/wesmckinn/ Twitter: https://twitter.com/wesmckinn Company: https://voltrondata.com/
Apache Arrow defines a language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware like CPUs and GPUs. The Arrow memory format also supports zero-copy reads for lightning-fast data access without serialization overhead. Wes McKinney is the CEO of Ursa Computing, a new startup working on accelerated computing The post Arrow Infrastructure with Wes McKinney appeared first on Software Engineering Daily.
DuckDB is a lightweight, in-process OLAP database: very useful for preparing data in SQL. All the more so since it is compiled C++, very fast, ships with many functions, can read and write CSV and Parquet files, and can be used via ODBC, JDBC, the Python or C++ APIs, or simply from the CLI. Apache Arrow is a format for the columnar representation of analytical data that avoids serialization and the time lost to it.
Tomer Shiran is Co-Founder and CPO of Dremio, a Bay Area-based, high-performance, forever-free lakehouse platform that builds on an open data architecture by the creators of Apache Arrow. Prior to founding Dremio, Tomer was an early employee at MapR Technologies where he ran product management. Tomer dives into their plans at Dremio to impact the data industry through better, more scalable platforms which will allow companies to have a data platform they can truly rely on. Listen in to hear more about Tomer's background, how Dremio got started, and their plans for the future. Show Notes: Check out Dremio: https://www.dremio.com/ Learn more from Dremio's Blog: https://www.dremio.com/blog/ Connect with Tomer on LinkedIn: https://www.linkedin.com/in/tshiran/ On tap for today's episode: Cappuccino & Espresso Contact Us: https://www.hashmapinc.com/reach-out
We had heard about Apache Arrow and Arrow Flight as being a high-performing database with access speeds to match for a while now and finally got a chance to hear what it was all about with James Duong, Co-Founder of Bit Quill Technologies/Senior Staff Developer at Dremio and David Li (@lidavidm), Apache PMC and software … Continue reading "130: GreyBeards talk high-speed database access using Apache Arrow Flight, with James Duong and David Li"
This episode features an interview with Tomer Shiran, Founder and Chief Product Officer at Dremio. Dremio is a high-performance SQL lakehouse platform that helps companies get more from their data in the fastest way possible. Prior to Dremio, Tomer served as VP of Product at MapR and also held product management and engineering roles at Microsoft and IBM Research. He also has a master's degree from Carnegie Mellon University as well as a bachelor's from Technion - Israel Institute of Technology. In this episode, Tomer and Sam dive into the economics of storing data, how to build an open architecture, and what exactly a data lakehouse is.
-------------------
“I think in the world of data lakes and lakehouses, the model has shifted upside down. Now, instead of bringing the data into the engines, you're actually bringing the engines to the data. So you have this open data tier built on open source technology. The data is represented in open source formats and stored in the company's S3 account or Azure storage account. And then you can use a variety of engines. We at Dremio, we take pride in building the best SQL engine to use on the data. There are different streaming engines, like Spark and Flink. There are different batch processing and machine learning engines. Spark is an example of that as well that companies can use on that same data. 
And I think that's one of the really important things from a cost standpoint, too, is that this really lowers your overall costs, both today and also in the future as you scale.” – Tomer Shiran
-------------------
Episode Timestamps:
(02:04): What open source data means to Tomer
(03:14): Tomer's motivation behind Apache Arrow
(06:42): How Tomer solved data accessibility
(08:43): The unit economics of storing data
(14:31): Tomer's motivations for Iceberg and how it relates to Project Nessie
(17:06): What is a data lakehouse?
(18:31): What gives Dremio its magic?
(23:39): What cloud data architecture will look like in 5 years
(27:19): Advice for building an open data architecture
-------------------
Links:
LinkedIn - Connect with Tomer
LinkedIn - Connect with Dremio
Twitter - Follow Tomer
Twitter - Follow Dremio
Visit Dremio
Get started with Dremio
In this episode:
We discuss the world of protocols like TCP and HTTP
The protocols for API building (REST, GraphQL, RPC)
Apache Arrow and why it is a great RPC use case
MY RECENT ARTICLE ON APACHE ARROW FLIGHT SQL: https://www.dremio.com/subsurface/an-introduction-to-apache-arrow-flight-sql/
MY RECENT ARTICLE ON RPC: https://dev.to/alexmercedcoder/understanding-rpc-tour-of-api-protocols-grpc-nodejs-walkthrough-and-apache-arrow-flight-55bd
LINK TO REGISTER FOR SUBSURFACE CONFERENCE: https://www.dremio.com/subsurface/live/winter2022/?utm_medium=social&utm_source=dremio&utm_term=alexmercedsocial&utm_content=na&utm_campaign=event-subsurface-2022
Voltron Data was launched last year by former employees from NVidia, Ursa Computing, BlazingSQL and the co-founder of Apache Arrow.
Building the Backend: Data Solutions that Power Leading Organizations
In this episode we speak with Matt Topol, Vice President, Principal Software Architect @ FactSet and dive deep into how they are taking advantage of Apache Arrow for faster processing and data access. Below are the top 3 value bombs:
1. Apache Arrow is an open-source in-memory columnar format that creates a standard way to share and process data structures.
2. Apache Arrow Flight eliminates serialization and deserialization, which enables faster access to query results compared to traditional JDBC and ODBC interfaces.
3. Don't put all your eggs in one basket: whether you're using commercial products or open source, make sure you design a modular architecture that does not tie you down to any one piece of technology.
In this episode: JetBrains Fleet as told by its developers, discussed from all angles; a breakdown of a talk on Apache Arrow as part of our databases segment; a couple of conference announcements and listener topics; and a small GameZen at the end. Show notes: [00:01:02] Interview with our guests - Fleet: Next-generation IDE by JetBrains [00:16:52] What we learned this week [01:23:38] The unbearable topic - Apache Arrow:… Read more →
Wes McKinney joins us to discuss the history and philosophy of pandas and Apache Arrow as well as his continued work in open source tools. In this episode you will learn: • History of pandas [7:29] • The trends of R and Python [23:33] • Python for Data Analysis [25:58] • pandas updates and community [30:10] • Apache Arrow [41:50] • Voltron Data [55:10] • Origin of Wes's project names [1:08:14] • Wes's favorite tools [1:09:46] • Audience Q&A [1:15:34] Additional materials: www.superdatascience.com/523
Julien has a unique history of building open frameworks that make data platforms interoperable. He's contributed in various ways to Apache Arrow, Apache Iceberg, Apache Parquet, and Marquez, and is currently leading OpenLineage, an open framework for data lineage collection and analysis. In this episode, Tristan & Julia dive into how open source projects grow to become standards, and why data lineage in particular is in need of an open standard. They also cover some of the compelling use cases for this data lineage metadata, and where you might be able to deploy it in your work. For full show notes and to read 6+ years of back issues of the podcast's companion newsletter, head to https://roundup.getdbt.com. The Analytics Engineering Podcast is sponsored by dbt Labs.
Do you want to know the latest in big data analytics frameworks? Have you ever heard of Apache Arrow? Rust? Ballista? In this episode I speak with Andy Grove one of the main authors of Apache Arrow and Ballista compute engine. Andy explains some challenges while he was designing the Arrow and Ballista memory models and he describes some amazing solutions. Our Sponsors If building software is your passion, you'll love ThoughtWorks Technology Podcast. It's a podcast for techies by techies. Their team of experienced technologists take a deep dive into a tech topic that's piqued their interest — it could be how machine learning is being used in astrophysics or maybe how to succeed at continuous delivery. Amethix use advanced Artificial Intelligence and Machine Learning to build data platforms and predictive engines in domain like finance, healthcare, pharmaceuticals, logistics, energy. Amethix provide solutions to collect and secure data with higher transparency and disintermediation, and build the statistical models that will support your business. References https://arrow.apache.org/ https://ballistacompute.org/ https://github.com/ballista-compute/ballista
Support the show (http://paypal.me/SachinPanicker )
In this episode of the podcast, Tosha Ellison and Grizz Griswold (both of FINOS) interview Andrew Stein, Executive Director at J.P. Morgan Chase, and Lead Maintainer on the FINOS Perspective open source project. We discuss the project itself, its genesis, problems that Perspective solves for what users, and then pull in examples of what the project and the software can do. We also look at who should, and why they should get involved in consuming and contributing to the open source project. Andrew also gave a presentation and demo on "How to Build an Order Book Simulation with Perspective" last month - so check that out too: https://www.finos.org/blog/us-open-source-in-fintech-meetup-31-march-21 BACKGROUND FOR THE FINOS PERSPECTIVE PROJECT Perspective is an interactive visualization component for large, real-time datasets. Originally developed for J.P. Morgan's trading business, Perspective makes it simple to build real-time & user configurable analytics entirely in the browser, or in concert with Python and/or Jupyterlab. Use it to create reports, dashboards, notebooks and applications, with static data or streaming updates via Apache Arrow. As a library, Perspective provides both: A fast, memory efficient streaming query engine, written in C++ and compiled for both WebAssembly and Python, with read/write/stream/virtual support for Apache Arrow. A framework-agnostic User Interface Custom Element and Jupyterlab Widget, via WebWorker (WebAssembly) or virtually via WebSocket (Python/Node), and a suite of Datagrid and D3FC Chart plugins. Website https://perspective.finos.org/ GitHub Repo https://github.com/finos/perspective/ Case Study https://www.finos.org/blog/perspective-project-case-study Andrew Stein, Executive Director, J.P. Morgan Chase Andrew has been a web developer for 15 years. 
Despite winning the 2018 Nueske’s Bacon Night Award as a member of team “Lard and In Charge” at “Hogs for the Cause” BBQ festival, Andrew rejected a life of perennial BBQ fame and returned to programming full-time where he currently works on Perspective at JPMC. ►► Visit here for more FINOS Meetups - https://www.finos.org/hosted-events ►► Visit FINOS www.finos.org ►► Get In Touch: info@finos.org
Do you want to know the latest in big data analytics frameworks? Have you ever heard of Apache Arrow? Rust? Ballista? In this episode I speak with Andy Grove, one of the main authors of Apache Arrow and the Ballista compute engine. Andy explains some of the challenges he faced while designing the Arrow and Ballista memory models, and describes some amazing solutions. Our Sponsors This episode is supported by Chapman's Schmid College of Science and Technology, where master's and PhD students join in cutting-edge research as they prepare to take the next big leap in their professional journey. To learn more about the innovative tools and collaborative approach that distinguish the Chapman program in Computational and Data Sciences, visit chapman.edu/datascience If building software is your passion, you'll love the ThoughtWorks Technology Podcast. It's a podcast for techies by techies. Their team of experienced technologists takes a deep dive into a tech topic that's piqued their interest — it could be how machine learning is being used in astrophysics, or maybe how to succeed at continuous delivery. References https://arrow.apache.org/ https://ballistacompute.org/ https://github.com/ballista-compute/ballista
Our guest this week is the one and only Wes McKinney, creator of pandas and Apache Arrow. We have a great conversation about his career journey, funding and maintaining open-source software projects, his new company Ursa Computing, how pandas grew from a passion project to the lingua franca of Python data science, and a lot more.
Uwe Korn is a data engineer who has been involved in several open source projects for years, in particular Apache Parquet and Apache Arrow. Apache Parquet is a column-oriented storage format for tabular data, with good write and read performance for batch processing. When writing, Parquet records the data types and numerous metrics, and uses built-in compression to significantly reduce file size. We also talk about other data formats such as Avro, CSV, ORC, HDF5, and Feather. Apache Arrow is an in-memory format for data that bridges numerous programming languages, making it possible to access the same data from C code, Java, Rust, or any of the other implemented languages. Uwe explains how this language bridge works and how Arrow can be used in the future not only to hold data but also to process it. To close, I ask Uwe about his involvement in the open source world: How did he get started? How does he balance open source with work and private life? And what should you pay attention to if you want to support an open source project yourself? Further links: Chan Zuckerberg Initiative supports Arrow
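The column-oriented idea Uwe describes can be sketched in a few lines of plain Python. This is an illustration of the principle only, not Parquet's actual on-disk format: values are grouped by column, and simple per-column statistics are captured at write time so a reader can skip whole chunks of data without scanning them.

```python
# Toy sketch of the columnar-storage idea behind Parquet: group values
# by column and record per-column min/max statistics at write time.
# Illustration only -- not Parquet's real encoding or file layout.
rows = [
    {"city": "Berlin", "temp": 21.5},
    {"city": "Hamburg", "temp": 18.0},
    {"city": "Munich", "temp": 24.3},
]

def to_columnar(rows):
    """Pivot row-oriented records into columns plus simple metadata."""
    columns = {key: [row[key] for row in rows] for key in rows[0]}
    metadata = {
        key: {"min": min(values), "max": max(values)}
        for key, values in columns.items()
    }
    return columns, metadata

columns, metadata = to_columnar(rows)

# A reader looking for temperatures above 25.0 can skip this chunk
# entirely by consulting the metadata, without reading the values.
can_skip = metadata["temp"]["max"] < 25.0
```

Real Parquet adds encodings, compression, and row groups on top of this basic pivot, but the skip-by-statistics trick works exactly as in the sketch.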
Apache Arrow is an in-memory data structure for engineers building data systems. Support the show (http://paypal.me/SachinPanicker)
In this episode of the Data Exchange I speak with Wes McKinney, Director of Ursa Labs and an Apache Arrow PMC member. Wes is the creator of pandas, one of the most widely used Python libraries for data science. He is also the author of the best-selling book “Python for Data Analysis”, which has become essential reading for both aspiring and experienced data scientists. Our conversation focused on data science tools and other topics, including: the two open source projects Wes has long been associated with, pandas and Apache Arrow; the need for a shared infrastructure for data science; and Ursa Labs, its mission and structure. Detailed show notes can be found on The Data Exchange website. Subscribe to The Gradient Flow Newsletter.
Tim Hall is the VP of Products for InfluxData, the creators of the open-source time-series platform InfluxDB. Their technology is purpose-built to handle the massive volumes of time-stamped data produced by IoT devices, applications, networks, containers, and computers. Kieran and Tim discuss working with cutting-edge technology companies and how great products are built. Topics include Apache Arrow, Flight, InfluxDB, Templates, and learning from their failures. Show Notes: Tim's Github: https://github.com/orgs/influxdata/people/timhallinflux Tim's Twitter: @thallinflux On tap for today’s episode: Tazo Zen Green Tea & Bewley’s Irish Afternoon Tea Contact Us: https://www.hashmapinc.com/reach-out
Apache Arrow is a cross-language development platform for in-memory data. It supports zero-copy reads and streaming messaging, with implementations in a number of languages, including C, C++, Python, R, Rust, and many others.
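The "zero-copy" part can be illustrated with nothing but the Python standard library: a memoryview exposes an existing typed buffer to a second consumer without duplicating any bytes. Arrow generalizes this with a language-independent columnar memory layout so the shared buffer can cross process and language boundaries; the snippet below is a single-process analogy, not Arrow's API.

```python
import array

# A contiguous, typed buffer -- loosely analogous to an Arrow float64 column.
values = array.array("d", [1.0, 2.0, 3.0, 4.0])

# Zero-copy view: both names now refer to the same underlying memory.
view = memoryview(values)

# Mutating the original buffer is visible through the view immediately,
# which demonstrates that no copy was made.
values[0] = 99.0
assert view[0] == 99.0
assert view.nbytes == 4 * 8  # four float64 values, stored exactly once
```

Arrow applies the same principle between systems: because every implementation agrees on the byte layout, a buffer produced by a C++ engine can be consumed by Python or R without serialization.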
Data Futurology - Data Science, Machine Learning and Artificial Intelligence From Industry Leaders
Tomer Shiran is the Co-Founder and CEO of Dremio, the Data-as-a-Service Platform company. Created by veterans of open source and big data technologies, and the creators of Apache Arrow, Dremio is a fundamentally new approach to data analytics that helps companies get more value from their data, faster. Dremio makes data engineering teams more productive and data consumers more self-sufficient. Tomer previously headed the product management team at MapR, where he was responsible for product strategy, roadmap, and requirements. Before MapR, Tomer held numerous product management and engineering roles at Microsoft, most recently as the product manager for Microsoft Internet Security & Acceleration Server (now Microsoft Forefront). He is the founder of two websites that have served tens of millions of users and received coverage in prestigious publications such as The New York Times, USA Today, and The Times of London. Tomer is also the author of a 900-page programming book. He holds an MS in Computer Engineering from Carnegie Mellon University and a BS in Computer Science from Technion - Israel Institute of Technology. Enjoy the show! We speak about: [01:50] How Tomer started in the data space [03:35] What was it like running your own business? [04:50] What did you think would happen with ePassportPhoto? [07:20] Takeaways from MapR [09:35] What was the process of starting Dremio? [10:55] How did you gauge how much product development needed to be done? [12:20] Where did you start with your hiring process? [13:00] What have been some of the pivotal moments for Dremio? [14:35] What does Dremio do? [16:00] What is the semantic layer? [20:00] Who are the users? [23:30] What are the data masking capabilities? [25:10] How has the journey been for you personally? [28:35] What challenges are you facing right now? [30:00] About Tomer’s teams [31:15] The importance of having a sales team [33:30] How has Dremio changed with the increase of employees? 
[34:30] What does the future look like for Dremio? [35:00] What do international expansions look like for Dremio? [35:45] What are you most proud of in your career? [36:20] Any lessons from your failures? [37:45] Advice for future entrepreneurs [40:00] Future challenges in the data space [41:45] A piece of advice for the listeners Thank you to our sponsors: Fyrebox - Make Your Own Quiz! RMIT Online Master of Data Science Strategy and Leadership Gain the advanced strategic, leadership and data science capabilities required to influence executive leadership teams and deliver organisation-wide solutions. We are RUBIX. - one of Australia’s leading pure data consulting companies delivering project outcomes for some of the world’s leading brands. Visit online.rmit.edu.au for more information And as always, we appreciate your Reviews, Follows, Likes, Shares and Ratings. Thank you so much for listening. Enjoy the show! --- Send in a voice message: https://anchor.fm/datafuturology/message
Related links: Preferred Networks KDD 2019 | Chainer: a Deep Learning Framework for Accelerating the Research Cycle chainer/chainerio Jubatus: a distributed processing framework for online machine learning PFN's third deep-learning supercomputer to go live in July 2019, bringing combined capacity to 200 petaFLOPS Stochastic gradient descent Lustre (file system) C API libhdfs Amazon CloudFront Kerberos authentication Hadoop and Kerberos Apache Hadoop Ozone Small files cause big problems: preventing and handling small files on Hadoop clusters "Rook", which integrates the Ceph distributed storage system into Kubernetes, becomes an official CNCF project, with file, block, and S3-compatible object storage plus multi-region support Python bindings - Apache Arrow Site Reliability Engineering: How Google Runs Production Systems Autonomous Tidying-up Robot System NSDI ‘19 OSDI ‘18 "My department head asked me: 'What do you actually do at the company? What kind of engineer are you again?' I work on Hadoop, Hive, MySQL, machine learning, log analysis, natural language processing, VBA, and firefighting, but that's a mouthful, so I answered 'high-functioning odd jobs'. That's roughly accurate." Preferred Networks Careers GTC Silicon Valley-2019: MagLev: A Production-grade AI Platform Running on GPU-enabled Kubernetes Clusters
00:09:36 - Monad error 00:14:55 - Scala meetup - Nizhny Novgorod 00:18:11 - Meetup in Yekaterinburg 00:35:44 - Testing 00:52:52 - Li Hayou - How to work with Files in Scala 00:56:13 - GitHub Trending 00:58:53 - The hardware side: Berkeley Out-of-Order Machine Rocket Chip Generator Flexible Intermediate Representation for RTL ??:??:?? - Renaissance-benchmark ??:??:?? - Grisha benchmarked Apache Arrow Support the podcast: https://www.patreon.com/scalalalaz Voices in this episode: Vadim Chelyshov, Alexey Fomkin, Evgeny Tokarev, Grigory Pomadchin
Python has become one of the dominant languages for data science and data analysis. Wes McKinney has been working for a decade to make tools that are easy and powerful, starting with the creation of Pandas, and eventually leading to his current work on Apache Arrow. In this episode he discusses his motivation for this work, what he sees as the current challenges to be overcome, and his hopes for the future of the industry.
Another spectacular rstudio::conf is in the books and the R-Podcast has tons of insights to share! We kick off our coverage with a three-podcast crossover as I am joined by Credibly Curious co-host Nick Tierney and Not So Standard Deviations co-host Hilary Parker! We discuss our impressions of the conference and where we'd like to see R go in 2019. Plus I share how my journey to the Advanced R-Markdown workshop is a testament to the welcoming, open nature of the R community. This is just the beginning of our coverage and I hope you enjoy this episode! Conversation with Hilary Parker and Nick Tierney Credibly Curious podcast: soundcloud.com/crediblycurious (https://soundcloud.com/crediblycurious) Not So Standard Deviations podcast: nssdeviations.com (http://nssdeviations.com/) Apache Arrow: arrow.apache.org (https://arrow.apache.org/) Tidy Evaluation online book: tidyeval.tidyverse.org (https://tidyeval.tidyverse.org/) Tidy models family of packages: github.com/tidymodels (https://github.com/tidymodels) The magick package by Jeroen Ooms: github.com/ropensci/magick (https://github.com/ropensci/magick) pagedown package (paginate HTML output of R Markdown) by Yihui Xie: github.com/rstudio/pagedown (https://github.com/rstudio/pagedown) Advanced R Markdown workshop highlights Course website: arm.rbind.io (https://arm.rbind.io/) (powered by blogdown (https://bookdown.org/yihui/blogdown/)!) 
Course GitHub repository: github.com/rstudio-education/arm-workshop-rsc2019 (https://github.com/rstudio-education/arm-workshop-rsc2019) My slides on using the officer package to create PowerPoint slides: rpodcast.github.io/officer-advrmarkdown (https://rpodcast.github.io/officer-advrmarkdown) The officer package documentation: davidgohel.github.io/officer (https://davidgohel.github.io/officer/) MegaMan slide generator Shiny app: rpodcast.shinyapps.io/megaman (https://rpodcast.shinyapps.io/megaman/) GitHub repo for slides and app: github.com/rpodcast/officer-advrmarkdown (https://github.com/rpodcast/officer-advrmarkdown) Feedback Leave a comment on this episode's post: r-podcast.org/26 (https://r-podcast.org/26) Email the show: thercast[at]gmail.com Use the R-Podcast contact page: r-podcast.org/contact (https://r-podcast.org/contact) Leave a voicemail at +1-269-849-9780 Music Credits Opening and closing themes: Training Montage by WillRock (http://ocremix.org/artist/5043/willrock) from the Return All Robots Remix Album (http://ocremix.org/events/returnallrobots/) at ocremix.org (http://ocremix.org/)
Wes McKinney is the creator and "Benevolent Dictator for Life" (BDFL) of the open-source pandas package for data analysis in Python, and has also authored two versions of the reference book Python for Data Analysis. Wes is also one of the co-creators of the Apache Arrow project, which is currently his main focus. Most recently, he is the founder of Ursa Labs, a not-for-profit open source development group in partnership with RStudio. He describes himself as a problem-solver, and is particularly interested in improving the usability of data tools for programmers, accelerating data access and in-memory data processing performance, and improving data system interoperability. In my conversation with Wes today, we focused on getting to know Wes on a more personal level, discussing his background and interests to get some insight into the living legend of open source he has become. [3:48] How did coming from four generations of newspapermen impact Wes’s upbringing? [6:00] What kind of hobbies was he interested in growing up, and what is the origin of his interest in computers? [11:08] How did he come to run a Goldeneye 007 world record website, and update and maintain it by hand? [16:10] Wes’s high school career as a mathlete, and how an early interest in math contributed to his approach to programming. [18:15] How Wes brings the rigor he learned in mathematics to software engineering. [19:50] How languages and math scratch the same itch for composition. [21:00] About learning enough German to complete a PHP programming internship in Munich. [23:00] How Wes’s experience using data in his first year working post-undergrad set him down the path to pandas. [25:00] What went into his decision to take leave from grad school to build pandas? [27:00] The legendary tweet where Wes expressed his sense of purpose and motivation in building pandas. [29:52] Why Wes’s work is motivated by the desire to free up people’s time to realize their full potential. 
[30:51] Zero to One - Peter Thiel [31:40] Why is solving basic efficiency problems, like reading CSV files, so important? [34:12] How community management has played such a huge role in making pandas so successful compared to other tools. [39:00] The importance of seeing peers in an open source project as people with good intentions and more than just a GitHub profile. [46:00] How do the incentives of an open source project influence prioritization in a project? [51:45] How Wes’s newest project, Ursa Labs, is tackling the problem of funding in open source software development. [56:20] Wes’s goals for Ursa Labs over the next five years. AJ’s Twitter: https://twitter.com/ajgoldstein393 Wes’s Twitter: https://twitter.com/wesmckinn Wes’s personal website: http://wesmckinney.com Wes’s LinkedIn: https://www.linkedin.com/in/wesmckinn/
Holden Karau is on the podcast this week to talk all about Spark and Beam, two open source tools that help process data at scale, with Mark and Melanie. Holden Karau Holden Karau is a transgender Canadian open source developer advocate @ Google with a focus on Apache Spark, BEAM, and related “big data” tools. She is the co-author of Learning Spark, High Performance Spark, and another Spark book that’s a bit more out of date. She is a committer and PMC member on Apache Spark, and a committer on the SystemML and Mahout projects. She was tricked into the world of big data while trying to improve search and recommendation systems and has long since forgotten her original goal. Cool things of the week Twitter’s collaboration with Google Cloud blog & tweet Kaggle CERN TrackML Particle Tracking Challenge Competition site Open-sourcing gVisor, a sandboxed container runtime blog & repo Announcing Stackdriver Kubernetes Monitoring blog MLPerf: collaborative effort to standardize ML benchmarks site Interview Spark site & community site Beam site Cloud Dataflow site & docs Cloud Dataproc site & docs Using Spark on Kubernetes Engine blog Testing future Apache Spark releases and changes on Google Kubernetes Engine and Cloud Dataproc blog Spark Packages site Spark testing base repo Flink site Arrow site Upcoming Talks: PyCon 2018 & Debugging PySpark talk Scala Days & Keeping the “fun” in Spark talk Strata London & Understanding Spark tuning with auto-tuning talk J on the Beach & General Purpose Big Data Systems are eating the world talk Spark Summit 2018 & Accelerating TF with Apache Arrow on Spark talk Question of the week I have a continuous integration build process setup with Container Builder, but it’s all sequential. I want to speed things up by processing parts of it in parallel. How do I do that? Configure Build Step Order docs Where can you find us next? Mark can be found streaming Agones development on Twitch. 
Melanie is speaking at the internet2 Global Summit, May 9th in San Diego, and will also be talking at the Understand Risk Forum on May 17th, in Mexico City. Special shout out: Google I/O and PyCon are both happening this week
Mark is joined in this episode of Drill to Detail by Wes McKinney, to talk about the origins of the Python pandas open-source package for data analysis and his subsequent work as a contributor to the Kudu (incubating) and Parquet projects within the Apache Software Foundation, and to Arrow, an in-memory data structure specification for engineers building data systems and the de facto standard for columnar in-memory processing and interchange.
Mark Rittman is joined by MapR's Neeraja Rentachintala to talk about Apache Drill, Apache Arrow, MapR-DB, extending Hadoop-based data discovery to self-describing file formats and NoSQL databases, and why MapR backed Drill as their strategic SQL-on-Hadoop platform technology.
Hilary and Roger talk about the difficulties of separating data analysis from its context, and Feather, a new file format for storing tabular data. Also, Hilary and Roger respond to some listener questions and Hilary announces her new job. If you have questions you’d like us to answer, you can send them to nssdeviations@gmail.com or tweet us at @NSSDeviations. Show notes: NSSD Patreon page (https://www.patreon.com/NSSDeviations) Feather git repository (https://github.com/wesm/feather/) Apache Arrow (https://arrow.apache.org) FlatBuffers (https://google.github.io/flatbuffers/) Roger’s blog post on feather (http://simplystatistics.org/2016/03/31/feather/) NausicaaDistribution (https://www.etsy.com/shop/NausicaaDistribution) New York R Conference (http://www.rstats.nyc) Every Frame a Painting (https://goo.gl/J2QAWK)