Podcasts about apache pinot

20 PODCASTS
45 EPISODES
38m AVG DURATION
INFREQUENT EPISODES
Mar 3, 2026 LATEST

POPULARITY (chart by year, 2019 to 2026)

Best podcasts about apache pinot

Real-Time Analytics with Tim Berglund

17 episodes with apache pinot

The GeekNarrator

7 episodes with apache pinot

Engenharia de Dados [Cast]

2 episodes with apache pinot

TechCrunch Startups – Spoken Edition

2 episodes with apache pinot

Latest podcast episodes about apache pinot

Building Planetary-Scale Data Systems with Venice • Felix GV & Olimpiu Pop

GOTO - Today, Tomorrow and the Future

Mar 3, 2026 · 28:38

This interview was recorded for GOTO Unscripted.
https://gotopia.tech

Check out more here:
https://gotopia.tech/articles/421

Félix GV - Current Interests: Multi-Planetary Databases, Data Sovereignty & Lifelogging
Olimpiu Pop - Technologist & Tech Journalist

RESOURCES
Félix
https://bsky.app/profile/felixgv.ninja
https://github.com/FelixGV
https://www.linkedin.com/in/felixgv

Olimpiu
https://x.com/olimpiupop
https://github.com/zroll
https://www.linkedin.com/in/olimpiupop

Links
https://venicedb.org
https://github.com/linkedin/venice
https://rocksdb.org
https://duckdb.org

DESCRIPTION
Félix GV, a former engineer at LinkedIn and architect of the Venice database system, discusses the complexity of building planetary-scale data systems. He explains Venice's unbundled architecture where each component, from Kafka-based pub/sub to RocksDB-powered servers, operates as an independent distributed system. Félix details their rigorous chaos engineering practices, including regular load tests that push data centers beyond normal capacity to ensure reliability. The discussion covers fundamental distributed systems concepts like the CAP theorem and the trade-offs between consistency and availability in multi-region deployments.
He also explains why Venice, as a derived data system, deliberately sacrifices strong consistency for high throughput and availability, and concludes by discussing their experimental integration of DuckDB for SQL-based analytics and data exploration capabilities.

RECOMMENDED BOOKS
Kasun Indrasiri & Danesh Kuruppu • gRPC: Up and Running • https://amzn.to/3sBGBJJ
Tomer Shiran, Jason Hughes & Alex Merced • Apache Iceberg: The Definitive Guide • https://amzn.to/488Z30k
William Smith • Arrow Flight Protocols and Practices • https://amzn.to/4o2Q2fd
Adi Polak • Scaling Machine Learning with Spark • https://amzn.to/3N9vx1H
Mark Needham, Michael Hunger & Michael Simons • DuckDB in Action • https://amzn.to/45QwSli
Simon Aubury & Ned Letcher • Getting Started with DuckDB • https://amzn.to/3VPk4qBlueskyInstagramLinkedInFacebook

CHANNEL MEMBERSHIP BONUS
Join this channel to get early access to videos & other perks:
https://www.youtube.com/channel/UCs_tLP3AiwYKwdUHpltJPuA/join

Looking for a unique learning experience?
Attend the next GOTO conference near you! Get your ticket: gotopia.tech

SUBSCRIBE TO OUR YOUTUBE CHANNEL - new videos posted daily!

action running scale practices spark cap programming venice open source kafka planetary sql zookeepers gv distributed systems data processing data sovereignty apache kafka data systems duckdb jason hughes apache iceberg cap theorem rocksdb apache pinot michael hunger

What makes Apache Pinot so Fast?

The GeekNarrator

Nov 16, 2025 · 59:15

For memberships: join this channel as a member here:
https://www.youtube.com/channel/UC_mGuY4g0mggeUGM6V1osdA/join

Summary:
In this episode, host Kaivalya Apte interviews Ankit Sultana, a staff engineer at Uber with extensive experience in Apache Pinot, a real-time analytics platform. They discuss the high-level architecture, ingestion processes, and query mechanisms of Apache Pinot. Ankit provides historical context, detailing the evolution of Apache Pinot from its origins at LinkedIn to its widespread adoption. They discuss the key components of Pinot, explaining the roles of Pinot servers, brokers, controllers, and the dependency on ZooKeeper. Ankit also explains how data flows into Apache Pinot and the technicalities of its real-time ingestion and querying capabilities.

Chapters:
00:00 Introduction and Episode Overview
03:30 Understanding Apache Pinot
03:49 Apache Pinot's Historical Background
05:20 Real-Time Analytics with Apache Pinot
11:06 Apache Pinot's Architecture and Components
17:05 Tenancy and Data Ingestion in Apache Pinot
30:22 Understanding Real-Time Replication and Consumer Groups
30:52 Pinot's Offset Tracking and Segment Creation
31:59 Handling Server Restarts and Segment Transitions
32:50 Dealing with Kafka Duplicates and Deduplication Features
35:13 Ingestion Process and Mutable vs Immutable Segments
39:18 Memory Management and Segment Flushing
40:10 Advantages of Keeping Mutable Segments Longer
42:21 Introduction to Pinot's Query Engines
42:50 Single Stage Engine: Architecture and Optimizations
54:49 Multi-Stage Engine: Flexibility and Challenges
58:13 Conclusion and Next Steps

Important Links:
* Good high-level overview on Pinot: https://www.youtube.com/watch?v=F8Q_pGIH9yY
* Apache Pinot 101 by Tim: https://www.youtube.com/playlist?list=PLihIrF0tCXdfN6y-twj9KtWaXM1GH4RSe
* Multistage Physical Optimizer, the new optimizer that we built at Uber and open-sourced: https://docs.pinot.apache.org/users/user-guide-query/multi-stage-query/physical-optimizer
* Multistage Lite Mode: https://docs.pinot.apache.org/users/user-guide-query/multi-stage-query/multistage-lite-mode
* Time Series Engine Talk at RTA Summit: https://www.youtube.com/watch?v=kgseiambges

Don't forget to like, share, and subscribe for more insights!

=============================================================================
Like building stuff? Try out CodeCrafters and build amazing real world systems like Redis, Kafka, SQLite. Use the link below to sign up and get 40% off on paid subscription.
https://app.codecrafters.io/join?via=geeknarrator
=============================================================================

Database internals series: https://youtu.be/yV_Zp0Mi3xs

Popular playlists:
Realtime streaming systems: https://www.youtube.com/playlist?list=PLL7QpTxsA4se-mAKKoVOs3VcaP71X_LA-
Software Engineering: https://www.youtube.com/playlist?list=PLL7QpTxsA4sf6By03bot5BhKoMgxDUU17
Distributed systems and databases: https://www.youtube.com/playlist?list=PLL7QpTxsA4sfLDUnjBJXJGFhhz94jDd_d
Modern databases: https://www.youtube.com/playlist?list=PLL7QpTxsA4scSeZAsCUXijtnfW5ARlrsN

Stay Curious! Keep Learning!

uber conclusion chapters architecture advantages real time databases kafka pinot zookeepers keep learning ankit redis tenancy sqlite mutable real time analytics memory management apache pinot

Building Parquet into Apache Pinot ft. Neha Pawar | Ep. 5

Streaming Audio: a Confluent podcast about Apache Kafka

Oct 20, 2025 · 26:07

Today, Tim Berglund talks to Neha Pawar (StarTree) about her career in real-time analytics and open source database engineering. Her first job: a year-long internship at NVIDIA. Her challenge: leading the technical effort to add native Parquet support into Apache Pinot.

SEASON 2
Hosted by Tim Berglund, Adi Polak and Viktor Gamov
Produced and Edited by Noelle Gallagher, Peter Furia and Nurie Mohamed
Music by Coastal Kites
Artwork by Phil Vo

nvidia edited neha parquet confluent pawar apache pinot

Cyber Bites - 9th May 2025

Cyber Bites

May 8, 2025 · 11:57

* Banks at Risk: Nearly 100 Staff Logins Stolen by Cybercriminals
* 'AirBorne' Vulnerabilities Expose Apple Devices to Remote Code Execution Attacks
* WhatsApp Introduces 'Private Processing' for Secure Cloud-Based AI Features
* Microsoft Warns Default Kubernetes Helm Charts Create Security Vulnerabilities
* Security Concerns Grow Over Electric Vehicles as Potential Surveillance Platforms

Banks at Risk: Nearly 100 Staff Logins Stolen by Cybercriminals
https://www.abc.net.au/news/2025-05-01/bank-employee-data-stolen-with-malware-and-sold-online/105232872

Cyber criminals have stolen almost 100 staff logins from Australia's "Big Four" banks, potentially exposing these financial institutions to serious cyber threats including data theft and ransomware attacks, according to recent findings from cyber intelligence firm Hudson Rock.

The compromised credentials belong to current and former employees and contractors at ANZ, Commonwealth Bank, NAB, and Westpac, with ANZ and Commonwealth Bank experiencing the highest number of breaches. All stolen credentials included corporate email addresses with access to official bank domains.

"There are around 100 compromised employees that are related to those four banks," said Hudson Rock analyst Leonid Rozenberg. While this number is significantly smaller than the 31,000 customer banking passwords recently reported stolen, the security implications could be more severe.

"Technically, [attackers] need only one [login] to do a lot of damage," Rozenberg warned.

The credentials were stolen between 2021 and April 2025 using specialized "infostealer" malware designed to harvest sensitive data from infected devices. These stolen credentials have subsequently appeared on Telegram and dark web marketplaces.

Security experts explain that these breaches could potentially give hackers "initial access" to the banks' corporate networks.
While banks employ additional security measures such as Multi-Factor Authentication (MFA), specialized cybercriminals known as "initial access brokers" focus on finding ways around these protections, often targeting employees working from home.

The investigation also uncovered a concerning number of compromised third-party service credentials connected to these banks, with ANZ having more than 100 such breaches and NAB more than 70. These compromised services could include critical communication and project management tools like Slack, JIRA, and Salesforce.

All four banks have responded by stating they have multiple safeguards in place to prevent unauthorized access. NAB reports actively scanning cybercrime forums to monitor threats, while CommBank noted investing over $800 million in cybersecurity and financial crime prevention last financial year.

The Australian Signals Directorate has already warned that infostealer infections have led to successful attacks on Australian businesses, highlighting that this threat extends beyond the banking sector to organizations across all industries.

'AirBorne' Vulnerabilities Expose Apple Devices to Remote Code Execution Attacks
https://www.oligo.security/blog/airborne

Security researchers at Oligo Security have uncovered a serious set of vulnerabilities in Apple's AirPlay protocol and software development kit (SDK) that could allow attackers to remotely execute code on affected devices without user interaction. These flaws, collectively dubbed "AirBorne," affect millions of Apple and third-party devices worldwide.

The security team discovered 23 distinct vulnerabilities that enable various attack vectors, including zero-click and one-click remote code execution, man-in-the-middle attacks, denial of service attacks, and unauthorized access to sensitive information.
Perhaps most concerning are two specific flaws (CVE-2025-24252 and CVE-2025-24132) that researchers demonstrated could create "wormable" zero-click attacks, potentially spreading from device to device across networks.

Another critical vulnerability (CVE-2025-24206) enables attackers to bypass the "Accept" prompt normally required for AirPlay connections, creating a pathway for truly zero-interaction compromises when combined with other flaws.

"This means that an attacker can take over certain AirPlay-enabled devices and do things like deploy malware that spreads to devices on any local network the infected device connects to," warned Oligo. "This could lead to the delivery of other sophisticated attacks related to espionage, ransomware, supply-chain attacks, and more."

While exploitation is limited to attackers on the same network as vulnerable devices, the potential impact is extensive. Apple reports over 2.35 billion active devices worldwide, and Oligo estimates tens of millions of additional third-party AirPlay-compatible products like speakers, TVs, and car infotainment systems could be affected.

Apple released security updates on March 31 to address these vulnerabilities across their product line, including patches for iOS 18.4, iPadOS 18.4, macOS versions (Ventura 13.7.5, Sonoma 14.7.5, and Sequoia 15.4), and visionOS 2.4 for Apple Vision Pro. The company also updated the AirPlay audio and video SDKs and the CarPlay Communication Plug-in.

Security experts strongly advise all users to immediately update their Apple devices and any third-party AirPlay-enabled products.
Additional protective measures include disabling AirPlay receivers when not in use, restricting AirPlay access to trusted devices via firewall rules, and limiting AirPlay permissions to the current user only.

WhatsApp Introduces 'Private Processing' for Secure Cloud-Based AI Features
https://engineering.fb.com/2025/04/29/security/whatsapp-private-processing-ai-tools/

Meta's WhatsApp has announced a new privacy-focused technology called 'Private Processing' that will allow users to access advanced artificial intelligence features while maintaining data security. The system is designed to enable AI functionalities like message summarization and writing suggestions that are too computationally intensive to run directly on users' devices.

The new feature, which will be rolled out gradually over the coming weeks, will be entirely opt-in and disabled by default, giving users complete control over when their data leaves their device for AI processing.

Private Processing employs several layers of security to protect user privacy. When activated, the system first performs anonymous authentication through the user's WhatsApp client. It then retrieves public encryption keys from a third-party content delivery network (CDN), ensuring Meta cannot trace requests back to specific individuals.

To further enhance privacy, users' devices connect to Meta's gateway through a third-party relay that masks their real IP addresses. The connection establishes a secure session between the user's device and Meta's Trusted Execution Environment (TEE), using remote attestation and TLS protocols.

All requests for AI processing use end-to-end encryption with ephemeral keys, and the processing occurs inside a Confidential Virtual Machine (CVM) that remains isolated from Meta's main systems.
According to Meta, the processing environment is stateless, with all messages deleted after processing, retaining only "non-sensitive" logs.

"The AI-generated response is encrypted with a unique key only known to the device and processing CVM and is sent back over the secure session for decryption on the user's device," the company explained.

To build trust in the system, WhatsApp has promised to share the CVM binary and portions of the source code for external validation. The company also plans to publish a detailed white paper explaining the secure design principles behind Private Processing.

Despite these security measures, privacy experts note that sending sensitive data to cloud servers always carries some inherent risk, even with robust encryption in place. Users concerned about data privacy can either keep the feature disabled or utilize WhatsApp's recently launched 'Advanced Chat Privacy' feature, which provides more granular control over when data can leave the device.

Microsoft Warns Default Kubernetes Helm Charts Create Security Vulnerabilities
https://techcommunity.microsoft.com/blog/microsoftdefendercloudblog/the-risk-of-default-configuration-how-out-of-the-box-helm-charts-can-breach-your/4409560

Microsoft security researchers have issued an urgent warning about significant security risks posed by default configurations in Kubernetes deployments, particularly when using out-of-the-box Helm charts. These configurations can inadvertently expose sensitive data to the public internet without proper authentication protections.

According to a new report from Michael Katchinskiy and Yossi Weizman of Microsoft Defender for Cloud Research, many popular Helm charts lack basic security measures, often leaving exploitable ports open and implementing weak or hardcoded passwords that are easy to compromise.

"Default configurations that lack proper security controls create a severe security threat," the Microsoft researchers warn.
"Without carefully reviewing the YAML manifests and Helm charts, organizations may unknowingly deploy services lacking any form of protection, leaving them fully exposed to attackers."

Kubernetes has become a widely adopted open-source platform for automating containerized application deployment and management, with Helm serving as its package manager. Helm charts function as templates or blueprints that define resources needed to run applications through YAML files. While these charts offer convenience by simplifying complex deployments, their default settings often prioritize ease of use over security.

The report highlights three specific examples demonstrating this widespread issue. Apache Pinot's Helm chart exposes core services through Kubernetes LoadBalancer services with no authentication requirements. Meshery allows public sign-up from exposed IP addresses, potentially giving anyone registration access to cluster operations. Meanwhile, Selenium Grid exposes services across all nodes in a cluster through NodePort, relying solely on external firewall rules for protection.

The Selenium Grid vulnerability is particularly concerning as cybersecurity firms including Wiz have already observed attacks targeting misconfigured instances to deploy XMRig miners for cryptocurrency mining.

Organizations using Kubernetes are advised to implement several key mitigation strategies. Microsoft recommends thoroughly reviewing default configurations of Helm charts before deployment, ensuring they include proper authentication mechanisms and network isolation.
Regular scans for misconfigurations that might publicly expose workload interfaces are crucial, as is continuous monitoring of containers for suspicious activity.

The findings underscore a critical tension in cloud deployment between convenience and security, with many users, particularly those inexperienced with cloud security, inadvertently creating vulnerabilities by deploying charts without customizing their security settings.

Security Concerns Grow Over Electric Vehicles as Potential Surveillance Platforms
https://www.theguardian.com/environment/2025/apr/29/source-of-data-are-electric-cars-vulnerable-to-cyber-spies-and-hackers

Cybersecurity experts are raising alarms about the potential for electric vehicles to be exploited as surveillance tools, particularly those manufactured in China, according to recent reports from the UK.

British defense firms working with the UK government have reportedly warned staff against connecting their phones to Chinese-made electric cars due to concerns that Beijing could extract sensitive information from their devices. The warning highlights growing security considerations around the increasingly sophisticated technology embedded in modern electric vehicles.

Security specialists interviewed by The Guardian note that electric vehicles are equipped with multiple data collection points, including microphones, cameras, and wireless connectivity features that could potentially be leveraged by malicious actors or hostile states.

"There are lots of opportunities to collect data and therefore lots of opportunities to compromise a vehicle like that," explains Rafe Pilling, director of threat intelligence at cybersecurity firm Secureworks. He points out that over-the-air update capabilities, which allow manufacturers to remotely update a car's operating software, could potentially be used to exfiltrate data.

The concerns are particularly focused on individuals in sensitive positions.
"If you are an engineer who is working on a sixth-generation fighter jet and you have a work phone that you are connecting to your personal vehicle, you need to be aware that by connecting these devices you could be allowing access to data on your mobile," warns Joseph Jarnecki, a research fellow at the Royal United Services Institute.

Chinese electric vehicle manufacturers such as BYD and XPeng have drawn particular scrutiny due to China's National Intelligence Law of 2017, which requires organizations and citizens to cooperate with national intelligence efforts. However, experts also note there is currently no public evidence of Chinese vehicles being used for espionage.

Cybersecurity professionals suggest that concerned drivers can click "don't trust" when connecting devices to their vehicles, but this sacrifices many convenient features. They also caution against syncing personal devices with rental cars, as this can leave sensitive data in the vehicle's systems.

The UK government has acknowledged the issue, with Defence Minister Lord Coaker stating they are "working with other government departments to understand and mitigate any potential threats to national security from vehicles." He emphasized that their work applies to all types of vehicles, not just those manufactured in China.

While the Society of Motor Manufacturers and Traders (SMMT) maintains that all manufacturers selling cars in the UK must adhere to data privacy regulations, the growing integration of connected technologies in electric vehicles continues to raise new security considerations for both government officials and everyday consumers alike.

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit edwinkwan.substack.com
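
The Helm chart item in these notes recommends reviewing default configurations and scanning for workloads that are exposed publicly. As a rough illustration only, the Python sketch below flags Kubernetes Service manifests whose type is LoadBalancer or NodePort, the two exposure paths named in the notes. It assumes the manifests have already been rendered (for example with `helm template`) and parsed into dictionaries; the function name `find_exposed_services` and the sample service names are hypothetical, and this is not the method used in the Microsoft report.

```python
from typing import Any

# Service types that publish a workload beyond the cluster-internal network.
EXTERNALLY_REACHABLE_TYPES = {"LoadBalancer", "NodePort"}


def find_exposed_services(manifests: list[dict[str, Any]]) -> list[str]:
    """Return 'namespace/name (type)' for each Service of an externally reachable type."""
    findings = []
    for manifest in manifests:
        if manifest.get("kind") != "Service":
            continue
        # Kubernetes treats a Service without an explicit type as ClusterIP.
        service_type = manifest.get("spec", {}).get("type", "ClusterIP")
        if service_type in EXTERNALLY_REACHABLE_TYPES:
            metadata = manifest.get("metadata", {})
            namespace = metadata.get("namespace", "default")
            name = metadata.get("name", "<unnamed>")
            findings.append(f"{namespace}/{name} ({service_type})")
    return findings


# Hypothetical manifests standing in for parsed chart output.
rendered = [
    {
        "kind": "Service",
        "metadata": {"name": "example-broker", "namespace": "analytics"},
        "spec": {"type": "LoadBalancer"},
    },
    {
        "kind": "Service",
        "metadata": {"name": "example-internal"},
        "spec": {"type": "ClusterIP"},
    },
    {"kind": "Deployment", "metadata": {"name": "example-app"}, "spec": {}},
]

for finding in find_exposed_services(rendered):
    print(finding)
```

A check like this only reports how a service is exposed. Whether the service behind it requires authentication still has to be reviewed separately, which is the other half of the recommendation in the notes.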

Real-Time Analytics... Supercharging AI and Observability with StarTree | Episode #97

Great Things with Great Tech!

Apr 7, 2025 · 40:48

Did you know every time you order food, book a ride, or even check who viewed your profile, real-time analytics is powering your experience behind the scenes?

In this episode of Great Things with Great Tech, we dive deep into the power of real-time analytics with Kishore Gopalakrishna, CEO and Co-founder of StarTree. StarTree leverages Apache Pinot, a high-performance real-time analytics database, revolutionizing how leading companies like Uber, LinkedIn, Walmart, and Etsy provide instant insights and personalized experiences at massive scale.

Kishore shares his journey from a gaming enthusiast fascinated by distributed systems to building mission-critical platforms at Yahoo and LinkedIn, eventually creating Apache Pinot. Discover how StarTree is powering billions of real-time queries per week, enabling businesses to enhance customer interactions, optimize operational decisions, and supercharge modern AI and observability.

Key Takeaways:
* How real-time analytics transform industries, enabling instantaneous insights and rapid decision-making.
* The evolution from traditional databases to highly efficient columnar, real-time analytics systems.
* Real-world applications of Apache Pinot, from consumer apps to enterprise observability and operational excellence.
* How real-time data is accelerating innovations in AI, specifically through Real-Time Retrieval-Augmented Generation (RAG).
* The future of analytics: seamless data ingestion, enhanced concurrency, and the growing demand for sub-second response times.

Links & Resources:
Web StarTree: https://startree.ai
Kishore Gopalakrishna on LinkedIn: https://www.linkedin.com/in/kgopalak/
Apache Pinot: https://pinot.apache.org

☑️ Support the Channel: https://ko-fi.com/gtwgt
☑️ Be on #GTwGT: Contact via Twitter @GTwGTPodcast or visit https://www.gtwgt.com
☑️ Subscribe to YouTube: https://www.youtube.com/@GTwGTPodcast?sub_confirmation=1

Check out the full episode on our platforms:
Spotify: https://open.spotify.com/episode/2l9aZpvwhWcdmL0lErpUHC?si=x3YOQw_4Sp-vtdjyroMk3Q
Apple Podcasts: https://podcasts.apple.com/us/podcast/darknet-diaries-with-jack-rhysider-episode-83/id1519439787?i=1000654665731

Follow Us:
Website: https://gtwgt.com
Twitter: https://twitter.com/GTwGTPodcast
Instagram: https://instagram.com/GTwGTPodcast

☑️ Music: https://www.bensound.com

ceo spotify ai discover real uber walmart yahoo etsy key takeaways great things supercharging observability kishore real time analytics startree apache pinot

AI, Community, and the Future of Generative Applications

Open at Intel

Nov 27, 2024 · 20:53

In this engaging conversation at the All Things Open conference, Tim Spann, Principal Developer Advocate at Zilliz, discusses the importance of community collaboration in advancing AI technologies. He emphasizes the need for diverse perspectives in solving complex problems and highlights his work with the Milvus open source vector database. Tim also explains the evolving landscape of retrieval augmented generation (RAG) and its applications and shares insights into the future of AI development. The conversation concludes on a lighter note with Tim describing his creative use of Milvus in a fun Halloween project to catalog and identify ghosts.

00:00 Introduction
00:41 Meet Tim Spann: Principal Developer Advocate
01:35 The Importance of Community in AI
02:56 Advanced RAG and Multimodal Models
06:17 The Future of Agentic RAG
09:04 Challenges and Excitement in AI Development
13:35 Building AI the Right Way
17:50 Fun with AI: Capturing Ghosts
19:24 Conclusion and Final Thoughts

Guest:
Tim Spann is a Principal Developer Advocate for Zilliz and Milvus. He works with Apache NiFi, Apache Kafka, Apache Pulsar, Apache Flink, Flink SQL, Apache Pinot, Trino, Apache Iceberg, DeltaLake, Apache Spark, Big Data, IoT, Cloud, AI/DL, machine learning, and deep learning. Tim has over ten years of experience with the IoT, big data, distributed computing, messaging, streaming technologies, and Java programming. Previously, he was a Principal Developer Advocate at Cloudera, Developer Advocate at StreamNative, Principal DataFlow Field Engineer at Cloudera, a Senior Solutions Engineer at Hortonworks, a Senior Solutions Architect at AirisData, a Senior Field Engineer at Pivotal and a Team Leader at HPE. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton & NYC on Big Data, Cloud, IoT, deep learning, streaming, NiFi, the blockchain, and Spark. Tim is a frequent speaker at conferences such as ApacheCon, DeveloperWeek, Pulsar Summit and many more. He holds a BS and MS in computer science.

community halloween ai new york city future challenges ms fun bs cloud conclusion spark applications excitement big data iot java generative pivotal team leaders rag hpe developer advocate cloudera trino apache kafka apache spark senior solutions architect hortonworks senior solutions engineer d zone apache flink apache iceberg all things open apache pulsar apache pinot nifi

Learnings from building Open Source Distributed Systems with Kishore Gopalakrishna

The GeekNarrator

Aug 27, 2024 · 60:24

In this episode of The Geek Narrator podcast, hosted by Kaivalya Apte, we welcome a special guest, Kishore Gopalakrishna from StarTree, co-author of Apache Pinot and other notable projects. Kishore shares his extensive experience in building real-time analytics and streaming systems, including Apache Pinot, Espresso, Apache Helix, and ThirdEye. The episode delves into the motivations and challenges behind creating these systems, the innovations they brought to distributed systems, and the impact of community on open-source projects. Kishore also discusses the evolution of testing methodologies, cost optimizations in transactional and analytical systems, and key considerations for companies evaluating real-time analytics solutions. Don't miss this in-depth conversation packed with valuable insights for both seasoned developers and tech enthusiasts!

Chapters:
00:00 Introduction
03:13 Building Distributed Systems at LinkedIn
08:57 Testing and Challenges in Distributed Systems
30:50 Advantages of Columnar Storage
33:04 The Importance of Upserts
34:24 Building a Strong Open Source Community
41:10 Challenges and Lessons in System Design
51:35 Real-Time Analytics: Do You Need It?

StarTree: https://startree.ai/
Apache Pinot: https://pinot.apache.org/

If you like this episode, please hit the like button and share it with your network. Also please subscribe if you haven't yet.

Database internals series: https://youtu.be/yV_Zp0Mi3xs

Popular playlists:
Realtime streaming systems: https://www.youtube.com/playlist?list=PLL7QpTxsA4se-mAKKoVOs3VcaP71X_LA-
Software Engineering: https://www.youtube.com/playlist?list=PLL7QpTxsA4sf6By03bot5BhKoMgxDUU17
Distributed systems and databases: https://www.youtube.com/playlist?list=PLL7QpTxsA4sfLDUnjBJXJGFhhz94jDd_d
Modern databases: https://www.youtube.com/playlist?list=PLL7QpTxsA4scSeZAsCUXijtnfW5ARlrsN

Stay Curious! Keep Learning!

#distributedsystems #kafka #s3 #streaming #realtimeanalytics #database #pinot #startree

lessons challenges building testing chapters advantages learnings real time open source databases espresso third eye keep learning kishore distributed systems startree apache pinot

Episode 28: Real Time Analytics with Apache Pinot and Startree

AWS re:Think Podcast

Aug 20, 2024 · 42:50

Companies need to provide real time insights to both customers and internal users. These insights power use cases such as personalization and fraud detection. StarTree Cloud is a real-time analytics platform built on Apache Pinot for building such applications that depend on real time insights. In this episode we meet with Chinmay Soman, Head of Product at Startree.ai to discuss the different dimensions of real-time analytics and how Apache Pinot and StarTree Cloud offer a robust platform for providing such insights to applications.

AWS Hosts: Nolan Chen & Malini Chatterjee
Email Your Feedback: rethinkpodcast@amazon.com

Resources:
StarTree: https://startree.ai
StarTree community Slack: https://communityinviter.com/apps/startreedata/startree-community
Apache Pinot Slack: https://communityinviter.com/apps/apache-pinot/apache-pinot
Serverless / Free forever workspace: https://stree.ai/free

head product companies slack cloud computing real time analytics startree apache pinot

Software at Scale 60 - Data Platforms with Aravind Suresh

Software at Scale

Aug 5, 2024 · 34:51

Aravind was a Staff Software Engineer at Uber, and currently works at OpenAI.

Edited Transcript

Can you tell us about the scale of data Uber was dealing with when you joined in 2018, and how it evolved?

When I joined Uber in mid-2018, we were handling a few petabytes of data. The company was going through a significant scaling journey, both in terms of launching in new cities and the corresponding increase in data volume. By the time I left, our data had grown to over an exabyte. To put it in perspective, the amount of data grew by a factor of about 20 in just a three to four-year period.

Currently, Uber ingests roughly a petabyte of data daily. This includes some replication, but it's still an enormous amount. About 60-70% of this is raw data, coming directly from online systems or message buses. The rest is derived data sets and model data sets built on top of the raw data.

That's an incredible amount of data. What kinds of insights and decisions does this enable for Uber?

This scale of data enables a wide range of complex analytics and data-driven decisions. For instance, we can analyze how many concurrent trips we're handling throughout the year globally. This is crucial for determining how many workers and CPUs we need running at any given time to serve trips worldwide.

We can also identify trends like the fastest growing cities or seasonal patterns in traffic. The vast amount of historical data allows us to make more accurate predictions and spot long-term trends that might not be visible in shorter time frames.

Another key use is identifying anomalous user patterns. For example, we can detect potentially fraudulent activities like a single user account logging in from multiple locations across the globe. We can also analyze user behavior patterns, such as which cities have higher rates of trip cancellations compared to completed trips.

These insights don't just inform day-to-day operations; they can lead to key product decisions. For instance, by plotting heat maps of trip coordinates over a year, we could see overlapping patterns that eventually led to the concept of Uber Pool.

How does Uber manage real-time versus batch data processing, and what are the trade-offs?

We use both offline (batch) and online (real-time) data processing systems, each optimized for different use cases. For real-time analytics, we use tools like Apache Pinot. These systems are optimized for low latency and quick response times, which is crucial for certain applications.

For example, our restaurant manager system uses Pinot to provide near-real-time insights. Data flows from the serving stack to Kafka, then to Pinot, where it can be queried quickly. This allows for rapid decision-making based on very recent data.

On the other hand, our offline flow uses the Hadoop stack for batch processing. This is where we store and process the bulk of our historical data. It's optimized for throughput – processing large amounts of data over time.

The trade-off is that real-time systems are generally 10 to 100 times more expensive than batch systems. They require careful tuning of indexes and partitioning to work efficiently. However, they enable us to answer queries in milliseconds or seconds, whereas batch jobs might take minutes or hours.

The choice between batch and real-time depends on the specific use case. We always ask ourselves: Does this really need to be real-time, or can it be done in batch? The answer to this question goes a long way in deciding which approach to use and in building maintainable systems.

What challenges come with maintaining such large-scale data systems, especially as they mature?

As data systems mature, we face a range of challenges beyond just handling the growing volume of data. One major challenge is the need for additional tools and systems to manage the complexity.

For instance, we needed to build tools for data discovery. When you have thousands of tables and hundreds of users, you need a way for people to find the right data for their needs. We built a tool called Databook at Uber to solve this problem.

Governance and compliance are also huge challenges. When you're dealing with sensitive customer data, you need robust systems to enforce data retention policies and handle data deletion requests. This is particularly challenging in a distributed system where data might be replicated across multiple tables and derived data sets.

We built an in-house lineage system to track which workloads derive from what data. This is crucial for tasks like deleting specific data across the entire system. It's not just about deleting from one table – you need to track down and update all derived data sets as well.

Data deletion itself is a complex process. Because most files in the batch world are kept immutable for efficiency, deleting data often means rewriting entire files. We have to batch these operations and perform them carefully to maintain system performance.

Cost optimization is an ongoing challenge. We're constantly looking for ways to make our systems more efficient, whether that's by optimizing our storage formats, improving our query performance, or finding better ways to manage our compute resources.

How do you see the future of data infrastructure evolving, especially with recent AI advancements?

The rise of AI and particularly generative AI is opening up new dimensions in data infrastructure. One area we're seeing a lot of activity in is vector databases and semantic search capabilities. Traditional keyword-based search is being supplemented or replaced by embedding-based semantic search, which requires new types of databases and indexing strategies.

We're also seeing increased demand for real-time processing. As AI models become more integrated into production systems, there's a need to handle more GPUs in the serving flow, which presents its own set of challenges.

Another interesting trend is the convergence of traditional data analytics with AI workloads. We're starting to see use cases where people want to perform complex queries that involve both structured data analytics and AI model inference.

Overall, I think we're moving towards more integrated, real-time, and AI-aware data infrastructure. The challenge will be balancing the need for advanced capabilities with concerns around cost, efficiency, and maintainability.

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev
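
The transcript above mentions a lineage system used to find every derived data set that a deletion request must reach. The sketch below is only an illustration of that idea, not Uber's implementation: it assumes lineage can be modeled as a mapping from a data set to the data sets built directly from it, and the function name `derived_datasets` and all table names are hypothetical.

```python
from collections import deque


def derived_datasets(lineage, source):
    """Return every dataset derived, directly or transitively, from `source`.

    `lineage` maps a dataset name to the datasets built directly from it.
    The traversal is breadth-first, so closer descendants come first.
    """
    seen = set()
    order = []
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for child in lineage.get(current, []):
            # Skip datasets already visited (shared descendants or cycles).
            if child not in seen and child != source:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


# Hypothetical lineage graph: all table names are made up for illustration.
lineage = {
    "raw.trips": ["derived.trips_daily", "derived.rider_features"],
    "derived.trips_daily": ["derived.city_growth"],
    "derived.rider_features": ["derived.city_growth"],
}

# Datasets that a deletion request against raw.trips would also need to touch.
affected = derived_datasets(lineage, "raw.trips")
print(affected)
```

In a batch setting like the one described, each data set in `affected` would then be queued for a file rewrite, since immutable files cannot be edited in place.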

ai data cost uber software scale traditional openai governance platforms kafka gpus pinot cpus suresh hadoop aravind uber pool staff software engineer apache pinot

Latest podcast episodes about apache pinot

Building Planetary-Scale Data Systems with Venice • Felix GV & Olimpiu Pop

What makes Apache Pinot so Fast?

Building Parquet into Apache Pinot ft. Neha Pawar | Ep. 5

Cyber Bites - 9th May 2025

Real-Time Analytics... Supercharging AI and Observability with StarTree | Episode #97

AI, Community, and the Future of Generative Applications

Learnings from building Open Source Distributed Systems with Kishore Gopalakrishna

Episode 28: Real Time Analytics with Apache Pinot and Startree

Software at Scale 60 - Data Platforms with Aravind Suresh

Testcontainers and Apache Pinot with Tim Veil | Ep. 51

Apache Pinot 1.1!

How Apache Pinot Achieves 200,000 Queries per Second (with Tim Berglund)

Uber & Open-Source: Ujwala Tulshigiri's Insights - Part 2 | Ep. 42

Uber's Scalable Tech Strategy with Ujwala Tulshigiri - Part 1 | Ep. 41

Best of 2023: Navigating Event Streaming with Eric Sammer, Decodable's CEO

Unraveling the Stream: Transactional vs Analytical Processing | Ep. 32

Unveiling the Speed of Star-Tree Index with Sandeep Dabade | Ep. 30

Deep Dive: Exploring StarTree's Advanced Features with Neha Pawar - Part 2 | Ep. 28

Neha Pawar on Apache Pinot's Edge in Real-Time Analytics | Ep. 27

Inside Stripe's Data Revolution with Johan Adami | Ep. 26

Apache Pinot 1.0!

Upserts & Deletes in Apache Pinot: A Discussion with Navina Ramesh | Ep. 19

Stackd 66: Streams, Messages, Events, and a Java User Group

Navigating Event Streaming with Eric Sammer, Decodable's CEO | Ep. 17

A Day in a Life of a Founding Engineer at StarTree: Apache Pinot with Neha Pawar

Tim Berglund on Realtime Analytics with Apache Pinot

Unlocking the Power of Real-Time Analytics • Tim Berglund & Adi Polak

Digging Deep Into Apache Pinot Internals | Ep. 6: ft Rong Rong

Uber, LinkedIn, Pinot and Open Source | Ep. 5: ft Mayank Shrivastava

Mr. Debezium on Pinot, Flink, CDC & Decodable | Ep. 4: Gunnar Morling

How Apache Pinot Began ft. Kishore Gopalakrishna of StarTree | Ep. 2

Kafka, Realtime analytics and Apache Pinot with Tim Berglund Part-2

Kafka, Realtime analytics and Apache Pinot with Tim Berglund Part-1

Enabling User-Facing Analytics using Apache Pinot with Kishore Gopalakrishna

Tiered Storage implementation by StarTree (Apache Pinot) with Neha Pawar

Kishore Gopalakrishna, Co-founder and CEO of StarTree, on Building Real-Time Analytics and Leveraging Community Support

Data analytics startup StarTree secures cash to expand its Apache Pinot–powered platform

Data analytics startup StarTree secures cash to expand its Apache Pinot–powered platform

Running Distributed Systems like a Pro with Mayank Shrivastava

E41: Real-time Analytics Powered by Startree & Apache Pinot

Apache Pinot and Real-Time Analytics with Neha Pawar

Pinot and StarTree with Chinmay Soman

Pinot and StarTree with Chinmay Soman

Speed of Apache Pinot - Cost of Cloud Storage

Accelerate Your Embedded Analytics With Apache Pinot