Podcasts about Apache Hive

  • 22 PODCASTS
  • 36 EPISODES
  • 54m AVG DURATION
  • INFREQUENT EPISODES
  • Jun 11, 2024 LATEST

POPULARITY (chart by year, 2017 to 2024)


Latest podcast episodes about Apache Hive

FeatherCast
Apache Hive 4.0 – Stamatis Zampetakis

Jun 11, 2024


Stamatis Zampetakis gives us an overview of what's in the new 4.0 release of Apache Hive, and what's next for the project.

Download: https://hive.apache.org/general/downloads/
Project information: https://hive.apache.org/

Data Radicals
Start with Stories, End with Data with Ashish Thusoo

Jun 7, 2023 (47:41)


As human beings, we're not accustomed to talking about data. In order to learn about new subjects, we traditionally use stories. However, bridging the gap between data and stories allows us to cross that barrier and create data-driven organizations.

In this episode, Satyen interviews Ashish Thusoo, GM of AI and ML at AWS. Previously, Ashish was the Founder and CEO of Qubole, a pioneering cloud data lake platform. He also served Facebook as the Engineering Manager of Data Infrastructure, where he co-created Apache Hive with the aim of democratizing data access and analytics. Satyen and Ashish discuss the accelerated push to the cloud, building a data culture, and how the economic climate is impacting customers.

“You have to remember, human beings are trained from the get-go to talk about stories, not data. That's how we learn. It takes special discipline to bring the conversation back to data, saying that, ‘You have this anecdote somewhere. Get me the data that proves or disproves it.’ That specific mindset has got to be inserted in the organization, and that's how it becomes data-driven. It's a very fine line, but if you cross that line, essentially you become a data-driven organization. But, if you stay on the side of anecdotes and stories, then you can't bridge that.” – Ashish Thusoo

Time Stamps:
  • (02:33): The SQL excitement that powered Hive
  • (13:42): The evolution of Qubole's founder hypothesis
  • (22:48): Navigating Amazon with AI/ML
  • (31:41): The future of AI/ML investment
  • (42:01): People are the foundation of the data culture
  • (45:57): Satyen's Takeaways

Sponsor: This podcast is presented by Alation.

Learn more:
  • Subscribe to the newsletter: https://www.alation.com/podcast/
  • Alation's LinkedIn Profile: https://www.linkedin.com/company/alation/
  • Satyen's LinkedIn Profile: https://www.linkedin.com/in/ssangani/

Links:
  • Follow Ashish on LinkedIn
  • Learn more about AI/ML services on AWS

The Cloud Pod
204: Amazon eats Pi with their own version of S3FS

Mar 23, 2023 (50:38)


On this episode of The Cloud Pod, the team discusses Amazon Pi Day, Google's upcoming I/O conference, the agricultural data manager by Microsoft, and the downturn in net profits of Oracle. They also round up cloud migrations by highlighting tools from different cloud service providers that are useful for the process. A big thanks to this week's sponsor, Foghorn Consulting, which provides full-stack cloud solutions with a focus on strategy, planning and execution for enterprises seeking to take advantage of the transformative capabilities of AWS, Google Cloud and Azure. This week's highlights

Trino Community Broadcast
41: Trino puts on its Hudi

Nov 16, 2022 (103:00)


Trino's initial use case was around replacing the Apache Hive runtime. As data lakes grew into prominence, it became clear that having a faster query engine didn't solve all problems. The Hive model itself was a huge bottleneck and didn't provide features that companies needed akin to data warehouses and databases. Apache Hudi is a new table format created out of Uber that aims to address many of these issues and usher in a new generation of data lake.

Tune in as we speak to the Trino Hudi connector contributor Sagar Sumit and Grace Lu, who uses Trino and Hudi at Robinhood, to discuss the new Hudi connector and future plans!

  • Intro Music: 0:00
  • Intro: 0:32
  • Releases: 14:43
  • Concept of the episode: Intro to Hudi and the Hudi connector: 22:29
  • Concept of the episode: Merge on read and copy on write tables: 28:28
  • Concept of the episode: Hudi metadata table: 39:24
  • Concept of the episode: Hudi data layout: 46:39
  • Concept of the episode: Robinhood Trino and Hudi use cases: 51:12
  • Concept of the episode: Current state and roadmap for the Hudi connector: 1:03:15
  • Pull request of the episode: PR 14445: Fault-tolerant execution for PostgreSQL and MySQL connectors: 1:08:14
  • Demo of the episode: Using the Hudi Connector: 1:13:34

Show Notes: https://trino.io/episodes/41.html
Show Page: https://trino.io/broadcast/

Data Gurus
Data Pipelines with Raj Bains | Ep. 183

Oct 11, 2022 (30:18)


Welcome to another engaging and informative episode of Data Gurus! Sima is delighted to have Raj Bains, the CEO and Founder at Prophecy, joining her for today's show! Raj started Prophecy five years ago and decided to build one big product to get data ready for analytics. He talks to Sima about his business and product and where he fits within the industry.

Raj's background
Raj started his professional career in the early 2000s. He started with graphics, worked as a developer, got into power tools, and worked at Microsoft. After that, he joined a team at NVIDIA to build CUDA, which now gets used for Bitcoin mining. Then he moved into the data space and shifted from engineering into marketing and product management. While selling big data platforms and managing Apache Hive at Hortonworks, a Hadoop company, he saw data users struggling to be productive with outdated tools. He solved that problem by building power tools that make it easy to get data ready for analytics very quickly. He has focused on doing that since then.

Prophecy
Raj started Prophecy in 2017 to build a visual tooling layer to get data ready for analytics. Data scientists, data analysts, and data engineers can use the tool to avoid having to do unnecessary work.

Standardization
There is a big problem with standardization within large organizations when it comes to building data pipelines.

The problem with unstandardized code
In the past, people within the industry used to write scripts. Then they started using standardized visual tools. After that, they moved to the cloud, but nobody wanted to get locked into a single tool, so they got rid of the visual development tools and started writing code. The code within companies is unstandardized, however, so everyone's code looks different. That has led to many different problems.

A solution
Prophecy's clients do visual drag-and-drop development, and Raj and his team write high-quality code for them. That has opened people up to returning to visual development and allowed much more standardization within the industry.

Building a solution
Raj and his team started working with a few big credit card companies and banks. Then they spent two or three years building their product. They still have a year or two of building ahead of them, but they have reached a point where they can solve the entire problem for companies using their product.

An enterprise standard
They have created an enterprise standard for all lines of business, where data analysts, data engineers, and everyone else within a company use the same tool and speak the same language.

Data pipelines
Prophecy makes it quick and easy for companies to build data pipelines. Data pipelines are essential for analytics because they provide the necessary information for asking intelligent questions. The data has to be high-quality, timely, and in the right shape to answer questions quickly.

The future
Raj believes that they will solve the issues with data management within the next three to five years.

Bio:
Raj is the founder & CEO of Prophecy. Previously, Raj led project management of Apache Hive at Hortonworks through their IPO. He also headed product management and marketing for a NewSQL database startup. Raj continues to actively develop compiler and database technologies in his quest to create data tools “that don't suck.” His engineering roles include developing a NewSQL database, building CUDA at NVIDIA as a founding engineer, and working on Microsoft Visual Studio as a compiler engineer.

Links:
  • Email me your thoughts! Sima@Infinity-2.com
  • LinkedIn
  • Twitter
  • Infinity-2.com

Connect with Raj:
  • Raj Bains on LinkedIn
  • Raj Bains on Twitter

Data Leadership Lessons Podcast
Data Tools that Don't Suck with Raj Bains - Episode 99

Oct 10, 2022 (43:00)


Watch this episode on YouTube: https://youtu.be/GZVugQeJ_XY

This Week's Guest is Data Tools Guru, Raj Bains

Raj is the founder & CEO of Prophecy. Previously, Raj led project management of Apache Hive at Hortonworks through their IPO. He also headed product management and marketing for a NewSQL database startup. Raj continues to actively develop compiler and database […]

That Tech Pod
How To Create Data Tools That Don't Suck With Prophecy CEO Raj Bains

Sep 13, 2022 (25:38)


Today on That Tech Pod, Laura and Gabi talk with Raj Bains. Raj is the founder & CEO of Prophecy. Previously, Raj led project management of Apache Hive at Hortonworks through their IPO. He also headed product management and marketing for a NewSQL database startup. Raj continues to actively develop compiler and database technologies in his quest to create data tools “that don't suck.” His engineering roles include developing a NewSQL database, building CUDA at NVIDIA as a founding engineer, and as a compiler engineer working on Microsoft Visual Studio.

AWS Podcast
#527: [INTRODUCING] Amazon EMR Serverless

Jun 5, 2022 (17:23)


Want an easier way to run big data applications in the cloud? Introducing Amazon EMR Serverless, a new deployment option in Amazon EMR that makes it easy for data analysts and engineers to run open-source big data analytics frameworks such as Apache Spark and Apache Hive without configuring, managing, and scaling clusters or servers. EMR Serverless automatically scales resources up and down to provide just the right amount of capacity for your application. You pay only for what you use and you can minimize concerns about over- or under-provisioning. In this episode, Radhika Ravirala (Principal Product Manager) and Natacha Maheshe (Principal Product Marketing Manager) join Hawn to discuss EMR Serverless use cases, how EMR Serverless addresses customers' challenges, and how you can get started.

  • Learn more about EMR Serverless: https://go.aws/390CjW1
  • Read the blog: https://go.aws/38Q86sA
  • Check out the What's New announcement: https://go.aws/39bMWFg
  • Watch the video: https://bit.ly/3NxW0nb

Melbourne AWS User Group
What's new in October 2021

Jan 17, 2022 (69:58)


A lot of things happened in October, and we talked about them all in early November. In this episode Arjen, Guy, and JM discuss a whole bunch of cool things that were released and may be a bit harsh on everything Microsoft.

News

Finally in Sydney
  • Amazon EC2 Mac instances are now available in seven additional AWS Regions
  • Amazon MemoryDB for Redis is now available in 11 additional AWS Regions

Serverless

Lambda:
  • AWS Lambda now supports triggering Lambda functions from an Amazon SQS queue in a different account
  • AWS Lambda now supports IAM authentication for Amazon MSK as an event source

Step Functions:
  • Now - AWS Step Functions Supports 200 AWS Services To Enable Easier Workflow Automation | AWS News Blog
  • AWS Batch adds console support for visualizing AWS Step Functions workflows

Amplify:
  • Announcing General Availability of Amplify Geo for AWS Amplify
  • AWS Amplify for JavaScript now supports resumable file uploads for Storage

Other:
  • Accelerating serverless development with AWS SAM Accelerate | AWS Compute Blog

Containers
  • Amazon EKS Managed Node Groups adds native support for Bottlerocket
  • AWS Fargate now supports Amazon ECS Windows containers
  • Announcing the general availability of cdk8s and support for Go | Containers
  • Monitoring clock accuracy on AWS Fargate with Amazon ECS
  • Amazon ECS Anywhere now supports GPU-based workloads
  • AWS Console Mobile Application adds support for Amazon Elastic Container Service
  • AWS Load Balancer Controller version 2.3 now available with support for ALB IPv6 targets
  • AWS App Mesh Metric Extension is now generally available

EC2 & VPC
  • New – Amazon EC2 C6i Instances Powered by the Latest Generation Intel Xeon Scalable Processors | AWS News Blog
  • Amazon EC2 now supports sharing Amazon Machine Images across AWS Organizations and Organizational Units
  • Amazon EC2 Hibernation adds support for Ubuntu 20.04 LTS
  • Announcing Amazon EC2 Capacity Reservation Fleet, a way to easily migrate Amazon EC2 Capacity Reservations across instance types
  • Amazon EC2 Auto Scaling now supports describing Auto Scaling groups using tags
  • Amazon EC2 now offers Microsoft SQL Server on Microsoft Windows Server 2022 AMIs
  • AWS Elastic Beanstalk supports Database Decoupling in an Elastic Beanstalk Environment
  • AWS FPGA developer kit now supports Jumbo frames in virtual ethernet frameworks for Amazon EC2 F1 instances
  • Amazon VPC Flow Logs now supports Apache Parquet, Hive-compatible prefixes and Hourly partitioned files
  • Network Load Balancer now supports TLS 1.3
  • New – Attribute-Based Instance Type Selection for EC2 Auto Scaling and EC2 Fleet | AWS News Blog
  • Amazon Lightsail now supports AWS CloudFormation for instances, disks and databases

Dev & Ops

CLI:
  • AWS Cloud Control API, a Uniform API to Access AWS & Third-Party Services | AWS News Blog
  • Now programmatically manage alternate contacts on AWS accounts

CodeGuru:
  • Amazon CodeGuru now includes recommendations powered by Infer
  • Amazon CodeGuru announces Security detectors for Python applications and security analysis powered by Bandit
  • Amazon CodeGuru Reviewer adds detectors for AWS Java SDK v2's best practices and features

IaC:
  • AWS CDK releases v1.121.0 - v1.125.0 with features for faster development cycles using hotswap deployments and rollback control
  • AWS CloudFormation customers can now manage their applications in AWS Systems Manager

Other:
  • NoSQL Workbench for Amazon DynamoDB now enables you to import and automatically populate sample data to help build and visualize your data models
  • Amazon Corretto October Quarterly Updates
  • Bulk Editing of OpsItems in AWS Systems Manager OpsCenter
  • AWS Fault Injection Simulator now supports Spot Interruptions
  • AWS Fault Injection Simulator now injects Spot Instance Interruptions

Security

Firewalls:
  • AWS Firewall Manager now supports centralized logging of AWS Network Firewall logs
  • AWS Network Firewall Adds New Configuration Options for Rule Ordering and Default Drop

Backups:
  • AWS Backup Audit Manager adds compliance reports
  • AWS Backup adds an additional layer for backup protection with the availability of AWS Backup Vault Lock

Other:
  • AWS Security Hub adds support for cross-Region aggregation of findings to simplify how you evaluate and improve your AWS security posture
  • Amazon SES now supports 2048-bit DKIM keys
  • AWS License Manager now supports Delegated Administrator for Managed entitlements

Data Storage & Processing
  • Goodbye Microsoft SQL Server, Hello Babelfish | AWS News Blog
  • Announcing availability of the Babelfish for PostgreSQL open source project
  • Announcing Amazon RDS Custom for Oracle
  • AWS announces AWS Snowcone SSD
  • Amazon RDS Proxy now supports Amazon RDS for MySQL Version 8.0
  • Amazon OpenSearch Service (successor to Amazon Elasticsearch Service) announces support for Cross-Cluster Replication
  • Amazon OpenSearch Service (successor to Amazon Elasticsearch Service) now comes with an improved management console
  • AWS Transfer Family customers can now use Amazon S3 Access Point aliases for granular and simplified data access controls
  • Amazon EMR now supports Apache Spark SQL to insert data into and update Apache Hive metadata tables when Apache Ranger integration is enabled
  • Amazon Neptune now supports Auto Scaling for Read Replicas
  • AWS Glue Crawlers support Amazon S3 event notifications
  • Amazon Keyspaces (for Apache Cassandra) now supports automatic data expiration by using Time to Live (TTL) settings
  • New – AWS Data Exchange for Amazon Redshift | AWS News Blog

AI & ML

SageMaker:
  • Announcing Fast File Mode for Amazon SageMaker
  • Amazon SageMaker Projects now supports Image Building CI/CD templates
  • Amazon SageMaker Data Wrangler now supports Amazon Athena Workgroups, feature correlation, and customer managed keys

Other:
  • Amazon Kendra launches support for 34 additional languages
  • Amazon Fraud Detector now supports event datasets
  • AWS announces a price reduction of up to 56% for Amazon Fraud Detector machine learning fraud predictions
  • Amazon Fraud Detector launches new ML model for online transaction fraud detection
  • Amazon Transcribe now supports custom language models for streaming transcription
  • Amazon Textract launches TIFF support and adds asynchronous support for receipts and invoices processing
  • Announcing Amazon EC2 DL1 instances for cost efficient training of deep learning models

Other Cool Stuff
  • AWS IoT Core now makes it optional for customers to send the entire trust chain when provisioning devices using Just-in-Time Provisioning and Just-in-Time Registration
  • AWS IoT SiteWise announces support for using the same asset models across different hierarchies
  • VMware Cloud on AWS Outposts Brings VMware SDDC as a Fully Managed Service on Premises | AWS News Blog
  • AWS Outposts adds new CloudWatch dimension for capacity monitoring
  • Amazon Monitron launches iOS app
  • Amazon Braket offers D-Wave's Advantage 4.1 system for quantum annealing
  • Amazon QuickSight adds support for Pixel-Perfect dashboards
  • Amazon WorkMail adds Mobile Device Access Override API and MDM integration capabilities
  • Announcing Amazon WorkSpaces API to create new updated images with latest AWS drivers
  • Computer Vision at the Edge with AWS Panorama | AWS News Blog
  • Amazon Connect launches API to configure hours of operation programmatically
  • New region availability and Graviton2 support now available for Amazon GameLift

Sponsors
  • CMD Solutions

Silver Sponsors
  • Cevo
  • Versent

AWS Podcast
#428: Accelerate your Analytics with Amazon EMR

Mar 7, 2021 (27:09)


Big data and analytics play a critical role in every organization’s digital transformation. For many customers, it’s the first initiative they embark on to accelerate time to insight and pace of innovation for their business. Today, Simon is joined by Roy Hasson, Sr. Manager WW Analytics Specialist, to talk about how customers can modernize on-premises, self-managed Apache Hadoop environments running Apache Hive and Apache Spark using Amazon EMR - a fully managed, easy to use, unified analytics platform to run your Hive, Spark, Presto and other big data and ML workloads. They also discuss the EMR Migration Program, developed to help customers quickly modernize their on-premises, self-managed Hadoop environments.

Python Bytes
#198 There's a beaver in your database and Anna-Lena drops by

Sep 11, 2020 (34:42)


Sponsored by us! Support our work through:
  • Our courses at Talk Python Training
  • Test & Code Podcast

Special guest: Anna-Lena Popkes

Brian #1: Easily create Python scripts using argparse
  • Back in the day, when I was writing most of my utility scripts in bash, I’d keep around an example.bash file with different types of arguments and flags and control structures, etc. to use as a template for new scripts.
  • Python has the same problem, or worse, if you use the built-in argparse instead of something like click or typer.
  • However, there are many times where you don’t want to have any external dependencies on a script, so built-in argparse it is.
  • But I definitely relate to this tweet: “Every time I write a python script, I have to go back to an old script of mine to remember how to set up argparse. For some reason it just does not stick in my mind AT ALL.” - Joshua Schraiber
  • Well, then steps in Ken Youens-Clark with a little utility called new.py.
  • It’s not pip install-able, so you gotta clone it or fork it or copy it or whatever. But it’s cool and fairly simple to hack on yourself, and you’re going to want to make it your own anyway, so that’s fine.
  • You do something like python new.py foo.py and it creates an example starter foo.py for you with:
      • a positional argument
      • a string argument
      • an integer argument
      • a file argument (which also checks to make sure the file is readable)
      • a boolean flag
  • Modify, copy, paste, delete, whatever you want to it now to make it the script you need super fast.
  • Also, add a -t flag to it, like this: python new.py -t foo.py, and it generates a test stub to test your new script.

Michael #2: DBeaver Database UI Tool
  • via exhuma
  • Remember I mentioned BeeKeeper
  • Free multi-platform database tool for developers, database administrators, analysts and all people who need to work with databases.
  • Supports all popular databases: MySQL, PostgreSQL, SQLite, Oracle, DB2, SQL Server, Sybase, MS Access, Teradata, Firebird, Apache Hive, Phoenix, Presto, etc.
  • Out of the box, DBeaver supports more than 80 databases.
  • Having usability as its main goal, DBeaver offers:
      • Carefully designed and implemented User Interface
      • Support of Cloud datasources
      • Support for Enterprise security standard
      • Capability to work with various extensions for integration with Excel, Git and others
      • Multiplatform support
  • Nice UML table/entity diagrams
  • Open source: github.com/dbeaver/dbeaver
  • Based on Eclipse

Anna-Lena #3: pdb++ debugger
  • I recently switched from using ipdb to pdb++
  • Extension of the pdb module of the standard library
  • Fully compatible with pdb but introduces some new features to improve the debugging experience
  • Can easily be installed with pip install pdbpp (pdb++ is not a valid package name)
  • Favorites: 1) sticky mode, 2) smart command parsing
  • Sticky mode: “When in this mode, every time the current position changes, the screen is repainted and the whole function shown. Thus, when doing step-by-step execution you can easily follow the flow of the execution.”
  • Smart command parsing:
      • pdb tries to interpret entered commands as one of its builtin commands
      • Inconvenient in some situations. Example: printing the value of a local variable which happens to have the same name as one of the commands (e.g. c could refer to a local variable but is interpreted as the command ‘continue’)
      • pdb++ solution: in case of ambiguity / if a variable with the same name exists in the scope, it’s preferred
      • To execute the corresponding command, you can prefix it with !!

Brian #4: Markdown toys
  • HackMD.io
      • I just found out about HackMD at hackmd.io and I’m quite impressed.
      • “HackMD is a realtime, multi-platform collaborative markdown knowledge base. You can write notes with other people on your desktop, tablet or even on the phone.”
      • Two-panel markdown editor with some nice menus to help you remember how to do all the fancy stuff like inserting pictures; tables, with all the table options; quotes, references, TOC blocks, links, etc.
      • Great for people learning Markdown and for collaborating.
      • Even has fancy addons like math expressions, UML diagrams, and todo lists
      • And now, sync with github works, so you can edit files that are saved on github.
  • Markdown Guide
      • Just a really good, clean, “… free and open-source reference guide that explains how to use Markdown, the simple and easy-to-use markup language you can use to format virtually any document.”
      • Includes: Getting started page; Cheat Sheet for super common elements; Basic Syntax for more of the details; Extended Syntax page; Tools with links to lots of tools, including HackMD

Michael #5: Python Malware and obfuscation
  • via Connor Ferster
  • Malware is starting to appear that has been written using the Python programming language.
  • Traditionally, most malware has been written in compiled languages, such as C or C++.
  • Uses all the tools we promote for distributing apps: py2exe and py2app (which I used for urlify)
  • Specific examples of Python malware include SeaDuke, which was used against the Democratic National Committee back in 2015 and 2016.
  • Lots of interesting tools:
      • uncompyle6: The successor to decompyle, uncompyle, and uncompyle2. uncompyle6 is a native Python cross-version decompiler and fragment decompiler. It can be used to translate Python bytecode back into Python source code.
      • pyinstxtractor.py: The PyInstaller Extractor can extract Python data from PyInstaller compiled executables.
  • Detecting Python Compiled Executables: Both PyInstaller and py2exe, when compiled on Windows, place unique strings within their binary executable.

Anna-Lena #6: attrs package
  • What is attrs? → Python package that simplifies writing classes (dunder methods are created automatically)
  • How is this related to dataclasses?
      • PEP 557 added Data Classes to Python 3.7 that resemble attrs in many ways.
      • The PEP was inspired by attrs and is the result of the wish to simplify writing classes without having to deal with the problems of namedtuples
      • Main difference: data classes are less powerful than attrs (certain features were sacrificed for the sake of simplicity)
      • Example: with attrs you can use validators in your initializer that perform some kind of validation of the input arguments (e.g. checking that they have the correct type)

Extras:
  • Michael: Was a guest on the Technado show last week.
  • Our Move from Excel to Python and Pandas course is out!

Joke: New code quality metric: WTFs/minute
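
For readers who want to see what the argparse starter from Brian's first item might look like, here is a minimal sketch using only the standard library. It is not the output of new.py; it only mirrors the five argument kinds the notes list (positional, string, integer, readable file, boolean flag). The option names `--name`, `--count`, `--file`, and `--verbose` and the helper `readable_file` are hypothetical choices made for illustration.

```python
import argparse
import os


def readable_file(path: str) -> str:
    """argparse type callback: accept the path only if it is a readable file."""
    if not (os.path.isfile(path) and os.access(path, os.R_OK)):
        raise argparse.ArgumentTypeError(f"cannot read file: {path}")
    return path


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with one argument of each kind listed in the notes.

    All option names here are illustrative assumptions, not new.py's own.
    """
    parser = argparse.ArgumentParser(description="Starter script (illustrative)")
    parser.add_argument("positional", help="a positional argument")
    parser.add_argument("-n", "--name", default="", help="a string argument")
    parser.add_argument("-c", "--count", type=int, default=0,
                        help="an integer argument")
    parser.add_argument("-f", "--file", type=readable_file, default=None,
                        help="a file argument that must be readable")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="a boolean flag")
    return parser


def main(argv=None) -> argparse.Namespace:
    """Parse argv (defaults to sys.argv[1:] when None) and return the result."""
    args = build_parser().parse_args(argv)
    if args.verbose:
        print(f"positional={args.positional!r} name={args.name!r} "
              f"count={args.count}")
    return args


# Demo with an explicit argument list, so it does not depend on sys.argv.
demo = main(["foo", "--name", "bar", "--count", "3", "--verbose"])
```

Passing a list to `main` keeps the sketch testable; in a real script you would call `main()` with no arguments so that argparse reads the command line. A custom `type` callback is used for the file check so that an unreadable path is reported as a normal argparse usage error.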

null++: بالعربي
Episode[10][WIT]: Data science with Aya El Dorghamy

Mar 7, 2020 (52:41)


  • Apache Hadoop
  • HDFS
  • Apache Hive
  • Apache Spark
  • Presto
  • Architecture Of Giants: Data Stacks At Facebook, Netflix, Airbnb, And Pinterest
  • Data Wrangling
  • Null++ Docker Episode
  • Julia Language
  • kaggle
  • SED Podcast, Episode: Slack Data Platform with Josh Wills
  • Article: Software 2.0

Aya's Recommendation for learning:
  • Towards data science
  • Statistics and Data Science MicroMasters
  • DataCamp
  • Udemy: Python for Data Science and Machine Learning Bootcamp
  • Coursera's Deep Learning Specialization
  • Lex Fridman Artificial Intelligence Podcast & YouTube channel

Episode Notes:
  • Aya: How To Lie with Statistics (book)
  • Luay: Great Expectations Data Pipeline Testing Framework
  • Alfy: JAM Stack

AWS re:Invent 2019
ANT327-R1: Migrating Apache Spark and Hive from on-premises to Amazon EMR

Dec 7, 2019 (49:29)


In this session, you hear from representatives of Intuit about how they migrated the company's analytics, data processing (ETL), and data science workloads running on Apache Hive and Spark to Amazon EMR to reduce costs, increase availability, and improve performance. This session focuses on key motivations and the benefits of a move to the cloud and also gives details of key architectural changes and best practices.

Skillbyte Technologie Podcast
Podcast #3: The Hadoop Ecosystem - What Does the Best-Known Big Data Platform Offer?

Oct 21, 2019 (43:33)


This podcast covers the topic: The Hadoop Ecosystem - What Does the Best-Known Big Data Platform Offer?

Contents:
  1. What is Apache Hadoop?
  2. Who developed Hadoop?
  3. Which companies use Hadoop?
  4. How does Hadoop work in detail?
  5. What experience has Skillbyte had with Apache Hadoop?

Chapter markers:
  • 01:17 -> What is Apache Hadoop?
  • 04:20 -> Which companies is Hadoop of interest to?
  • 07:38 -> Composition of the Hadoop distribution
  • 10:17 -> How does Hadoop work in detail?
  • 18:47 -> Apache Hive in detail
  • 24:40 -> Apache Spark in detail
  • 31:00 -> How can companies use the Hadoop ecosystem?
  • 34:50 -> How German companies are currently positioning themselves in the big data space
  • 40:20 -> What experience has Skillbyte had with Apache Hadoop?

Subscribe to this podcast and visit us at https://www.skillbyte.de
Feedback and questions are welcome at podcast@skillbyte.de

Roaring Elephant
Episode 114 – Roaring News

Nov 13, 2018 (26:48)


In this serving of bite-sized Big Data News we talk about the IBM takeover of Red Hat, a new botnet going for unprotected Hadoop nodes and a somewhat disappointing Cloudera blog post.

IBM To Acquire Red Hat
  • https://investors.redhat.com/news-and-events/press-releases/2018/10-28-2018-184027500
  • https://newsroom.ibm.com/2018-10-28-IBM-To-Acquire-Red-Hat-Completely-Changing-The-Cloud-Landscape-And-Becoming-Worlds-1-Hybrid-Cloud-Provider

New DDoS botnet goes after Hadoop enterprise servers
  • https://www.zdnet.com/article/new-ddos-botnet-goes-after-hadoop-enterprise-servers/
  • (remember Dr.Who? https://medium.com/@neerajsabharwal/hadoop-yarn-hack-9a72cc1328b6 )

New in Cloudera Enterprise 6: Apache Hive 2.1 (By the Cloudera Hive Team)
  • http://blog.cloudera.com/blog/2018/10/new-in-cloudera-enterprise-6-apache-hive-2-1/
  • https://www.cloudera.com/documentation/enterprise/6/release-notes/topics/rg_cdh_601_unsupported_features.html#hive_c6_unsupported_features
  • https://hive.apache.org/downloads.html
  • https://issues.apache.org/jira/browse/HIVE-17129

Please use the Contact Form on this blog or our twitter feed to send us your questions, or to suggest future episode topics you would like us to cover.

Roaring Elephant
Episode 95 – DataWorks Summit in San Jose with Ward Bekker

Jul 3, 2018 (112:50)


Since both Dave and Jhon were not able to attend the DataWorks Summit in San Jose a couple of weeks ago, we have a guest, Ward Bekker, who was happy to join and educate us on the subject.

DataWorks Summit San Jose 2018
In this episode we discuss the daily keynotes and Ward's selection of sessions at the Summit, ranging from the new things in YARN 3.0 to materialized views in Hive and much more.

Ward Bekker (LinkedIn)
Pre-Sales Solutions Engineer II @ Hortonworks

Some of the sessions and topics discussed are:
  • Apache Hadoop State of the union: https://dataworkssummit.com/san-jose-2018/session/apache-hadoop-yarn-state-of-the-union-2/
  • What is new in Apache Hive: https://dataworkssummit.com/san-jose-2018/session/what-is-new-in-apache-hive/
  • Running distributed TensorFlow in production: https://dataworkssummit.com/san-jose-2018/session/running-distributed-tensorflow-in-production-challenges-and-solutions-on-yarn-3-0-2/
  • Just the sketch: advanced streaming analytics in Apache Metron: https://dataworkssummit.com/san-jose-2018/session/just-the-sketch-advanced-streaming-analytics-in-apache-metron/
  • Containers and Big Data: https://dataworkssummit.com/san-jose-2018/session/containers-and-big-data/
  • Catch a hacker in realtime: Live visuals of bots and bad guys: https://dataworkssummit.com/san-jose-2018/session/catch-a-hacker-in-realtime-live-visuals-of-bots-and-bad-guys/
  • HDFS tiered storage: https://dataworkssummit.com/san-jose-2018/session/hdfs-tiered-storage/
  • Geospatial data platform at Uber: https://dataworkssummit.com/san-jose-2018/session/geospatial-data-platform-at-uber/
  • What's the Hadoop-la about Kubernetes?: https://dataworkssummit.com/san-jose-2018/session/whats-the-hadoop-la-about-kubernetes/

Please use the Contact Form on this blog or our twitter feed to send us your questions, or to suggest future episode topics you would like us to cover.

AWS Podcast
#226: Service Update Show

Jan 14, 2018 (22:14)


Simon reviews some of the updates to AWS services, a new Alexa Skill, a repository of all AWS re:Invent 2017 content as well as a black-belt tip!

Shownotes:
  • Amazon Route 53 Releases Auto Naming API for Service Name Management and Discovery | https://aws.amazon.com/about-aws/whats-new/2017/12/amazon-route-53-releases-auto-naming-api-name-service-management/
  • Elasticsearch 6.0 now available on Amazon Elasticsearch Service | https://aws.amazon.com/about-aws/whats-new/2017/12/elasticsearch-6-0-now-available-on-amazon-elasticsearch-service/
  • Run Petabyte-Scale Clusters on Amazon Elasticsearch Service Using I3 instances | https://aws.amazon.com/about-aws/whats-new/2017/12/run-petabyte-scale-clusters-on-amazon-elasticsearch-service-using-i3-instances/
  • Encryption at rest now available on Amazon Elasticsearch Service | https://aws.amazon.com/about-aws/whats-new/2017/12/encryption-at-rest-now-available-on-amazon-elasticsearch-service/
  • AWS Storage Gateway adds upload notifications, MIME-type detection, SUSE Linux support and additional features to File Gateway | https://aws.amazon.com/about-aws/whats-new/2017/11/aws-storage-gateway-adds-upload-notifications-mime-type-detection-suse-linux-support-and-additional-features-to-file-gateway/
  • Amazon CloudWatch Alarms now alerts you when any M out of N metric datapoints in an interval are above your threshold | https://aws.amazon.com/about-aws/whats-new/2017/12/amazon-cloudwatch-alarms-now-alerts-you-when-any-m-out-of-n-metric-datapoints-in-an-interval-are-above-your-threshold/
  • Amazon CloudWatch now supports two new chart visualization options in metrics and dashboards | https://aws.amazon.com/about-aws/whats-new/2017/12/amazon-cloudwatch-now-supports-two-new-chart-visualization-options-in-metrics-and-dashboards/
  • Amazon CloudWatch Events now Supports AWS CodeBuild as an Event Target | https://aws.amazon.com/about-aws/whats-new/2017/12/amazon-cloudwatch-events-now-supports-aws-codebuild-as-an-event-target/
  • Amazon CloudWatch Logs now Supports KMS Encryption | https://aws.amazon.com/about-aws/whats-new/2017/12/amazon-cloudwatch-logs-now-supports-kms-encryption/
  • New – Amazon CloudWatch Agent with AWS Systems Manager Integration – Unified Metrics & Log Collection for Linux & Windows - AWS News Blog | https://aws.amazon.com/blogs/aws/new-amazon-cloudwatch-agent-with-aws-systems-manager-integration-unified-metrics-log-collection-for-linux-windows/
  • AWS CloudTrail Enhances Event History View and Search | https://aws.amazon.com/about-aws/whats-new/2017/12/aws-cloudtrail-enhances-event-history-view-and-search/
  • AWS CodePipeline Adds Support for Amazon ECS and AWS Fargate | https://aws.amazon.com/about-aws/whats-new/2017/12/aws-codepipeline-adds-support-for-amazon-ecs-and-aws-fargate/
  • You Can Now Automate Amazon AppStream 2.0 Agent Updates on your Streaming Instances | https://aws.amazon.com/about-aws/whats-new/2017/12/you-can-now-automate-amazon-appStream2-agent-updates-on-your-streaming-instances/
  • Amazon AppStream 2.0 Now Supports Tagging to Track Your Resources | https://aws.amazon.com/about-aws/whats-new/2017/12/amazon-appstream-2-0-now-supports-tagging-to-track-your-resources/
  • Easily Apply Amazon Cloud Directory Schema Changes with In-Place Schema Upgrades | https://aws.amazon.com/about-aws/whats-new/2017/12/easily-apply-amazon-cloud-directory-schema-changes-with-in-place-schema-upgrades/
  • AWS Organizations Enhancements | https://aws.amazon.com/about-aws/whats-new/2017/12/aws-organizations-enhancements/
  • AWS Single Sign-On – Cloud SSO Service - AWS | https://aws.amazon.com/single-sign-on/
  • EFS File Sync Now Available in All Amazon EFS Regions | https://aws.amazon.com/about-aws/whats-new/2017/12/efs-file-sync-now-available-in-all-amazon-efs-regions/
  • Use AWS Budgets advanced filtering options to extend your cost monitoring capabilities | https://aws.amazon.com/about-aws/whats-new/2017/12/use-aws-budgets-advanced-filtering-options-to-extend-your-cost-monitoring-capabilities/
  • Amazon RDS for PostgreSQL Supports R4, T2 and M4 Instance Types | https://aws.amazon.com/about-aws/whats-new/2017/12/amazon-rds-for-postgresql-supports-r4-t2-and-m4-instance-types/
  • Support for Apache Spark 2.2.1 with Amazon SageMaker integration and Apache Hive 2.3.2 on Amazon EMR release 5.11.0 | https://aws.amazon.com/about-aws/whats-new/2017/12/support-for-apache-spark-2_2_1-with-amazon-sagemaker-integration-and-apache-hive-2_3_2-on-amazon-emr-release-5_11_0/
  • Amazon ECS Adds ELB Health Check Grace Period | https://aws.amazon.com/about-aws/whats-new/2017/12/amazon-ecs-adds-elb-health-check-grace-period/
  • AWS Direct Connect Update – Ten New Locations Added in Late 2017 - AWS News Blog | https://aws.amazon.com/blogs/aws/aws-direct-connect-update-ten-new-locations-added-in-late-2017/
  • AWS Alexa Skill Now Available | https://aws.amazon.com/about-aws/whats-new/2017/11/aws-alexa-skill-now-available/
  • AWS re:Invent 2017 Sessions & Podcast Feed | http://aws-reinvent-audio.s3-website.us-east-2.amazonaws.com/2017/2017.html
  • Specifying Parameter Values for the AWS Command Line Interface - AWS Command Line Interface | https://docs.aws.amazon.com/cli/latest/userguide/cli-using-param.html
  • AWS Systems Manager Parameter Store - AWS Systems Manager | https://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-paramstore.html

Big Data Beard
Hadoop Community Updates with Alan Gates from Hortonworks

Big Data Beard

Play Episode Listen Later Dec 19, 2017 52:13


Special guest Alan Gates from Hortonworks helps us explore where the Hadoop community is headed in 2018 and what's new with Apache Hive. We also take time to dive into Alan's background at working with Hadoop, Hive, and Pig in the early days at Yahoo. Show Notes: http://bit.ly/BDB_AGH

AWS re:Invent 2017
STG312: Best Practices for Building a Data Lake in Amazon S3 and Amazon Glacier, with Special Guests, Airbnb & Viber

AWS re:Invent 2017

Play Episode Listen Later Nov 30, 2017 61:28


Learn how to build a data lake for analytics in Amazon S3 and Amazon Glacier. In this session, we discuss best practices for data curation, normalization, and analysis on Amazon object storage services. We examine ways to reduce or eliminate costly extract, transform, and load (ETL) processes using query-in-place technology, such as Amazon Athena and Amazon Redshift Spectrum. We also review custom analytics integration using Apache Spark, Apache Hive, Presto, and other technologies in Amazon EMR. You'll also get a chance to hear from Airbnb & Viber about their solutions for Big Data analytics using S3 as a data lake.

Roaring Elephant
Episode 48 – Alan Gates on the DataWorks Summit (Part 3)

Roaring Elephant

Play Episode Listen Later Aug 1, 2017 35:36


In this third part of our interview with Alan Gates, PMC member for various Apache projects including Apache Hive and co-founder of Hortonworks, we talk about his sessions at the DataWorks Summits and about the Summits in general. [Image taken from Linux.com] 00:00 Recent events Since both Dave and Jhon are still on vacation, this episode was pre-recorded ahead of time. Because of this, we do not have any recent events to talk about. 02:38 Alan Gates on the DataWorks Summit (Part 3) Since this part of the interview goes public after the San Jose Summit, it is too late to submit abstracts for that particular summit. However, the Australian version is in a couple of months so please go to the DataWorks website for more information about that one. 35:35 End Please use the Contact Form on this blog or our twitter feed to send us your questions, or to suggest future episode topics you would like us to cover.

Roaring Elephant
Episode 44 – Suicidal Spark

Roaring Elephant

Play Episode Listen Later Jun 6, 2017 71:17


In this episode we're joined by Youen Chéné and Aurélien Vandel from Saagie who talk to us about their experiences deploying Spark Streaming workloads in production (based on their Dataworks Summit talk), what worked well, what didn't and what they'd recommend you might want to do if you follow in their footsteps.   Enjoy! 00:00 Recent events Dave Big Data Videos http://www.kdnuggets.com/2017/05/top-recent-big-data-videos-youtube.html https://www.youtube.com/watch?v=RQ9czRAdmMs https://www.youtube.com/watch?v=hsoKlE67rTw Jhon InsightOut: The role of Apache Atlas in the open metadata ecosystem http://www.ibmbigdatahub.com/blog/insightout-role-apache-atlas-open-metadata-ecosystem https://www.youtube.com/watch?v=yQvmoDtGgbo Apache Atlas API Version 2 https://atlas.incubator.apache.org/api/v2/index.html Cloud giants 'ran out' of fast GPUs for AI boffins https://www.theregister.co.uk/2017/05/22/cloud_providers_ai_researchers/ Benchmark: Sub-Second Analytics with Apache Hive and Druid https://hortonworks.com/blog/sub-second-analytics-hive-druid/ 26:00 Spark Streaming and Suicidal Tendencies https://dataworkssummit.com/munich-2017/sessions/spark-streaming-and-suicidal-tendencies/ Video: https://www.youtube.com/watch?v=Us8kizlbJtc Slides: https://www.slideshare.net/HadoopSummit/spark-streaming-and-suicidal-tendencies Youen Chéné, CTO @Saagie https://www.linkedin.com/in/youenchene/ Aurélien Vandel, Data Engineer https://www.linkedin.com/in/aur%C3%A9lien-vandel-060b5b8a/ 01:11:17 End Please use the Contact Form on this blog or our twitter feed to send us your questions, or to suggest future episode topics you would like us to cover.

Roaring Elephant
Episode 43 – Alan Gates talks Hive (Part 2)

Roaring Elephant

Play Episode Listen Later May 23, 2017 54:24


In this episode we discuss the maturity of the Hadoop ecosystem and how hard it currently still is to get the value out of data. In the main section, we will have the second part of the interview with Alan Gates, this time talking about the place Hive has in the ecosystem. We still have more from Alan so stay tuned for more Hive goodness in future episodes! 00:00 Recent events Dave PredictionIO 0.11 release https://github.com/apache/incubator-predictionio/blob/v0.11.0-incubating/RELEASE.md http://predictionio.incubator.apache.org/ http://predictionio.incubator.apache.org/start/ http://predictionio.incubator.apache.org/system/ http://predictionio.incubator.apache.org/gallery/template-gallery/ https://techcrunch.com/2016/02/19/salesforce-acquires-predictionio-to-build-up-its-machine-learning-muscle/ Jhon Ultra-fast OLAP Analytics with Apache Hive and Druid – Part 1 of 3 https://hortonworks.com/blog/apache-hive-druid-part-1-3/ Why Big Data Hasn’t Yet Made a Dent on Farms https://www.wsj.com/articles/why-big-data-hasnt-yet-made-a-dent-on-farms-1494813720 The Pentagon is hunting ISIS using big data and machine learning https://www.engadget.com/2017/05/15/the-pentagon-is-hunting-isis-using-big-data-and-machine-learning/ 34:00 Alan Gates talks Hive (Part 2) Here is the second part of our interview with Alan Gates where he talks about Apache Hive. 54:24 End Please use the Contact Form on this blog or our twitter feed to send us your questions, or to suggest future episode topics you would like us to cover.

Roaring Elephant
Episode 42 – Alan Gates talks Hive (Part 1)

Roaring Elephant

Play Episode Listen Later May 9, 2017 64:13


Welcome to the life, the universe and everything episode of the Roaring Elephant Podcast. We talk some news and this episode got a little bit ranty... Apologies for that; to balance it out we have a chat with Alan Gates talking about Hive for you.   There was so much Alan Gates goodness, we've split it over a few sessions and here's part one...   07:00 Recent events Dave Metron graduates to Apache TLP status https://blogs.apache.org/foundation/entry/apache-software-foundation-announces-apache https://hortonworks.com/blog/congratulations-apache-metron-tlp/ 2017 Big Data Landscape https://www.linkedin.com/pulse/firing-all-cylinders-2017-big-data-landscape-matt-turck You’re doing Hadoop and Spark wrong and they will probably fail https://www.theregister.co.uk/2017/02/21/hadoop_and_spark_risks_and_opportunities/ Jhon Apache Impala Leads Traditional Analytic Database http://blog.cloudera.com/blog/2017/04/apache-impala-leads-traditional-analytic-database/ Cloudera Data Science Workbench blogs http://blog.cloudera.com/blog/2017/04/use-your-favorite-python-library-on-pyspark-cluster-with-cloudera-data-science-workbench/ http://blog.cloudera.com/blog/2017/04/deep-learning-frameworks-on-cdh-and-cloudera-data-science-workbench/ http://blog.cloudera.com/blog/2017/04/bigdl-on-cdh-and-cloudera-data-science-workbench/ 34:00 Alan Gates talks Hive This is the first part of our interview with Alan Gates, where he talks about Apache Hive. 01:04:12 End Please use the Contact Form on this blog or our twitter feed to send us your questions, or to suggest future episode topics you would like us to cover.

Roaring Elephant
Episode 39 – Dataworks Summit Europe – Day 1

Roaring Elephant

Play Episode Listen Later Apr 5, 2017 94:31


In this episode of the Roaring Elephant podcast, Dave and I attend the Dataworks summit, meet listeners, sit in on sessions and give our overall view of the day! It's the next best thing to being here.   If you ARE here, then look out for us, we'll exchange limited edition Roaring Elephant stickers for audio clips. 00:00 Intro Roaring Elephant Roadshow Day 1- Direct from Munich! 03:25 Session Discussions Our review of the sessions, what we liked, what we learned, what we'd recommend you go and check out afterwards: Keynote An Apache Hive Based Data Warehouse Interactive Analytics at Scale in Apache Hive using Druid Hadoop 3.0 in a Nutshell Running Services on YARN Streamline - Stream Analytics for Everyone (AKA SAM: Streaming Analytics Manager) Apache Atlas: Governance for your Data File Format Benchmark - Avro, JSON, ORC and Parquet An Approach for Multi-Tenancy through Apache Knox 01:27:00 Wrap up Some final overall observations and looking forward to day 2! 01:34:31 End Please use the Contact Form on this blog or our twitter feed to send us your questions, or to suggest future episode topics you would like us to cover.

AWS re:Invent 2016
BDM401: Deep Dive: Amazon EMR Best Practices & Design Patterns

AWS re:Invent 2016

Play Episode Listen Later Dec 24, 2016 57:00


Amazon EMR is one of the largest Hadoop operators in the world. In this session, we introduce you to Amazon EMR design patterns such as using Amazon S3 instead of HDFS, taking advantage of both long and short-lived clusters, and other Amazon EMR architectural best practices. We talk about how to scale your cluster up or down dynamically and introduce you to ways you can fine-tune your cluster. We also share best practices to keep your Amazon EMR cluster cost-efficient. Finally, we dive into some of our recent launches to keep you current on our latest features. This session will feature Asurion, a provider of device protection and support services for over 280 million smartphones and other consumer electronics devices. Asurion will share how they architected their petabyte-scale data platform using Apache Hive, Apache Spark, and Presto on Amazon EMR.

Roaring Elephant
Episode 14 – Hadoop Summit – Retrospective

Roaring Elephant

Play Episode Listen Later Apr 26, 2016 51:47


After the last two special edition episodes where we quickly covered each Summit day in a "same-day" episode, we go over the full event in this episode, highlighting the sessions we enjoyed the most and sharing our general feelings about the 2016 Hadoop Summit in Dublin.   00:00 Recent events Summit! Sessions on youtube Meetings and planning, Apache Metron https://cwiki.apache.org/confluence/display/METRON/Metron+Wiki https://community.hortonworks.com/articles/26047/apche-metron-tp1-blog-series.html Setting up a new podcast recording "studio" 09:00 Hadoop Summit - Retrospective Summit Schedule App Hortonworks emphasising  Streaming ingest using Nifi, but the other talks did not so much Summit video sessions are starting to appear online https://www.youtube.com/channel/UCAPa-K_rhylDZAUHVxqqsRA/videos Next year: Munich Day one sessions: It's not the size of your cluster, It's how you use it Big Fish - David Darden & Don Smith Unified stream and batch processing with Apache Flink Artisans Gmbh - Ufuk Celebi Taming the Elephant  Hortonworks - Paul Codding How To: A beginners guide to becoming an apache contributor Teradata - Venkatesh  On-Demand HDP Clusters using Cloudbreak and Ambari Symantec - Karthik Karuppaiya & Narendra Bidari Machine Learning in Big Data - Look Forward or be left behind Redpoint Global Inc - Bill Porto Past, Present, Future of hadoop at LinkedIn LinkedIn - Carl Steinbach Migrating Hundreds of Pipelines in Docker Containers Spotify - Noa Resare Day two sessions: MLLeap: Or how to Productionize Data science workflows using Spark Shift Technologies - Mihkail Semenluk & TrueCar - Hollin Wilkins Scaling out to 10 Clusters, 1000 Users, and 10,000 Flows: The Dali Experience at LinkedIn Carl Steinbach, LinkedIn Hadoop Platform at Yahoo: A Year in Review Sumeet Singh, Yahoo!, Inc. 
Apache Hive 2.0 SQL Speed Scale Hortonworks - Alan Gates Telematics with Hadoop and Nifi Adam Morton, Admiral Insurance - Simon Elliston Ball, Hortonworks Apache Eagle - Monitor Hadoop in Real-Time eBay - Young Zang & Arun Manoharan 43:18 Questions from our Listeners Great question in from Rene about small businesses and Big Data which we'll cover on a future episode! Also Rene's feedback has helped us tweak the feedback form so it's easier to use. Is this a vendor podcast? No, we're all community! :o) How do you record the podcast, what is your equipment? Skype-saurus: the original, expensive hardware solution. http://www.leoville.com/the-skypesaurus-story (Sadly, this no longer seems to be available.) Skype-o-saurus: a cheaper solution using an OS-X aggregate sound device. https://drupalize.me/blog/201504/recording-podcasts-creating-skype-o-saurus   51:48 End Please use the Contact Form on this blog or our twitter feed to send us your questions, or to suggest future episode topics you would like us to cover.

Roaring Elephant
Episode 9 – SQL in Hadoop

Roaring Elephant

Play Episode Listen Later Mar 8, 2016 53:38


SQL was one of the first data access methods added to vanilla Hadoop. Considering that many of the people working with Hadoop in the early days came from a database background, this is not surprising. Since then, the SQL ecosystem in Hadoop has grown considerably and in this episode we do a general overview of many of the available choices. This episode runs a bit longer than normal but we hope you'll find it worthwhile! 00:00 Recent events Spark masterclasses NiFi on trains Mifid II and the active archive Mobile World Congress 08:30 Main Topic SQL solutions: Apache Hive https://hive.apache.org/ Apache Spark Sql http://spark.apache.org/sql/ Apache Phoenix https://phoenix.apache.org/ Apache Impala (incubating) https://www.cloudera.com/products/apache-hadoop/impala.html Apache Hawq (incubating) http://hawq.incubator.apache.org/ Apache Drill https://drill.apache.org/ Presto https://prestodb.io/ Oracle Big Data Sql http://www.oracle.com/us/products/database/big-data-sql/overview/index.html IBM BigSql http://www-01.ibm.com/software/data/infosphere/hadoop/big-sql.html Technology topics: JDBC/ODBC SQL syntax compliance Multi-user concurrency Benchmarks 46:40 Questions from our Listeners: How much storage overhead should I count on if I add SQL in my Hadoop workflow? How do I make my sql faster? 53:38 End     Please use the Contact Form on this blog or our twitter feed to send us your questions, or to suggest future episode topics you would like us to cover.

Les Cast Codeurs Podcast
LCC 115 - Interview with Sam Bessalah on data science, Hadoop and Mesos

Les Cast Codeurs Podcast

Play Episode Listen Later Dec 22, 2014 72:20


In this episode, we talk with Sam Bessalah about the “new” profession of data scientist. We also explore the Apache Hadoop and Apache Mesos worlds. These places are full of projects with strange names, and this interview helps make some sense of that mythology. Recorded on December 16, 2014 Episode download LesCastCodeurs-Episode–115.mp3 Interview Your life, your work @samklr His presentations, more here and there Data scientist What is that?! Is it new? Yet haven't we always had data in our information systems?! The sexiest job of the 21st century? Drew Conway’s Data Science Venn diagram Processing data, the platforms MapR, Hadoop, … What is it? Is it new? Where does it come from? How does it work? What is it for? Does it integrate with everything? And what about our legacy data sources (my good old mainframe and its EBCDIC)? Where did my EAI, ETL, and other B2C/B2B integration tools go? EAI ETL EBCDIC BI (Business Intelligence) Hadoop MapReduce Doug Cutting Apache Lucene - full-text search engine Apache Hadoop - platform for distributed, scalable processing HDFS - distributed file system Apache Hive - data warehouse on top of Hadoop offering SQL-like queries Teradata Impala - analytical database (“real time”) SQL queries etc Apache Tez - directed-acyclic-graph of tasks Apache Shark replaced by Spark SQL Apache Spark - Spark has an advanced DAG execution engine that supports cyclic data flow and in-memory computing Apache Storm - scalable, distributed processing of data streams Data Flow Machine Learning - learning from data Graph Lab And what about infrastructure in all this? From our good old servers filling machine rooms to the cloud (IAAS, PAAS), by way of virtualization (), containers (LXC, Docker, …) …. Resources galore are nice, but how do you manage them? 
YARN Apache Mesos Apache Mesos Getting started with Mesos Tutorials Data Center OS by Mesosphere Sam's presentation at Devoxx on Mesos Mesos and Docker containers Cluster Management and Containerization by Benjamin Hindman Continuous integration with Mesos by eBay Docker Docker Starting a Spark cluster with Docker Spark shell in Docker Docker and Kubernetes in Apache Hadoop YARN Hadoop cluster on Docker Docker, Kubernetes and Mesos cgroups LXC Docker vs LXC Marathon Chronos Chronos source code Aurora Kubernetes Kubernetes workshop Oscar Boykin Scalding Scala + BigData presentation and another one Apache Ambari How do I get started? How do you become a data scientist? (training, reference books, sources of information, …) Mesosphere Andrew Ng's Machine Learning course Introduction to data science on Coursera Kaggle MLlib Mahout R Scikit-learn (Python) Machine Learning for Hackers (book) Scala TypeSafe Activator iPython NoteBooks Other iPython NoteBooks references Temporary online notebooks - starts a Docker container on Rackspace for free (for you) Notebooks Parallel Machine Learning with scikit-learn and IPython View notebooks online without downloading them Spark / Scala notebooks for web based spark development http://zeppelin-project.org/ Spark and Scala with an IPython notebook Contact us Contact us via twitter http://twitter.com/lescastcodeurs on the Google group http://groups.google.com/group/lescastcodeurs or on the website http://lescastcodeurs.com/ Flattr us (donations) at http://lescastcodeurs.com/ Want to know more about sponsoring? sponsors@lescastcodeurs.com

Lightbend
Scala Days 2014 - “Catalyst: A Functional Query Optimizer for Spark and Shark" - Michael Armbrust

Lightbend

Play Episode Listen Later Aug 16, 2014 43:45


Shark is a SQL engine built on Apache Hive that replaces Hive's MapReduce execution engine with Apache Spark. Spark's fine-grained resource model and efficient execution engine allow Shark to outperform Hive by over 100x for data stored in memory. However, until now, Shark's performance has been limited by the inflexibility of Hive's query optimizer. Catalyst aims to remedy this situation by building a simple yet powerful optimization framework using Scala language features.

Query optimization can greatly improve both the productivity of developers and the performance of the queries that they write. A good query optimizer is capable of automatically rewriting relational queries to execute more efficiently, using techniques such as filtering data early, utilizing available indexes, and even ensuring different data sources are joined in the most efficient order. By performing these transformations, the optimizer not only improves the execution times of relational queries, but also frees developers to focus on the semantics of their application instead of its performance.

Unfortunately, building an optimizer is an incredibly complex engineering task, and thus many open source systems perform only very simple optimizations. Past research [1,2] has attempted to combat this complexity by providing frameworks that allow the creators of optimizers to write possible optimizations as a set of declarative rules. However, the use of such frameworks has required the creation and maintenance of special “optimizer compilers” and forced the burden of learning a complex domain-specific language upon those wishing to add features to the optimizer.

Catalyst solves this problem by leveraging Scala's powerful pattern matching and runtime reflection. This framework allows developers to concisely specify complex optimizations, such as pushing filters past joins, in a functional style. Increased conciseness allows our developers both to create new optimizations faster and to reason more easily about the correctness of each optimization. Catalyst also uses the new reflection capabilities in Scala 2.10 to generate custom classes at runtime for storing intermediate results and evaluating complex relational expressions. Doing so allows us to avoid boxing of primitive values and has been shown to improve performance by orders of magnitude in some cases.

[1] Graefe, G. The Cascades Framework for Query Optimization. Data Engineering Bulletin, Sept. 1995.
[2] Graefe, G., and DeWitt, D. J. The EXODUS Optimizer Generator. Proceedings of the 1987 ACM SIGMOD International Conference on Management of Data, pp. 160-172, May 27-29, 1987, San Francisco, California, United States.

Video available on www.parleys.com.
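
To make the idea of writing an optimization as a pattern match more concrete, here is a minimal sketch of a "push a filter past an inner join" rule. It is an illustration only: `Expr`, `Plan`, `Scan`, `Filter`, `Join`, and `PushFilterThroughJoin` are hypothetical names invented for this example and are not Catalyst's actual API, and the toy plan assumes every join is an inner join.

```scala
// Toy expression and plan trees. All names here are hypothetical and
// are NOT Catalyst's real classes.
sealed trait Expr { def refs: Set[String] }   // tables an expression mentions
case class Col(table: String, name: String) extends Expr {
  def refs: Set[String] = Set(table)
}
case class Lit(value: Int) extends Expr {
  def refs: Set[String] = Set.empty[String]
}
case class Gt(left: Expr, right: Expr) extends Expr {
  def refs: Set[String] = left.refs ++ right.refs
}

sealed trait Plan { def tables: Set[String] } // tables a plan produces
case class Scan(table: String) extends Plan {
  def tables: Set[String] = Set(table)
}
case class Filter(cond: Expr, child: Plan) extends Plan {
  def tables: Set[String] = child.tables
}
// Assumed to be an inner join, so one-sided filters can safely move below it.
case class Join(left: Plan, right: Plan) extends Plan {
  def tables: Set[String] = left.tables ++ right.tables
}

object PushFilterThroughJoin {
  // The rewrite rule itself: a partial function defined only for a Filter
  // sitting directly on a Join whose condition touches a single side.
  val rule: PartialFunction[Plan, Plan] = {
    case Filter(cond, Join(l, r)) if cond.refs.subsetOf(l.tables) =>
      Join(Filter(cond, l), r)
    case Filter(cond, Join(l, r)) if cond.refs.subsetOf(r.tables) =>
      Join(l, Filter(cond, r))
  }

  // Apply the rule top-down once: rewrite the current node if the rule
  // matches, then recurse into the children of the result.
  def transform(plan: Plan): Plan =
    rule.applyOrElse(plan, (p: Plan) => p) match {
      case Filter(c, child) => Filter(c, transform(child))
      case Join(l, r)       => Join(transform(l), transform(r))
      case s: Scan          => s
    }
}
```

With this sketch, a filter on `orders.amount` placed above `Join(Scan("users"), Scan("orders"))` ends up directly above `Scan("orders")`, while a filter that compares columns from both tables is left where it is. A real optimizer would also have to handle outer joins, conjunctions that can be split, and repeated application of rules until nothing changes.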

Rebuild
52: TLDR Driven Development (Naoya Ito)

Aug 2, 2014 (48:07)


With Naoya Ito as our guest, we talked about development retreats, nomad working, Terraform, READMEs, poems, MPP, BigQuery, and more. Show Notes: Hackers Champloo 2014 Resort work HikaruStar Venture Camp airbnb NomadList Terraform AWS CloudFormation kumogata Writing an easy-to-understand README.md Readme Driven Development Design Documents Working Backwards PDD poem-driven development Google BigQuery MPP on Hadoop, RedShift, BigQuery Hadoop Conference Japan 2014 Apache Hive Building a DWH with Google BigQuery Rebuild One-Year Anniversary

Enterprise Java Newscast
Episode 15 - May 2013

May 31, 2013 (111:52)


Kito, Ian, and Daniel cover new releases from Apache, PrimeFaces, SpringSource, ICEsoft, JBoss, IBM, Oracle, Google, and more. They also discuss the new Google Android IDE and SpringSource's new Reactor asynchronous framework.

New Releases

PrimeFaces
PrimeFaces Themes 1.0.10 Released
PrimeFaces 3.5.3 Released
Responsive Interportlet Communication
PrimeFaces Mobile 0.9.4 Released
PrimeFaces 3.4.5 released
PrimeFaces Extensions 0.7 released - new Timeline component
PrimeFaces 3.5.4 released

ICEsoft
ICEpdf 5 released
ICEfaces 3.3 released, with new ACE components

Apache
MEDIA ALERT: The Apache Struts Project Announces Apache Struts™ 1 End-Of-Life
Apache Syncope 1.1.1 released
Apache CouchDB 1.3.0 released
Apache Wookie 0.14 Release
Apache PDFBox 1.8.1 released
Apache Wink 1.3.0 release
Apache Bloodhound 0.5.3 Released
Apache MRUnit 1.0.0 released
Apache OpenNLP 1.5.3 released
Apache Derby 10.10.1.1 released
Apache Wicket 6.7.0 Released!
OpenJPA 1.2.3 Released
Apache Camel 2.11.0 Released
Apache Whirr 0.8.2 Released
HttpComponents HttpClient 4.2.5 GA release
Apache Sqoop 1.99.2 released
Apache Tomcat 6.0.37 released
Apache Lucene 4.3 released
Apache Solr 4.3 released
Apache Buildr 1.4.12 released
Apache Gora 0.3 Released
Apache Tomcat 7.0.40 released
Apache Jackrabbit 2.7.0 released
Apache JSPWiki 2.9.1-incubating released
Apache Jena 2.10.1 released
Apache Hive 0.11.0 Released
Apache Subversion 1.8.0-rc2 Released

SpringSource
Spring Framework 4.0 M1 & 3.2.3 available
Reactor – a foundation for asynchronous applications on the JVM
SPRING SECURITY 3.1.4 RELEASED
SPRING TOOL SUITE AND GROOVY/GRAILS TOOL SUITE 3.3.0.M1 RELEASED
SPRING BATCH 2.2.0 RC1 IS NOW AVAILABLE
SPRING MOBILE 1.1.0.M3 RELEASED

JBoss
Infinispan 5.3.0.Beta2 is out!
RichFaces 4.3.2.Final Release Announcement
JBossWS 4.2.0.Beta1 and WS-Discovery support
Forge 1.3.0.Final Released
IronJacamar 1.1.0.Beta5 is out!
Teiid 8.4 Beta2 Posted
TorqueBox 2.3.1 Released
JGroups 3.3.0.Final released
RHQ 4.7 released
jBPM 6.0 Beta2 available
Arquillian Drone Extension 1.2.0.Alpha2 Released
Hibernate ORM 4.3.0.Beta2 Released

IBM
Liberty Repository is up
Tomcat Migration Kit Technology Preview Released

Oracle
Oracle ADF Mobile 1.1 Released
Java EE 7 Scheduled for Release June 12th
Oracle JDeveloper and ADF 11g Release 1

Scala
Akka 2.1.4 Released

Other
IntelliJ IDEA is the base for Android Studio, the new IDE for Android developers

Events
No fluff just stuff
TDC (The Developer's Conference) Florianopolis, Brazil - Event for developers, IT professionals and students, with a Java track. May 24-26
JUDCon / CamelOne, Boston, MA June 9-11, 2013
EclipseCon France, Toulouse, France June 5-6.
QCon New York June 12-June 14.
Oracle Technology Network Developer Day: Big Data, Reading, UK June 19th.
ODTUG Kscope13 - New Orleans, LA, USA June 23-27
Oracle Technology Network Developer Day: Service Integration using Oracle SOA Suite 11g, London, UK June 26.
TDC (The Developer's Conference) Sao Paulo - Event for developers, IT professionals and students, with a Java track. July 10-14
JavaOne China, Shanghai July 22-25.
Scala Days New York, N June 10-12
JavaZone, Oslo, Norway Sep 11-12
JavaOne, San Francisco Sep 22-26
Devoxx Belgium, Antwerp November 11-15
jDays, Gothenburg, Sweden - Call for Papers ends Aug 25th. Nov 26-27

Unsupported Operation

Unsupported Operation 64

Java / Misc
JavaFX 2.1 gets MPEG4 playback
Scala artifacts now in central
Github's mashup of Jenkins called Janky
The state of IcedTea and IcedTeaWeb video from FOSDEM
Spring Data JPA 1.1.0 RC1 and 1.0.3 GA Released
http://bit.ly/xkOR9C
PrimeFaces 3.0 - a year long development, its tagline is Ajax, Mobile and IE9 components. IE9 components????
Scandal: ICEFaces is just a rip off of PrimeFaces
Spring Roo 1.2.1 available, patch release which brings support for the new PrimeFaces and latest GAE
Query Time Joining makes it into Lucene 3.6 (but a different impl from 4.0 which is 3x faster)

Google
Google App Engine "Community Support" moved to Stack Overflow
Fails in its attempt to keep email out of court on Android

Hardware x-over
Sheeva Plug, the box from Globalscale that the FreedomBox is based on also has a JVM+OSGI kit on an SD card.
Speaking of OSGi, Distributed OSGi RI 1.3 is out, based on Apache CXF

Apache
Richard moved to Maven 3.0.4 and is having no problems
Apache Jackrabbit 2.4.0, 2.2.11 released http://jackrabbit.apache.org - lots of new features, fixes and improvements
(not Java, but) Apache libcloud gone 0.8.0 http://libcloud.apache.org/
Apache MyFaces CVE-2011-4367
Apache MyFaces information disclosure vulnerability affects MyFaces 2.0.1 - 2.0.11, 2.1.0 - 2.1.5
MyFaces JavaServer Faces (JSF) allows relative paths in the javax.faces.resource 'ln' parameter or writing the url so the resource name includes '..' sequences. An attacker could use the security vulnerability to view files that they should not be able to.
http://://faces/javax.faces.resource/../WEB-INF/web.xml
MyFaces Core 2.0.12 and 2.1.6 released
Apache Directory Studio 2.0M2
Apache Directory DS 2.0.0-M5
Apache LDAP API 1.0.0-M10
HttpClient 4.1.3 GA
Apache Hive 0.8.1 - distributed data warehouse on top of Hadoop
Commons Configuration 1.8
Commons Validator 1.4
Lucy 0.3 (incubating)
Apache Lucy is a full-text search engine library written in C and targeted at dynamic languages
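
The MyFaces issue noted above (CVE-2011-4367) is a path traversal: a resource name containing '..' can point outside the directory it is supposed to stay in. As a rough sketch of the general defence, and not of how MyFaces actually fixed it, a resource name can be resolved against a base directory, normalized, and rejected if it escapes. `ResourcePaths`, `resolveInside`, `underResources`, and the `/srv/app/resources` directory are hypothetical names chosen for this example.

```scala
import java.nio.file.{Path, Paths}

// Hypothetical helper for illustration only; this is not MyFaces code.
object ResourcePaths {
  // Resolve a requested resource name under baseDir and return it only if
  // the normalized result is still strictly inside baseDir.
  def resolveInside(baseDir: Path, requested: String): Option[Path] = {
    val base = baseDir.toAbsolutePath.normalize()
    // normalize() collapses "." and ".." segments before the containment check
    val candidate = base.resolve(requested).normalize()
    if (candidate.startsWith(base) && candidate != base) Some(candidate) else None
  }

  // Convenience wrapper using an assumed resource directory.
  def underResources(requested: String): Option[Path] =
    resolveInside(Paths.get("/srv/app/resources"), requested)
}
```

Under these assumptions, a name such as `css/site.css` resolves normally, while `../WEB-INF/web.xml` is rejected because the normalized path no longer starts with the base directory.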