Data Engineering is the plumbing of data science. Almost invisible, but super important and a big mess when done wrong. We talk about interesting Data Engineering trends and topics. I also train Data Engineering in my Data Engineering Academy at LearnData
I have seen people doing that wrong a few times. Luckily, Wouter Trappers, who helps companies with this as a professional, can help. We talked about the steps you need to take from value proposition to dashboards. Wouter is really knowledgeable, and it was super fun talking with him and hearing his approach.
Dhruba Borthakur is CTO at Rockset and a passionate Data Engineer. Before co-founding Rockset he played a big role in the development of Hadoop HDFS at Yahoo as well as HBase and RocksDB at Facebook. His current project is the serverless Rockset platform, where you can gain real-time analytics insight into your data. I tried it out before our talk and really liked it.
The Podcast is back!!!! I promise I am going to keep it up to date this time ;) In this episode I talk about my newest Data Engineering course. I think it's the ultimate 1 hour 15 minute introduction to Data Engineering. There were also a ton of questions from the chat that I answered. I think you will really enjoy this.
We are looking into the network communication protocol map. I first saw this like 10 years ago and it's awesome. Then we check out the Big Data and Data Science Landscape image. It shows you all the tools available to do data science, machine learning and data engineering, which is very helpful if you are researching tools to use. Before using the Twitter API you have to create a developer account, so I show you how I created one. After that I tried to get NiFi to download tweets, but it is not working.
Today's podcast is a bit of a behind-the-scenes look: what it takes to do an audio podcast and how you can get audio-to-text transcriptions for free. Also, GitHub questions on how to work with branches on the Cookbook.
Today, a case study about OLX with a guest. It was super fun! Here are the slides Alexey and I talked about: https://www.slideshare.net/mobile/AlexeyGrigorev/image-models-infrastructure-at-olx
In this episode we install the NiFi Docker container and look into how we can extract the Twitter data. We are also talking about the differences between infrastructure as a service, platform as a service and software as a service.
In this episode we look into the Twitter API documentation, which I love by the way. How can we get old tweets for certain hashtags, and how do we get current live tweets for these hashtags?
Tips on how you find a job in Germany and two super interesting mails.
The cookbook and my YouTube channel will be free, forever! Check out the data engineering cookbook on GitHub: https://github.com/andkret/Cookbook
Updates of the cookbook and how to collaborate on it
In this episode we talk about the Lambda architecture with stream and batch processing, as well as an alternative, the Kappa architecture, which consists only of streaming. Also, data engineer vs. data scientist, and we discuss Andrew Ng's AI Transformation Playbook.
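To make the difference concrete, here is a minimal Python sketch that is not taken from the episode. The event list and all function names (`batch_view`, `speed_view`, `lambda_query`, `kappa_view`) are made up for illustration. In the Lambda style a batch layer recomputes counts over the history up to its last run, a speed layer covers the events that arrived since then, and a query merges both views. In the Kappa style there is only one streaming code path, and reprocessing would mean replaying the log through it.

```python
from collections import Counter

# Hypothetical event log: (timestamp, hashtag) pairs.
EVENTS = [(1, "data"), (2, "spark"), (3, "data"), (4, "kafka"), (5, "data")]


def batch_view(events, up_to):
    """Batch layer: recompute counts from the full history up to a cutoff."""
    return Counter(tag for ts, tag in events if ts <= up_to)


def speed_view(events, after):
    """Speed layer: count only the events newer than the last batch run."""
    return Counter(tag for ts, tag in events if ts > after)


def lambda_query(events, last_batch_ts):
    """Lambda: answer a query by merging the batch view and the speed view."""
    return batch_view(events, last_batch_ts) + speed_view(events, last_batch_ts)


def kappa_view(event_stream):
    """Kappa: a single streaming path that updates state event by event."""
    counts = Counter()
    for _, tag in event_stream:
        counts[tag] += 1
    return counts


lambda_result = lambda_query(EVENTS, last_batch_ts=3)
kappa_result = kappa_view(iter(EVENTS))
print(lambda_result == kappa_result)
```

Both paths arrive at the same counts here. The trade-off in this toy version is that Lambda keeps two implementations of the same logic in sync, while Kappa keeps one and relies on being able to replay the stream.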
How do you choose between cloud and on-premise? Pros and cons, and what you have to think about, because there are good reasons not to go cloud. Also, thoughts on how to choose between the cloud providers by just comparing instance prices. Otherwise the comparison will drive you insane.
In this episode we go over the ideas I have for the data engineering course structure. It was your chance to influence what we put in there.
In this video we go over some of the 100+ comments I received on LinkedIn about a data engineering training.
Let's check out how LinkedIn is processing data
How is Twitter doing Data Engineering? Oh man, they have a lot of cool things to share these tweets.
In this episode we are looking at the data engineering at Spotify, my favorite music streaming service. How do they process all that data?
In this podcast we look at the engineering culture at Spotify, my favorite music streaming service. The process behind the development of Spotify is really awesome.
A look into how Pinterest is doing data engineering.
Configuring a sub-1000-dollar PC for data engineering and machine learning. Links to the builds: $900 build: https://pcpartpicker.com/list/22ThcY $1500 build: https://pcpartpicker.com/list/hXJdV6
A look into how NASA is doing data engineering.
A simple introduction to doing data science in the context of the Internet of Things.
A look into how CERN is doing Data Engineering. They get huge amounts of data from the Large Hadron Collider. Let's check it out.
A look into how Booking.com is doing data engineering.
A look into how Airbnb is doing Data Engineering.
How Netflix is doing Data Engineering using their Keystone platform
I decided to rework the cookbook focusing more on case studies and less on explaining tools. People keep asking me for a path to become a data engineer and, let's be honest, you will never achieve that with just knowledge of the tools. Finding out how companies do data engineering on their data science platforms is way more useful. Over the next weeks we will go over each study on my YouTube channel. The stuff we talk about will then go into the cookbook too.
An introduction to Hadoop HDFS, YARN and MapReduce. Yes, Hadoop is still relevant in 2019, even if you look into serverless tools.
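If you have never seen the MapReduce idea in code, here is a tiny single-machine Python sketch of the classic word count. It is only an illustration of the map, shuffle and reduce steps, not Hadoop's actual API, and the function names and sample lines are made up. On a real cluster the map and reduce calls would run in parallel on many nodes, with the input coming from HDFS.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


# Hypothetical input lines standing in for file blocks.
lines = [
    "Hadoop stores data in HDFS",
    "YARN schedules jobs",
    "hadoop runs MapReduce jobs",
]

mapped = [pair for line in lines for pair in map_phase(line)]
word_counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(word_counts["hadoop"], word_counts["jobs"])
```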
The Internet of Things is a huge deal. There are many platforms available, but which one is actually good? Join me on a 50 minute dive into the Siemens MindSphere online documentation. I have to say I was super unimpressed by what I found. Many limitations, unclear architecture and no pricing available? Not good!
A stream full of mediocre guitar playing and great Q&A about Hadoop.
I have created a Medium Publication especially for us Plumbers of Data Science who work in Data Engineering and Big Data. It's called, you guessed it, Plumbers of Data Science.
What is the difference between SQL and NoSQL? In this episode I use HBase as an example to show you how a key/value store works.
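Here is a rough Python sketch of the key/value idea, loosely modeled on how HBase addresses data by row key and a column written as family:qualifier. The class `TinyKeyValueStore` and its methods are hypothetical and have nothing to do with the real HBase client API. It only shows the access pattern: you write and read by key, and range-style reads work on sorted row keys instead of SQL joins or WHERE clauses on arbitrary columns.

```python
class TinyKeyValueStore:
    """Toy in-memory store: row key -> {"family:qualifier": value}."""

    def __init__(self):
        self._rows = {}

    def put(self, row_key, column, value):
        """Write one cell; the row is created on first write."""
        self._rows.setdefault(row_key, {})[column] = value

    def get(self, row_key, column=None):
        """Read a whole row, or a single cell if a column is given."""
        row = self._rows.get(row_key, {})
        return dict(row) if column is None else row.get(column)

    def scan(self, prefix):
        """Return rows whose key starts with prefix, in sorted key order."""
        return [
            (key, dict(self._rows[key]))
            for key in sorted(self._rows)
            if key.startswith(prefix)
        ]


store = TinyKeyValueStore()
store.put("user#42", "profile:name", "Ada")
store.put("user#42", "profile:city", "Berlin")
store.put("user#7", "profile:name", "Linus")
store.put("order#1", "item:sku", "A-100")

print(store.get("user#42", "profile:name"))
print([key for key, _ in store.scan("user#")])
```

Notice that rows do not need the same columns. That flexibility, plus lookups by key, is the main contrast to a fixed SQL table schema.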
On this podcast I talk about data warehouses and data lakes. When do people use which? What are the pros and cons of both? Architecture examples for both and does it make sense to completely move to a data lake?
In this episode I talk about how you can gain a competitive edge on the job market. It's super simple, you can and should start with it TODAY by putting yourself out there.
The Data Science hype is still strong. Where's the industry going, towards a cliff? Here's what you can do.
In this episode I show you the first version of my data engineering cookbook.
Getting a book and reading it cover to cover is useless. In this episode I show you my strategy of buying books complementary to your work, and 5 great books I read over the years that helped me get where I am now.
In this podcast we talk about the differences between data scientists, analysts and engineers, which are the three main data science jobs. All three are super important.
After all the BS solutions using Blockchain I thought I finally found one that makes sense. Of all the possibilities, it's the EU data protection law GDPR. Well, one problem I overlooked in this podcast is that it is impossible to delete data after it is in the chain. Being able to delete data, however, is a requirement of GDPR. So, I was wrong. Again :D
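Why deleting from a chain does not work can be shown with a very simplified hash chain in Python. This is only a sketch under my own assumptions: real blockchains add consensus, signatures and much more, and the helper names and sample records are made up. Each block stores the hash of the block before it, so changing or blanking one record invalidates the chain from that point on.

```python
import hashlib

GENESIS = "0" * 64  # placeholder "previous hash" for the first block


def block_hash(prev_hash, data):
    """Hash a block's payload together with the previous block's hash."""
    return hashlib.sha256((prev_hash + data).encode("utf-8")).hexdigest()


def build_chain(records):
    """Link records so that each block commits to everything before it."""
    chain, prev = [], GENESIS
    for data in records:
        current = block_hash(prev, data)
        chain.append({"prev": prev, "data": data, "hash": current})
        prev = current
    return chain


def is_valid(chain):
    """Recompute every hash and check the links between blocks."""
    prev = GENESIS
    for block in chain:
        expected = block_hash(block["prev"], block["data"])
        if block["prev"] != prev or block["hash"] != expected:
            return False
        prev = block["hash"]
    return True


chain = build_chain(["alice: consent given", "bob: consent given", "carol: consent given"])
valid_before = is_valid(chain)

# Simulate an erasure request by blanking Alice's record in place.
chain[0]["data"] = ""
valid_after = is_valid(chain)
print(valid_before, valid_after)
```

The chain validates before the change and fails afterwards, which is exactly the conflict with a right to erasure.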
In this episode Kate Strachnyi interviews me for her Humans of Data Science podcast. We talk about how I found out that I am more into the engineering part of data science.
In this episode I show you how much data science graduates are actually paid in Germany. All over the internet you can read that data science salaries are over 100k dollars. Data engineer or data scientist, it's way lower than that. Then I give you a few really good tips on how to choose the right company to work for. Huge corporation, startup or small company? Here's how to choose.
In this podcast I am showing you how I use GitHub to write my Data Engineering Cookbook with LaTeX.
What is the best editing tool to write a thesis, a dissertation or a paper? NOT Word or Pages! It's LaTeX. In today's video I show you why I decided to use LaTeX to write my data engineering cookbook. I used it before for my diploma thesis and I am in love again :) Here's the link to the cheatsheet: https://wch.github.io/latexsheet/latexsheet.pdf Check out my Patreon for the Data Engineering Cookbook: http://bit.ly/PatreonAndreasKretz Music: "Day One" by Declan DP https://soundcloud.com/declandp Attribution 3.0 Unported https://creativecommons.org/licenses/by/3.0/
You have certifications or a university degree, but can't find a job? Sharing your ideas and knowledge will increase your chances! Here's how you can do that. Music: "Day One" by Declan DP https://soundcloud.com/declandp Attribution 3.0 Unported https://creativecommons.org/licenses/by/3.0/
I love agile development. People keep telling you to do Scrum, like it's the only and best choice to be agile. It's not. Here's my take on Scrum and my four main beefs with it. Watch out for these issues if you are doing Scrum.
So, Cloudera and Hortonworks merge... In today's Plumbers of Data Science Podcast I talk about what these big data vendors do and how they enable companies, admins and developers to do data science and many more things. If you are interested in the whole Hadoop ecosystem you need to check out this episode. You won't regret it ;)
Is ETL dead in Data Science and Big Data? In today's podcast I share with you my views on your questions regarding ETL (extract, transform, load). Data lakes and data warehouses: where is the difference? Is ETL still practiced, or did preprocessing and cleansing replace it? What would replace ETL in Data Engineering? How to become a data engineer? (Check out my Facebook note.) How to get experience training at home? Real-time analytics with RDBMS or HDFS?
What's the difference between Data Scientists & Data Analysts? What to do to find internships or a full time job? Data Scientist and Engineer in large and small companies, where's the difference? Are Data Engineers generalists or specialists? Just some questions I go over in this podcast. You sent me over 100 questions, so I finally worked up the guts to start with the Q&A videos, answering your questions one by one. Turns out it's a lot of fun :)
Without the proper tools and techniques of version control the team's efficiency goes down the drain. In this episode I talk about how tools like Jira enable you to collect bugs, future features or change requests. How they enable you to create and organize versions, add items to a version and assign items to developers. Once this is done, the team can efficiently start coding with the help of source code management systems like GitHub. How does all that work? Check out this episode to find out :)
You need to become comfortable with distributed processing. Whether it's Data Science or the Internet of Things, the amount of data that is getting produced and processed grows like crazy. In this podcast I talk about what a platform for distributed processing looks like. I talk about the different layers that need parallelization, as well as the tools you can use for on-premise installations or clouds like AWS, Azure or Google Cloud: Big Data tools like Kafka and Spark, or serverless options like Kinesis or Lambda functions.