
Data engineering has historically involved extracting data from disparate sources, transforming it into a standard layout, and then loading it into a new database for analytics. These pipeline jobs usually ran on a schedule, such as nightly or weekly. In today's fast-paced, high-tech world, however, the need for data closer to real time, meaning shortly after it was first generated, is higher than ever. In today's episode we hear from Dustin Vannoy, a consultant and blogger in the streaming data space, about how to use Apache Spark, one of the most popular engines for large-scale data processing and streaming analytics.

How to connect with Dustin:
- WEBSITE: https://dustinvannoy.com/
- TWITTER: /dustinvannoy
- LINKEDIN: /dustinvannoy
- YOUTUBE: /@dustinvannoy

Learn data skills at our academy and elevate your career. Start for free at https://ftdacademy.com/YT

Chapters:
0:00:00 Intro
0:01:01 Dustin's Background
0:09:51 Transitioning from legacy databases to Big Data and Streaming
0:13:29 Microbatching vs Streaming
0:18:17 What is Spark and why use it?
0:22:33 Apache Spark vs Databricks
0:26:24 Pay for a hosted Spark version or roll your own?
0:28:27 Databricks setup
0:30:25 How Databricks executes queries
0:32:41 Scaling approaches to Spark
0:35:14 Connecting to external databases in Databricks
0:37:51 Visualizing data in Databricks
0:39:40 Using Spark for ETL work
0:42:50 What is real-time processing?
0:44:25 How to build a streaming job in Spark using Kafka
0:46:18 Streaming architecture overview
0:49:15 Pulling data from Kafka into Spark streaming
0:51:09 Why apps use Kafka
0:54:33 Why use Spark versus alternatives
0:57:37 What is Confluent?
0:59:38 Ways to learn Spark
1:02:04 How hard is Spark to learn?
1:04:16 Troubleshooting errors in Spark
1:07:03 How hard is it to transition to Spark from traditional databases?
1:11:51 Interviewing for a Spark job
1:15:46 Outro
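
Want to try the Kafka-to-Spark pattern Dustin walks through (see 0:44:25 and 0:49:15)? Here is a minimal PySpark Structured Streaming sketch. The broker address, the topic name "events", and the checkpoint path are placeholders, and you'll need the spark-sql-kafka connector package available when launching Spark.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Start a Spark session; the Kafka source requires the
# spark-sql-kafka connector package at launch time.
spark = SparkSession.builder.appName("kafka-streaming-demo").getOrCreate()

# Read the topic as an unbounded streaming DataFrame.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder broker
    .option("subscribe", "events")                         # placeholder topic
    .option("startingOffsets", "latest")
    .load()
)

# Kafka delivers values as bytes; cast to string for downstream parsing.
messages = events.select(col("value").cast("string").alias("message"))

# Print each micro-batch to the console; the checkpoint directory
# tracks offsets so the job can resume where it left off.
query = (
    messages.writeStream
    .format("console")
    .option("checkpointLocation", "/tmp/checkpoints/kafka-demo")  # placeholder path
    .start()
)

query.awaitTermination()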