
Data engineering has historically involved extracting data from disparate sources, transforming it into a standard layout, and then loading it into a new database for analytics. These pipeline jobs usually ran on a schedule, such as nightly or weekly. In today's fast-paced, high-tech world, however, the need for data closer to real time, meaning shortly after it was first generated, is higher than ever. In today's episode we hear from Dustin Vannoy, a consultant and blogger in the streaming data space, about how to use Apache Spark, one of the most popular engines for large-scale data processing and streaming analytics.

How to connect with Dustin:
- WEBSITE: https://dustinvannoy.com/
- TWITTER: /dustinvannoy
- LINKEDIN: /dustinvannoy
- YOUTUBE: /@dustinvannoy

Learn data skills at our academy and elevate your career. Start for free at https://ftdacademy.com/YT

Chapters:
0:00:00 Intro
0:01:01 Dustin's Background
0:09:51 Transitioning from legacy databases to Big Data and Streaming
0:13:29 Microbatching vs Streaming
0:18:17 What is Spark and why use it?
0:22:33 Apache Spark vs Databricks
0:26:24 Pay for a hosted Spark version or roll your own?
0:28:27 Databricks setup
0:30:25 How Databricks executes queries
0:32:41 Scaling approaches to Spark
0:35:14 Connecting to external databases in Databricks
0:37:51 Visualizing data in Databricks
0:39:40 Using Spark for ETL work
0:42:50 What is real-time processing?
0:44:25 How to build a streaming job in Spark using Kafka
0:46:18 Streaming architecture overview
0:49:15 Pulling data from Kafka into Spark streaming
0:51:09 Why apps use Kafka
0:54:33 Why use Spark versus alternatives
0:57:37 What is Confluent?
0:59:38 Ways to learn Spark
1:02:04 How hard is Spark to learn?
1:04:16 Troubleshooting errors in Spark
1:07:03 How hard is it to transition to Spark from traditional databases?
1:11:51 Interviewing for a Spark job
1:15:46 Outro
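
Want to try the Kafka-to-Spark pattern Dustin walks through (see 0:44:25 and 0:49:15)? Here is a minimal PySpark Structured Streaming sketch. The broker address, the topic name "events", and the checkpoint path are placeholders, and you'll need the spark-sql-kafka connector package available when launching Spark.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Start a Spark session; the Kafka source requires the
# spark-sql-kafka connector package at launch time.
spark = SparkSession.builder.appName("kafka-streaming-demo").getOrCreate()

# Read the topic as an unbounded streaming DataFrame.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder broker
    .option("subscribe", "events")                         # placeholder topic
    .option("startingOffsets", "latest")
    .load()
)

# Kafka delivers values as bytes; cast to string for downstream parsing.
messages = events.select(col("value").cast("string").alias("message"))

# Print each micro-batch to the console; the checkpoint directory
# tracks offsets so the job can resume where it left off.
query = (
    messages.writeStream
    .format("console")
    .option("checkpointLocation", "/tmp/checkpoints/kafka-demo")  # placeholder path
    .start()
)

query.awaitTermination()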