Introduction to PySpark

Class Overview

Below are the materials for my ongoing Introduction to PySpark course. The class covers the foundational topics of big data analysis with PySpark in Databricks, including:

  • Spark architecture
  • Basic data transformations
  • DataFrame joins and aggregations
  • Optimizing PySpark code
  • Koalas, or using Python Pandas syntax for parallel processing
  • Working with Delta Lake
  • Modeling with MLlib
  • Model deployment and monitoring with MLflow

For the ongoing class schedule, please join my Data Science and Big Data Meetup.

Class GitHub

Code and data for each class can be found on GitHub here: https://github.com/kelsey-huntzberry/DataAnalysisLab.

Class 1: Spark Architecture & Data Exploration

Class Video & Materials

Class 2: Data Manipulation Basics

Class Video & Materials

Class 3: Optimizing PySpark Code

Class Video & Materials

Class 4: Machine Learning Basics in PySpark

Class Video & Materials

Class 5: Model Maintenance & Tracking with MLflow

Class Video & Materials