Introduction to PySpark

Class Overview

Below are materials for my ongoing Introduction to PySpark course. This class will cover the foundational topics of big data analysis with PySpark in Databricks, including:

Spark architecture
Basic data transformations
DataFrame joins and aggregations
Optimizing PySpark code
Koalas, or using Python Pandas syntax for parallel processing
Working with Delta Lake
Modeling with MLlib
Model deployment and monitoring with MLflow

For the ongoing class schedule, please join my Data Science and Big Data Meetup.

Class GitHub

Code and data associated with each class can be found on GitHub here: https://github.com/kelsey-huntzberry/DataAnalysisLab.

Class 1: Spark Architecture & Data Exploration

Class Video & Materials

introduction_to_pyspark_session1 Download

Class 2: Data Manipulation Basics

Class Video & Materials

Class 3: Optimizing PySpark Code

Class Video & Materials

introduction_to_pyspark_session3 Download

Class 4: Machine Learning Basics in PySpark

Class Video & Materials

pyspark_machine_learning_session4 Download

Class 5: Model Maintenance & Tracking with MLflow

Class Video & Materials

pyspark_machine_learning_session5 Download

Kelsey Emnett, Data Scientist

Delivering Data-Driven Insights and Solutions
