July 10, 2021July 11, 2021 khuntzberry

Machine Learning Basics in PySpark

In this class I cover the basics of pre-processing and running machine learning models in PySpark. First, I review the concept of estimators versus transformers (this is an extremely important topic for the Databricks ML Practitioner Exam). I cover how to perform pre-processing steps including:

One-Hot encoding
Preparing categorical data for processing with StringIndexer
Missing data imputation
Rescaling features
Inputting data into a model with VectorIndexer

Finally, I cover how to run a machine learning model and how to combine all the above steps into a single pipeline optimized for deployment.

Data Is Available Here and Code Is Available Here

Suggested Reading:

Spark: The Definitive Guide, Chapters 24-27 (p. 401-488)
Learning Spark, 2nd Edition, Chapter 10 (p. 285-321)

Related