Machine Learning Basics in PySpark

In this class I cover the basics of pre-processing and running machine learning models in PySpark. First, I review the distinction between estimators and transformers (an extremely important topic for the Databricks ML Practitioner Exam). Then I cover how to perform pre-processing steps, including the following (see the sketch after this list):

  • One-Hot encoding
  • Preparing categorical data for processing with StringIndexer
  • Missing data imputation
  • Rescaling features
  • Assembling feature columns into the single vector a model expects with VectorAssembler

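The key idea behind the estimator/transformer distinction is that an estimator must first learn something from the data via `.fit()` (returning a fitted model), while a transformer maps one DataFrame to another via `.transform()` with nothing to learn. Below is a minimal sketch of the pre-processing steps above; the DataFrame `df` and the column names (`color`, `age`, `income`, `label`) are hypothetical placeholders, not the columns used in class.

```python
from pyspark.ml.feature import (
    StringIndexer, OneHotEncoder, Imputer, StandardScaler, VectorAssembler
)

# StringIndexer is an estimator: .fit() learns the category-to-index mapping
# and returns a StringIndexerModel, which is a transformer.
indexer = StringIndexer(inputCol="color", outputCol="color_idx", handleInvalid="keep")
df = indexer.fit(df).transform(df)

# OneHotEncoder turns the index column into a sparse one-hot vector.
encoder = OneHotEncoder(inputCols=["color_idx"], outputCols=["color_vec"])
df = encoder.fit(df).transform(df)

# Imputer fills missing numeric values (column mean by default).
imputer = Imputer(inputCols=["age", "income"], outputCols=["age_imp", "income_imp"])
df = imputer.fit(df).transform(df)

# VectorAssembler is a pure transformer (no fit step): it combines feature
# columns into the single vector column that MLlib models expect.
assembler = VectorAssembler(
    inputCols=["color_vec", "age_imp", "income_imp"], outputCol="features_raw"
)
df = assembler.transform(df)

# StandardScaler rescales the assembled features; it is an estimator because
# it must first learn the per-feature standard deviations.
scaler = StandardScaler(inputCol="features_raw", outputCol="features")
df = scaler.fit(df).transform(df)
```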
Finally, I cover how to run a machine learning model and how to combine all the above steps into a single pipeline optimized for deployment.
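As a rough illustration of that final step, the sketch below chains the stages from the previous example into a single Pipeline and fits a model; the choice of LogisticRegression, the `raw_df` input, and the `label` column are assumptions for illustration only.

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression

# A Pipeline is itself an estimator: fitting it fits every estimator stage
# in order and returns a PipelineModel (a transformer) ready for deployment.
lr = LogisticRegression(featuresCol="features", labelCol="label")
pipeline = Pipeline(stages=[indexer, encoder, imputer, assembler, scaler, lr])

train_df, test_df = raw_df.randomSplit([0.8, 0.2], seed=42)
model = pipeline.fit(train_df)
predictions = model.transform(test_df)
predictions.select("label", "prediction", "probability").show(5)
```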

Data Is Available Here and Code Is Available Here

Suggested Reading:

  • Spark: The Definitive Guide, Chapters 24-27 (p. 401-488)
  • Learning Spark, 2nd Edition, Chapter 10 (p. 285-321)