Data Manipulation Basics

In this session we cover PySpark basics such as reading in data, filtering, joining, selecting and dropping columns, and creating new columns. We also cover Koalas, which provides a pandas-like API on top of Spark, so familiar pandas-style code can run with parallel processing in Databricks.


Code and Data Available Here

Suggested Reading:

  • Spark: The Definitive Guide, Chapter 5 (p. 59-81), Chapter 6 (p. 83-115), Chapter 7 (p. 117-137), and Chapter 9 (p. 153-158)
  • Learning Spark, 2nd Edition, Chapter 3 (p. 43-82)