PySpark - How to Preprocess Large-Scale Data with Python

Writer: Joonwon Jang

Contact: [email protected]

0. Prerequisites For Practice

Spark runs on Java 8, 11, or 17.
- https://www.oracle.com/java/technologies/downloads/#jdk17-mac
Install Apache Spark (the framework that powers PySpark)
- https://spark.apache.org/downloads.html
- Download the latest version with the package type "Pre-built for Apache Hadoop 3.3 and later"
Spark runs on Python 3.7+
Install the following libraries in Python

pip install pyspark findspark
conda install jupyter

Launch JupyterLab from the terminal

jupyter-lab
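
Once the notebook is running, a first cell can confirm that the prerequisites listed above are in place. The following is a minimal sketch using only the standard library; the function name `check_prerequisites` and the dictionary keys are hypothetical and chosen for illustration. It checks only that a `java` executable is on the PATH, not which Java version it is.

```python
import importlib.util
import shutil
import sys


def check_prerequisites():
    """Return a dict mapping each prerequisite to True/False."""
    return {
        # Spark needs Python 3.7 or newer.
        "python>=3.7": sys.version_info >= (3, 7),
        # Spark needs a Java runtime; this only checks that `java` is on PATH,
        # not that its version is 8, 11, or 17.
        "java on PATH": shutil.which("java") is not None,
        # The two libraries installed with pip above.
        "pyspark": importlib.util.find_spec("pyspark") is not None,
        "findspark": importlib.util.find_spec("findspark") is not None,
    }


status = check_prerequisites()
for name, ok in status.items():
    print(f"{name}: {'OK' if ok else 'MISSING'}")
```

If every entry reports OK, calling `findspark.init()` and then `import pyspark` in the same notebook should succeed.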

1. Introduction to Spark

Spark

A unified analytics engine for large-scale data processing
In-memory (RAM) processing (Hadoop processes data on disk, which is where the speed difference comes from.)
- Unlike Hadoop MapReduce, Spark loads the data to be mapped/reduced into memory and writes the processing results to memory as well.