df.repartition(partitionsNum)
→ Repartitions a DataFrame into whatever number of partitions you specify; this is a DataFrame method (not a SparkSession method) and it triggers a full shuffle of the data.
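A minimal PySpark sketch, assuming a local session and a hypothetical Parquet file at data/events.parquet:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").appName("repartition-sketch").getOrCreate()

df = spark.read.parquet("data/events.parquet")   # hypothetical input path
print(df.rdd.getNumPartitions())                 # partition count Spark chose on read

repartitioned = df.repartition(16)               # full shuffle into 16 partitions
print(repartitioned.rdd.getNumPartitions())      # -> 16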
Data Locality in Hadoop/HDFS:
- Hadoop achieves data locality by running the code on the node that already holds the data that code needs, minimizing I/O as much as possible.

- In Spark → the data partitions live in S3/GCS/ADLS Gen2, not on the Spark executors themselves, so each executor first has to load its share of the data from the object store onto the node and only then starts executing (see the sketch below).
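A minimal sketch of that read path, assuming a hypothetical s3a:// bucket path (and the hadoop-aws connector on the classpath); the executors pull these bytes over the network before any task runs on them:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("object-store-read-sketch").getOrCreate()

# Hypothetical bucket path; each executor fetches its splits over the network
# from S3, so there is no HDFS-style data locality to exploit.
df = spark.read.parquet("s3a://my-bucket/events/")
print(df.rdd.getNumPartitions())  # driven by file sizes and split settings, not node placement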
GROUP BY under the hood:
Stage #1:
- Typically at the start of the Spark job, each Spark executor gets assigned a data partition.
- In stage 1, the main goal for each Spark executor is to group the data within its own partition (see the sketch below).
- Intermediate output: the partially grouped data each executor produces at the end of stage 1.
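A rough sketch of the per-partition grouping, using mapPartitions on a toy RDD of (key, 1) pairs so the intermediate output is visible (local session assumed; the helper function name is made up for illustration):

from collections import Counter
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").appName("stage1-sketch").getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 1), ("c", 1)] * 1000, numSlices=4)

def combine_within_partition(records):
    # Each executor only sees its own partition here; the per-partition
    # counts it emits are the "intermediate output" of stage 1.
    counts = Counter()
    for key, value in records:
        counts[key] += value
    return iter(counts.items())

intermediate = pairs.mapPartitions(combine_within_partition)
print(intermediate.glom().collect())  # partial counts per partition, before any shuffle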

Stage #2:
- The main goal of this stage is to reshuffle the data across partitions so that all records belonging to the same group end up in the same partition.
- The data for each group is then reduced into a single record (see the sketch below).
- This process is called an external merge sort: when the data being grouped does not fit in memory, Spark sorts it, spills sorted runs to disk, and merges them back together.
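Both stages together with the DataFrame API, as a minimal sketch with toy data; explain() shows the partial aggregate (stage 1), the Exchange (the reshuffle), and the final aggregate (stage 2):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[4]").appName("groupby-sketch").getOrCreate()

events = spark.createDataFrame(
    [("u1", 10), ("u2", 5), ("u1", 7), ("u3", 2)],
    ["user_id", "amount"],
)

per_user = events.groupBy("user_id").agg(F.sum("amount").alias("total"))

# Physical plan: HashAggregate (partial, stage 1) -> Exchange hashpartitioning
# (the reshuffle) -> HashAggregate (final, stage 2).
per_user.explain()
per_user.show()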

JOIN under the hood:
- Each record in each partition is viewed as a key: value pair.
- The key: value pairs with the same key are then reshuffled into the same partition.
- Each output record is reduced to the form (key, yellow record, green record), i.e. the key plus the matching record from each side of the join.
- Based on the records that end up in the same partition and the type of the join, Spark chooses which records are returned (see the sketch below).
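A minimal join sketch with two toy DataFrames; broadcast joins are disabled so the plan shows both sides being shuffled on the key before the join is performed:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").appName("join-sketch").getOrCreate()

# Disable broadcast joins so the shuffle on both sides is visible in the plan.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

users = spark.createDataFrame([("u1", "US"), ("u2", "DE")], ["user_id", "country"])
orders = spark.createDataFrame([("u1", 30), ("u1", 12), ("u3", 9)], ["user_id", "amount"])

joined = users.join(orders, on="user_id", how="left")

# Plan: Exchange hashpartitioning on user_id for each side (records with the
# same key land in the same partition), then a sort-merge join; the join type
# (left here) decides which combined records survive.
joined.explain()
joined.show()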
