df.repartition(partitionsNum)
→ Repartitions a DataFrame into whatever number of partitions you specify; this is a DataFrame method (not a SparkSession method) and it triggers a full shuffle of the data.
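A minimal PySpark sketch, assuming a local session and a hypothetical Parquet file at data/events.parquet:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").appName("repartition-sketch").getOrCreate()

df = spark.read.parquet("data/events.parquet")   # hypothetical input path
print(df.rdd.getNumPartitions())                 # partition count Spark chose on read

repartitioned = df.repartition(16)               # full shuffle into 16 partitions
print(repartitioned.rdd.getNumPartitions())      # -> 16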
Data Locality in Hadoop/HDFS:
- Hadoop achieves data locality by running the code on the node that already holds the data that code needs, minimizing I/O as much as possible.

- In Spark → the data partitions live in S3/GCS/ADLS Gen2, not on the Spark executors themselves, so each executor first has to load its share of the data from the object store onto the node and only then starts executing (see the sketch below).
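A minimal sketch of that read path, assuming a hypothetical s3a:// bucket path (and the hadoop-aws connector on the classpath); the executors pull these bytes over the network before any task runs on them:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("object-store-read-sketch").getOrCreate()

# Hypothetical bucket path; each executor fetches its splits over the network
# from S3, so there is no HDFS-style data locality to exploit.
df = spark.read.parquet("s3a://my-bucket/events/")
print(df.rdd.getNumPartitions())  # driven by file sizes and split settings, not node placement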
GROUP BY under the hood:
Stage #1:
- Typically at the start of the Spark job, each Spark executor gets assigned a data partition.
- In stage 1, the main goal for each Spark executor is to group the data within its own partition (see the sketch below).
- Intermediate output: the partially grouped data each executor produces at the end of stage 1.
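A rough sketch of the per-partition grouping, using mapPartitions on a toy RDD of (key, 1) pairs so the intermediate output is visible (local session assumed; the helper function name is made up for illustration):

from collections import Counter
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").appName("stage1-sketch").getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 1), ("c", 1)] * 1000, numSlices=4)

def combine_within_partition(records):
    # Each executor only sees its own partition here; the per-partition
    # counts it emits are the "intermediate output" of stage 1.
    counts = Counter()
    for key, value in records:
        counts[key] += value
    return iter(counts.items())

intermediate = pairs.mapPartitions(combine_within_partition)
print(intermediate.glom().collect())  # partial counts per partition, before any shuffle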

Stage #2:
- The main goal of this stage is to reshuffle the data across partitions so that all records belonging to the same group end up in the same partition.
- The data for each group is then reduced into a single record (see the sketch below).
- This process is called an external merge sort: when the data being grouped does not fit in memory, Spark sorts it, spills sorted runs to disk, and merges them back together.
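Both stages together with the DataFrame API, as a minimal sketch with toy data; explain() shows the partial aggregate (stage 1), the Exchange (the reshuffle), and the final aggregate (stage 2):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[4]").appName("groupby-sketch").getOrCreate()

events = spark.createDataFrame(
    [("u1", 10), ("u2", 5), ("u1", 7), ("u3", 2)],
    ["user_id", "amount"],
)

per_user = events.groupBy("user_id").agg(F.sum("amount").alias("total"))

# Physical plan: HashAggregate (partial, stage 1) -> Exchange hashpartitioning
# (the reshuffle) -> HashAggregate (final, stage 2).
per_user.explain()
per_user.show()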

JOIN under the hood:
- Each record in each partition is viewed as a key: value pair.
- The key: value pairs with the same key are then reshuffled into the same partition.
- Each output record is reduced to the form (key, yellow record, green record), i.e. the key plus the matching record from each side of the join.
- Based on the records that end up in the same partition and the type of the join, Spark chooses which records are returned (see the sketch below).
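A minimal join sketch with two toy DataFrames; broadcast joins are disabled so the plan shows both sides being shuffled on the key before the join is performed:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").appName("join-sketch").getOrCreate()

# Disable broadcast joins so the shuffle on both sides is visible in the plan.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

users = spark.createDataFrame([("u1", "US"), ("u2", "DE")], ["user_id", "country"])
orders = spark.createDataFrame([("u1", 30), ("u1", 12), ("u3", 9)], ["user_id", "amount"])

joined = users.join(orders, on="user_id", how="left")

# Plan: Exchange hashpartitioning on user_id for each side (records with the
# same key land in the same partition), then a sort-merge join; the join type
# (left here) decides which combined records survive.
joined.explain()
joined.show()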
