1- Catalyst Optimizer:
- Responsible for rule-based optimizations such as predicate pushdown, column pruning, and constant folding
2- Cost-Based Optimizer (CBO):
- Responsible for cost-based optimization, e.g., picking join strategies and join order based on table statistics
3- Code Optimization:
1️⃣ Shuffle Optimization
Shuffles are expensive because they involve:
- disk I/O
- network transfer
- serialization
Common shuffle triggers:
- join
- groupBy
- distinct
- orderBy
- repartition
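To see why these operations force a shuffle, here is a minimal plain-Python sketch (no Spark needed; the function and variable names are illustrative, not Spark API): rows with the same key can start out on different executors, so before a groupBy they must be re-routed over the network to the partition that owns their key.

```python
# Sketch: why groupBy needs a shuffle. Rows with the same key may live in
# different partitions, so they must be re-routed (hash-partitioned) to the
# same place before grouping can happen locally. Illustrative, not Spark API.

def hash_partition(records, num_partitions):
    """Route each (key, value) record to the partition that owns its key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions

# Two input partitions with keys interleaved across them (pre-shuffle layout).
part_a = [("us", 1), ("eu", 2)]
part_b = [("us", 3), ("eu", 4)]

# The "shuffle": every record crosses to its target partition.
shuffled = hash_partition(part_a + part_b, num_partitions=2)

# After the shuffle each key lives entirely in one partition,
# so the grouping/aggregation step is purely local.
for part in shuffled:
    totals = {}
    for key, value in part:
        totals[key] = totals.get(key, 0) + value
    print(totals)
```

Every record in this toy example crosses partition boundaries at most once, but on a real cluster that crossing is serialization, disk spill, and network transfer, which is exactly the cost listed above.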
Optimization techniques:
- broadcast joins → the small table is copied (broadcast) to every executor, so each executor performs the join locally and the large table is never shuffled
- when to use?
- when one side of the join is small (roughly < 10–100 MB, e.g., up to a few million rows), because the entire table must fit in each executor's memory; Spark's automatic threshold, spark.sql.autoBroadcastJoinThreshold, defaults to 10 MB
- reduce shuffle partitions → tune spark.sql.shuffle.partitions (default 200) so the number of post-shuffle partitions matches the data volume
- choose partition keys carefully
- Good keys are:
- high cardinality
- evenly distributed
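As a sketch of what a broadcast hash join actually does (plain Python, no Spark required; the table contents and names are made up): the small table becomes an in-memory lookup map that every executor receives, so each partition of the large table is joined locally without moving its rows.

```python
# Sketch of a broadcast hash join (illustrative, not Spark API).
# The small table becomes a hash map shipped to every executor; each
# large-table partition probes it locally, so the large table is not shuffled.

small_table = [("us", "United States"), ("eu", "Europe")]   # fits in memory
broadcast_map = dict(small_table)                            # the "broadcast" copy

def local_join(large_partition, lookup):
    """Join one partition of the large table against the broadcast map."""
    return [(key, value, lookup[key])
            for key, value in large_partition
            if key in lookup]

# Two partitions of the large table, joined independently and locally.
large_partitions = [[("us", 1), ("eu", 2)], [("us", 3), ("xx", 9)]]
joined = [row for part in large_partitions
          for row in local_join(part, broadcast_map)]
print(joined)
# [('us', 1, 'United States'), ('eu', 2, 'Europe'), ('us', 3, 'United States')]
```

In actual PySpark code you would hint this with pyspark.sql.functions.broadcast, e.g. large_df.join(broadcast(small_df), "key"), or rely on spark.sql.autoBroadcastJoinThreshold to trigger it automatically.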
2️⃣ Partition Optimization
- Partitions determine parallelism.
- Too few partitions → CPU underutilization
- Too many partitions → scheduler overhead
spark.sql.files.maxPartitionBytes → Spark uses this parameter (default 128 MB) to decide the initial number of partitions when reading files
- Typical rule:
- Partitions ≈ 2–4 × CPU cores
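The rule of thumb above can be expressed as a tiny helper (plain Python; the function name and cluster numbers are illustrative, not Spark API):

```python
# Rule of thumb: target 2-4 partitions per CPU core, so every core stays
# busy without drowning the scheduler in tiny tasks. Illustrative helper.

def suggest_partitions(total_cores, factor=3):
    """Suggest a partition count of roughly factor x total cores."""
    if not 2 <= factor <= 4:
        raise ValueError("factor should stay in the 2-4 range")
    return total_cores * factor

# Example: a cluster with 16 executors x 4 cores = 64 total cores.
print(suggest_partitions(64, 2))   # lower bound: 128
print(suggest_partitions(64))      # middle: 192
print(suggest_partitions(64, 4))   # upper bound: 256
```

In Spark you would apply the chosen number via df.repartition(n) or by setting spark.sql.shuffle.partitions for post-shuffle stages.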