1- Catalyst Optimizer:
- Responsible for rule-based optimizations such as predicate pushdown, column pruning, and constant folding
2- Cost-Based Optimizer (CBO):
- Responsible for cost-based optimization, e.g., picking join strategies and join order based on table statistics
3- Code Optimization:
1️⃣ Shuffle Optimization
Shuffles are expensive because they involve:
- disk I/O
- network transfer
- serialization
Common shuffle triggers:
- join
- groupBy
- distinct
- orderBy
- repartition
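To see why these operations force a shuffle, here is a minimal plain-Python sketch (no Spark needed; the function and variable names are illustrative, not Spark API): rows with the same key can start out on different executors, so before a groupBy they must be re-routed over the network to the partition that owns their key.

```python
# Sketch: why groupBy needs a shuffle. Rows with the same key may live in
# different partitions, so they must be re-routed (hash-partitioned) to the
# same place before grouping can happen locally. Illustrative, not Spark API.

def hash_partition(records, num_partitions):
    """Route each (key, value) record to the partition that owns its key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions

# Two input partitions with keys interleaved across them (pre-shuffle layout).
part_a = [("us", 1), ("eu", 2)]
part_b = [("us", 3), ("eu", 4)]

# The "shuffle": every record crosses to its target partition.
shuffled = hash_partition(part_a + part_b, num_partitions=2)

# After the shuffle each key lives entirely in one partition,
# so the grouping/aggregation step is purely local.
for part in shuffled:
    totals = {}
    for key, value in part:
        totals[key] = totals.get(key, 0) + value
    print(totals)
```

Every record in this toy example crosses partition boundaries at most once, but on a real cluster that crossing is serialization, disk spill, and network transfer, which is exactly the cost listed above.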
Optimization techniques:
- broadcast joins → the small table is copied (broadcast) to every executor, so each executor performs the join locally and the large table is never shuffled
- when to use?
- when one side of the join is small (roughly < 10–100 MB, e.g., up to a few million rows), because the entire table must fit in each executor's memory; Spark's automatic threshold, spark.sql.autoBroadcastJoinThreshold, defaults to 10 MB
- reduce shuffle partitions → tune spark.sql.shuffle.partitions (default 200) so the number of post-shuffle partitions matches the data volume
- choose partition keys carefully
- Good keys are:
- high cardinality
- evenly distributed
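As a sketch of what a broadcast hash join actually does (plain Python, no Spark required; the table contents and names are made up): the small table becomes an in-memory lookup map that every executor receives, so each partition of the large table is joined locally without moving its rows.

```python
# Sketch of a broadcast hash join (illustrative, not Spark API).
# The small table becomes a hash map shipped to every executor; each
# large-table partition probes it locally, so the large table is not shuffled.

small_table = [("us", "United States"), ("eu", "Europe")]   # fits in memory
broadcast_map = dict(small_table)                            # the "broadcast" copy

def local_join(large_partition, lookup):
    """Join one partition of the large table against the broadcast map."""
    return [(key, value, lookup[key])
            for key, value in large_partition
            if key in lookup]

# Two partitions of the large table, joined independently and locally.
large_partitions = [[("us", 1), ("eu", 2)], [("us", 3), ("xx", 9)]]
joined = [row for part in large_partitions
          for row in local_join(part, broadcast_map)]
print(joined)
# [('us', 1, 'United States'), ('eu', 2, 'Europe'), ('us', 3, 'United States')]
```

In actual PySpark code you would hint this with pyspark.sql.functions.broadcast, e.g. large_df.join(broadcast(small_df), "key"), or rely on spark.sql.autoBroadcastJoinThreshold to trigger it automatically.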
2️⃣ Partition Optimization
- Partitions determine parallelism.
- Too few partitions → CPU underutilization
- Too many partitions → scheduler overhead
spark.sql.files.maxPartitionBytes → Spark uses this parameter (default 128 MB) to decide the initial number of partitions when reading files
- Typical rule:
- Partitions ≈ 2–4 × CPU cores
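The rule of thumb above can be expressed as a tiny helper (plain Python; the function name and cluster numbers are illustrative, not Spark API):

```python
# Rule of thumb: target 2-4 partitions per CPU core, so every core stays
# busy without drowning the scheduler in tiny tasks. Illustrative helper.

def suggest_partitions(total_cores, factor=3):
    """Suggest a partition count of roughly factor x total cores."""
    if not 2 <= factor <= 4:
        raise ValueError("factor should stay in the 2-4 range")
    return total_cores * factor

# Example: a cluster with 16 executors x 4 cores = 64 total cores.
print(suggest_partitions(64, 2))   # lower bound: 128
print(suggest_partitions(64))      # middle: 192
print(suggest_partitions(64, 4))   # upper bound: 256
```

In Spark you would apply the chosen number via df.repartition(n) or by setting spark.sql.shuffle.partitions for post-shuffle stages.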