Spark & Databricks Optimization
Systematic optimization across eight performance domains — from predicate pushdown through join optimization and cluster sizing.
Your Skill Path
30 modules · Masterclasses, hands-on scenarios & timed mock tests
Read Optimization — Predicate Pushdown, Column/Row Elimination, Cache & Dynamic Pruning
A Delta table query scans all 500 partitions despite a date filter — eliminate unnecessary scan overhead
A reporting job reads an 80-column table but uses only 5 — reduce I/O using column elimination
Spark cache is applied to a DataFrame but job performance worsens — diagnose metadata overhead and fix caching strategy
Spark Memory Architecture — Allocation, Spill Detection & OOM Handling
A PySpark aggregation job is spilling 60GB to disk — identify spill using Spark UI and tune without changing business logic
A Spark job fails with OOM on the driver — diagnose memory allocation and resolve using configuration changes
A flatMap operation causes data explosion and partition sizes balloon — tune partition size to prevent spill
Partition Tuning — DataFrame Sizing, Memory Tuning, File Open Cost & AQE
A DataFrame with 10,000 shuffle partitions causes excessive task scheduling overhead — calculate and tune the right partition count
Reading thousands of small files causes slow job startup due to high file open cost — tune partition strategy to reduce overhead
Post-shuffle partitions are highly uneven despite AQE being enabled — diagnose and resolve shuffle skew using AQE configuration
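As a rough illustration of the partition-count scenario above, the sketch below estimates a shuffle partition count from the shuffle data volume. The 128 MiB target size, the 64-core cluster, and the helper name `estimate_shuffle_partitions` are illustrative assumptions, not values taken from the course material.

```python
import math


def estimate_shuffle_partitions(shuffle_bytes, target_partition_bytes=128 * 1024**2,
                                total_cores=1):
    """Estimate a shuffle partition count (hypothetical helper).

    Assumes a target partition size (128 MiB by default, an assumed rule of
    thumb) and rounds up to a multiple of the core count so that the final
    wave of tasks keeps every core busy.
    """
    if shuffle_bytes <= 0:
        return total_cores
    raw = math.ceil(shuffle_bytes / target_partition_bytes)
    waves = math.ceil(raw / total_cores)
    return waves * total_cores


# Hypothetical job: 200 GiB of shuffle data on a cluster with 64 cores.
partitions = estimate_shuffle_partitions(200 * 1024**3, total_cores=64)
print(partitions)
```

In a Spark session, a value like this would typically be applied through the `spark.sql.shuffle.partitions` setting, and with AQE enabled Spark may still coalesce partitions at runtime.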
Skew Handling — Data Salting, AQE Skew Join & Haystack Query Optimization
A join on customer_id sends 90% of data to one executor — resolve partition skew using data salting
AQE skew join is enabled but skew persists across a 3-way join — diagnose and resolve
A haystack query filtering on a low-cardinality column takes 45 min — identify the skewed key and optimize the query plan
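To make the salting idea from the scenarios above concrete, here is a small pure-Python simulation, not Spark code, of how appending a random salt to a hot key spreads its rows across partitions. The partition count, the salt bucket count, the 90/10 data mix, and the helper names are all assumptions chosen for illustration.

```python
import random
import zlib
from collections import Counter

NUM_PARTITIONS = 8   # assumed number of shuffle partitions
SALT_BUCKETS = 16    # assumed number of salt values; tune to the severity of skew


def partition_of(key: str) -> int:
    """Deterministic stand-in for hash partitioning of a join key."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def salted_key(customer_id: str, rng: random.Random) -> str:
    """Append a random salt so one hot key becomes up to SALT_BUCKETS keys."""
    return f"{customer_id}_{rng.randrange(SALT_BUCKETS)}"


rng = random.Random(42)
# Synthetic data: 90% of rows belong to a single hot customer_id.
rows = ["cust_hot"] * 9000 + [f"cust_{i}" for i in range(1000)]

plain_load = Counter(partition_of(k) for k in rows)
salted_load = Counter(partition_of(salted_key(k, rng)) for k in rows)

print("largest partition without salting:", max(plain_load.values()))
print("largest partition with salting:   ", max(salted_load.values()))
```

In an actual join, the other side of the join has to be replicated once per salt value so that every salted key still finds its match; that replication is the cost paid for the more even distribution.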
Join Optimization — Bucketing, AQE Shuffle Join, Broadcast Problems & Intermediate Result Reuse
A pipeline re-runs the same large join 4 times across different stages — optimize by reusing intermediate join results
Two large tables shuffle fully on every run — implement bucketing to eliminate repeated shuffle overhead
A shuffle join with AQE enabled still causes excessive spill — tune AQE join configuration to fix spill
A broadcast join hint causes driver OOM when the broadcasted table unexpectedly grows — diagnose and resolve
File & Storage Optimization — Small File Problem, Auto Compact & Optimize Strategies
A streaming pipeline writing micro-batches creates thousands of small files per hour — resolve beyond Delta's native compaction
Auto Compact is enabled but query performance remains poor after 30 days of incremental writes — tune the optimize strategy
UDF Optimization — Apache Arrow, Pandas UDFs & Vectorization
A Python UDF processing 100M rows runs 10x slower than equivalent SQL — migrate to a Pandas UDF with Arrow optimization
A vectorized Pandas UDF produces incorrect results for null values — debug and fix the Arrow-based implementation
Cluster Sizing — Estimating Volume, CPU, Memory, Disk & Executor Config for SLAs
A production job consistently misses its 30-min SLA — estimate the right cluster config based on data volume and job profile
A cluster is over-provisioned with 60% idle resources causing high DBU costs — right-size to meet SLA at minimum cost
Ready to get started?
Get a walkthrough of this skill path and see how Enqurious can accelerate your growth on Databricks.