As asked
You have a Spark job on Databricks that is taking three times longer than expected. The Spark UI shows most of the time is in the shuffle read stage. Walk me through how you diagnose what is causing the slow shuffle, and what changes you would make to the job or cluster configuration to fix it.
Sample answer outline
A strong answer checks for data skew first by comparing task durations and shuffle read sizes in the Spark UI stage view, where the summary metrics show the minimum, median, and maximum per task. If one or a few tasks run orders of magnitude longer than the median, skew is the likely cause, and the fix is salting the join key or using a skew hint in Databricks SQL. If skew is not the problem, the candidate should check the shuffle partition count (spark.sql.shuffle.partitions defaults to 200, which can be too low or too high for the data volume), spill to executor local disk, and network bandwidth. They should also mention Adaptive Query Execution in Spark 3, which can coalesce small shuffle partitions and split skewed ones automatically.
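A candidate may be asked to make the skew check and the salting fix concrete. The following is a minimal sketch in plain Python rather than Spark code: the task durations, the row data, and every function name are hypothetical, and the in-memory join only stands in for a distributed join. It shows the max-to-median ratio as a skew signal, and that salting spreads a hot key across several buckets without changing the join result.

```python
import random
from collections import Counter
from statistics import median


def skew_ratio(task_durations_s):
    """Ratio of the slowest task to the median task in one stage."""
    return max(task_durations_s) / median(task_durations_s)


def salt_large_side(rows, num_salts, seed=0):
    """Attach a random salt to each key on the large (skewed) side."""
    rng = random.Random(seed)
    return [((key, rng.randrange(num_salts)), value) for key, value in rows]


def explode_small_side(rows, num_salts):
    """Replicate each small-side row once per salt so every salted key matches."""
    return [((key, salt), value) for key, value in rows for salt in range(num_salts)]


def hash_join(left, right):
    """Plain in-memory equi-join, standing in for the distributed join."""
    lookup = {}
    for key, value in right:
        lookup.setdefault(key, []).append(value)
    return [(key, lv, rv) for key, lv in left for rv in lookup.get(key, [])]


# Hypothetical task durations (seconds) as read off a stage's task table.
durations = [4.0, 5.0, 5.0, 6.0, 480.0]
ratio = skew_ratio(durations)  # 480.0 / 5.0 = 96.0

# Hypothetical join inputs: one hot key dominates the large side.
large = [("hot", i) for i in range(1000)] + [("cold", i) for i in range(10)]
small = [("hot", "H"), ("cold", "C")]

num_salts = 8
salted_large = salt_large_side(large, num_salts)
salted = hash_join(salted_large, explode_small_side(small, num_salts))
unsalted = hash_join(large, small)

# Rows per join key before and after salting: the hot key is split up.
largest_before = max(Counter(key for key, _ in large).values())
largest_after = max(Counter(key for key, _ in salted_large).values())
```

The trade-off worth naming in an answer: the small side grows by a factor of num_salts, so the salt count should be just large enough to break up the hot key.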
Expect these follow-ups
- How would you use Adaptive Query Execution in Databricks Runtime to reduce shuffle overhead automatically?
- Explain the difference between a sort-merge join and a broadcast hash join and when each is appropriate.
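For both follow-ups, it helps if the candidate can name the settings involved. The sketch below lists Spark SQL configuration keys for Adaptive Query Execution and for the broadcast threshold, which is the size below which the planner prefers a broadcast hash join over a sort-merge join. The key names are standard Spark settings; the values are illustrative rather than recommendations, and apply_settings is a hypothetical helper that assumes an object exposing conf.set(key, value), as a SparkSession does.

```python
# Illustrative values only; tune against the actual data volume.
AQE_SETTINGS = {
    # Re-optimize the plan at runtime using shuffle statistics.
    "spark.sql.adaptive.enabled": "true",
    # Merge small shuffle partitions after the shuffle completes.
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    # Split oversized partitions in sort-merge joins.
    "spark.sql.adaptive.skewJoin.enabled": "true",
    # Target partition size used when coalescing or splitting (128 MiB here).
    "spark.sql.adaptive.advisoryPartitionSizeInBytes": str(128 * 1024 * 1024),
    # Tables smaller than this are broadcast instead of shuffled (10 MiB here).
    "spark.sql.autoBroadcastJoinThreshold": str(10 * 1024 * 1024),
}


def apply_settings(spark, settings=AQE_SETTINGS):
    """Hypothetical helper: apply each setting to a session-like object."""
    for key, value in settings.items():
        spark.conf.set(key, value)
    return spark
```

On a real cluster this would be called as apply_settings(spark) before the query runs. A good answer also notes that a broadcast hash join avoids shuffling the large side entirely but requires the small side to fit in memory on each executor.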