Apache Spark: Performance Tuning

Question 1

What is data skew in Apache Spark and why is it a problem?

Accepted Answer

Uneven data distribution across partitions causing some tasks to take much longer than others

Answer

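Data skew occurs when records are distributed unevenly across partitions, so a few tasks process far more data than the rest. Because a stage finishes only when its slowest task does, those oversized tasks become the bottleneck for the whole job.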
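To make the effect concrete, here is a minimal plain-Python sketch (not Spark code) that hash-partitions a key column and counts the records per partition. The dataset, the key names, and the `partition_for` helper are hypothetical; the hash function is only a stand-in for Spark's own hash partitioning.

```python
import hashlib
from collections import Counter


def partition_for(key: str, num_partitions: int) -> int:
    """Deterministic hash partitioner (a stand-in, not Spark's actual hash)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def partition_sizes(keys, num_partitions):
    """Count how many records land in each partition."""
    counts = Counter(partition_for(k, num_partitions) for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


# Hypothetical dataset: one hot key with 9,000 records,
# plus 100 ordinary keys with 10 records each.
keys = ["customer_0"] * 9_000
keys += [f"customer_{i}" for i in range(1, 101) for _ in range(10)]

sizes = partition_sizes(keys, num_partitions=8)
mean_size = sum(sizes) / len(sizes)

print("records per partition:", sizes)
print("largest partition / mean:", max(sizes) / mean_size)
```

All records for the hot key hash to the same partition, so one partition holds at least 9,000 of the 10,000 records while the other seven share the remainder. In a Spark job, the task reading that partition would be the straggler.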

Question 2

What does the broadcast join hint do in Spark SQL?

Accepted Answer

Replicates a small DataFrame to all executors to avoid shuffling the large DataFrame

Answer

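A broadcast join sends a full copy of the smaller DataFrame to every executor, so each partition of the larger DataFrame can be joined locally without an expensive shuffle. The broadcast side must be small enough to fit in memory on the driver and on each executor.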
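The mechanism can be sketched in plain Python as a map-side hash join: the small side becomes an in-memory lookup table, and each partition of the large side is joined against it in place. This is a simulation under assumed data, not Spark's implementation; `broadcast_hash_join`, `orders`, and `countries` are hypothetical names. In PySpark the hint itself is usually written as `large_df.join(broadcast(small_df), "key")`, with `broadcast` imported from `pyspark.sql.functions`.

```python
def broadcast_hash_join(large_partitions, small_rows):
    """Inner-join each partition of the large side against a copy of the small side."""
    # "Broadcast" step: build one lookup table from the small side.
    # In Spark, every executor would hold its own copy of this table.
    lookup = {}
    for key, value in small_rows:
        lookup.setdefault(key, []).append(value)

    joined = []
    for partition in large_partitions:  # large-side rows never change partition
        for key, left_value in partition:
            for right_value in lookup.get(key, []):  # unmatched keys are dropped
                joined.append((key, left_value, right_value))
    return joined


# Hypothetical large side, already split into two partitions.
orders = [
    [(1, "order-a"), (2, "order-b")],
    [(1, "order-c"), (3, "order-d")],
]
# Hypothetical small side.
countries = [(1, "DE"), (2, "FR")]

result = broadcast_hash_join(orders, countries)
print(result)
# [(1, 'order-a', 'DE'), (2, 'order-b', 'FR'), (1, 'order-c', 'DE')]
```

Rows of the large side are never regrouped by key, which is the shuffle the hint avoids. The cost is that the lookup table is duplicated wherever the join runs.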

Question 3

What is the purpose of the spark.sql.shuffle.partitions configuration?

Accepted Answer

Controls the number of partitions used for shuffle operations like joins and aggregations

Answer

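spark.sql.shuffle.partitions sets the number of partitions that Spark SQL and DataFrame operations produce after a shuffle (default 200). Too few partitions lead to large, slow tasks; too many add scheduling overhead and produce many tiny tasks, so the value directly affects the performance of joins and aggregations.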
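One way to reason about a value is to divide the expected shuffle volume by a target partition size. The sketch below assumes a target of 128 MiB per partition, which is a commonly quoted rule of thumb and not something this quiz or Spark prescribes. The function name `suggest_shuffle_partitions` and the 100 GiB figure are hypothetical.

```python
import math


def suggest_shuffle_partitions(
    shuffle_bytes: int,
    target_partition_bytes: int = 128 * 1024 * 1024,  # assumed rule of thumb
    minimum: int = 1,
) -> int:
    """Estimate a partition count so each partition is roughly the target size."""
    return max(minimum, math.ceil(shuffle_bytes / target_partition_bytes))


shuffle_bytes = 100 * 1024**3  # hypothetical stage that shuffles 100 GiB

suggested = suggest_shuffle_partitions(shuffle_bytes)
mib_per_partition_at_default = shuffle_bytes / 200 / 1024**2

print(suggested)                     # 800
print(mib_per_partition_at_default)  # 512.0

# In a PySpark session the value could then be applied with:
#   spark.conf.set("spark.sql.shuffle.partitions", suggested)
```

With the default of 200, each partition in this hypothetical stage would carry about 512 MiB, which is why large shuffles often call for a higher setting. Treat the result as a starting point to check against the actual task sizes in the Spark UI.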

Question 4

What is speculative execution in Apache Spark?

Accepted Answer

Running duplicate copies of slow tasks on other nodes to handle stragglers

Answer

Speculative execution launches duplicate copies of straggler tasks on other nodes; whichever finishes first provides the result.
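The "first copy to finish wins" idea can be sketched with two threads racing on the same task. This is a simplified simulation, not Spark's scheduler: Spark starts a speculative copy only after it detects a slow task and kills the losing attempt, whereas here both copies start together and the slower one simply runs to completion. The function names and delays are hypothetical. In Spark the feature is controlled by the `spark.speculation` setting, which is off by default.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_task(attempt: str, delay_seconds: float) -> str:
    """Pretend to do the task's work, then report which attempt finished."""
    time.sleep(delay_seconds)
    return attempt


def run_with_speculation(original_delay: float, speculative_delay: float) -> str:
    """Run two copies of the same task and return the first result to arrive."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        attempts = {
            pool.submit(run_task, "original", original_delay),
            pool.submit(run_task, "speculative", speculative_delay),
        }
        done, _still_running = wait(attempts, return_when=FIRST_COMPLETED)
        return next(iter(done)).result()


# The original attempt is stuck on a slow node; the duplicate is not.
winner = run_with_speculation(original_delay=0.3, speculative_delay=0.01)
print("result taken from:", winner)
```

Both attempts compute the same result, so taking whichever arrives first is safe only when tasks have no side effects that would be harmful if performed twice.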

Question 5

Which strategy helps avoid data skew in a join operation?

Accepted Answer

Salting the join key with a random prefix

Answer

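Salting adds a random prefix to skewed keys on the large side of the join, so the rows for a hot key spread across multiple partitions instead of overloading a single task. The other side of the join must be replicated once per salt value so that every salted key still finds its match.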
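A minimal plain-Python sketch of salting follows, again simulating partitioning without Spark. The salt range of 8, the key names, the fixed random seed, and the `partition_for` helper are assumptions made for illustration.

```python
import hashlib
import random
from collections import Counter

NUM_PARTITIONS = 8
NUM_SALTS = 8  # assumed salt range; larger values spread hot keys more widely


def partition_for(key: str) -> int:
    """Deterministic hash partitioner (a stand-in, not Spark's actual hash)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS


rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical large side: one hot key with 8,000 rows plus 2,000 distinct keys.
large_keys = ["hot"] * 8_000 + [f"k{i}" for i in range(2_000)]

# Without salting, every "hot" row hashes to the same partition.
unsalted = Counter(partition_for(k) for k in large_keys)

# With salting, each row gets a random prefix in [0, NUM_SALTS).
salted_keys = [f"{rng.randrange(NUM_SALTS)}_{k}" for k in large_keys]
salted = Counter(partition_for(k) for k in salted_keys)

# The small side is replicated once per salt so every salted key can match.
small_keys = ["hot"] + [f"k{i}" for i in range(2_000)]
small_salted = {f"{s}_{k}" for k in small_keys for s in range(NUM_SALTS)}

print("largest partition without salt:", max(unsalted.values()))
print("largest partition with salt:   ", max(salted.values()))
print("all salted keys joinable:", all(k in small_salted for k in salted_keys))
```

The trade-off is visible in `small_salted`: the small side grows by a factor of `NUM_SALTS`, so salting pays off only when the skew it removes costs more than the extra replication.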

Apache Spark Practice Test
