InterviewStack.io

Apache Spark Architecture Questions

Covers core Apache Spark architecture and the programming model, including the roles of the driver and executors, cluster manager options, resource allocation, executor memory and cores, partitions, tasks, stages, and the directed acyclic graph (DAG) used for job execution. Explains lazy evaluation and the distinction between transformations and actions, fault tolerance mechanisms, caching and persistence strategies, partitioning and shuffle behavior, broadcast variables and accumulators, and techniques for performance tuning and handling data skew. Compares Resilient Distributed Datasets, DataFrames, and Datasets, describing when to use each API, the benefits of the DataFrame and Spark SQL APIs driven by the Catalyst optimizer and Tungsten execution engine, and considerations for user-defined functions, serialization, checkpointing, and common data sources and formats.

Medium · Technical
27 practiced
Explain the trade-offs between broadcasting a small table and performing a shuffle join. Describe how Spark decides to use a broadcast join automatically and which configuration parameters (e.g., spark.sql.autoBroadcastJoinThreshold) influence that decision. How can you explicitly force one strategy over the other?
Medium · Technical
22 practiced
Describe partitioning strategies in Spark for large joins and aggregations. When would you use hash partitioning vs range partitioning or a custom partitioner? Give examples for time-series joins, skewed key distributions, and multi-column composite keys.
Medium · Technical
30 practiced
Given a pipeline that performs join -> groupBy -> write, describe strategies to reduce network shuffle and task overhead. Discuss use of spark.sql.shuffle.partitions tuning, map-side combine, broadcasting, repartitioning before expensive stages, and trade-offs of coalesce vs repartition.
Easy · Technical
27 practiced
Explain the roles and responsibilities of the Spark driver and Spark executors in a distributed Spark application. Your answer should cover where the SparkContext lives, how tasks are scheduled from the driver to executors, what metadata/state is held by the driver versus executors, how memory and cores are allocated to each, and the observable failure modes when a driver or an executor fails in production.
Easy · Technical
31 practiced
In PySpark, demonstrate how to persist a DataFrame using an appropriate StorageLevel, ensure it survives transient executor failures if possible, and show how to properly unpersist it. Include a short code example and explain when to choose MEMORY_ONLY, MEMORY_AND_DISK, or serialized storage.
