Apache Spark Architecture Questions

Covers core Apache Spark architecture and the programming model, including the roles of the driver and executors, cluster manager options, resource allocation, executor memory and cores, partitions, tasks, stages, and the directed acyclic graph used for job execution. Explains lazy evaluation and the distinction between transformations and actions, fault tolerance mechanisms, caching and persistence strategies, partitioning and shuffle behavior, broadcast variables and accumulators, and techniques for performance tuning and handling data skew. Compares Resilient Distributed Datasets, DataFrames, and Datasets, describing when to use each API, the benefits of the DataFrame and Spark SQL APIs driven by the Catalyst optimizer and Tungsten execution engine, and considerations for user-defined functions, serialization, checkpointing, and common data sources and formats.

Hard · Technical

27 practiced

You're migrating an on-prem Spark 2.x cluster to Spark 3.x running on Kubernetes for ML workloads. Create a migration checklist covering API compatibility, Scala/Python versions, dependency (jar/conda/pip) handling, changes in default configs (e.g., AQE), shuffle and serializer changes, testing strategies, and a rollback plan.
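One way to approach the "changes in default configs" item is to diff the effective configuration of the old and new clusters before running any job. The sketch below is a minimal illustration in plain Python; `config_diff` is a hypothetical helper, and the values in both dictionaries are made up for the example (the question does not say which cluster manager the on-prem cluster uses, so `yarn` is an assumption, as are the image name and API server address).

```python
def config_diff(old, new):
    """Return {key: (old_value, new_value)} for every key whose value
    differs between the two configs or that exists on one side only."""
    keys = sorted(set(old) | set(new))
    return {k: (old.get(k), new.get(k)) for k in keys if old.get(k) != new.get(k)}


# Illustrative values only; a real checklist would dump these from each cluster.
spark2_conf = {
    "spark.master": "yarn",  # assumed on-prem cluster manager
    "spark.sql.adaptive.enabled": "false",
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
}
spark3_conf = {
    "spark.master": "k8s://https://example.invalid:6443",  # placeholder address
    "spark.sql.adaptive.enabled": "true",
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.kubernetes.container.image": "example/spark-ml:3.x",  # placeholder image
}

for key, (before, after) in config_diff(spark2_conf, spark3_conf).items():
    print(f"{key}: {before!r} -> {after!r}")
```

Keys that appear on only one side show `None` for the missing value, which makes newly required Kubernetes settings easy to spot alongside changed defaults.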

Easy · Technical

28 practiced

When a Spark ML training job runs slowly, which tabs and metrics in the Spark UI and History Server do you inspect first? Provide a short troubleshooting checklist describing what to look for in stages, tasks, executors, storage, and SQL tabs to identify causes such as skew, shuffle bottlenecks, or GC overhead.
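A common first check for skew is to compare the slowest task in a stage with the median task, using the per-task durations shown in the Stages tab. The following is a rough sketch in plain Python; the function names, the threshold of 5, and the sample durations are assumptions for illustration, not values taken from Spark.

```python
from statistics import median


def skew_ratio(task_durations_s):
    """Return the slowest task's duration divided by the median duration."""
    if not task_durations_s:
        raise ValueError("need at least one task duration")
    med = median(task_durations_s)
    slowest = max(task_durations_s)
    if med == 0:
        # Avoid division by zero when most tasks finish instantly.
        return float("inf") if slowest > 0 else 1.0
    return slowest / med


def is_skewed(task_durations_s, threshold=5.0):
    """Flag a stage when one task runs far longer than a typical task.
    The threshold is an arbitrary starting point, not a Spark setting."""
    return skew_ratio(task_durations_s) >= threshold


# Hypothetical task durations in seconds, as copied from one stage.
durations = [12, 11, 13, 12, 14, 240]
print(skew_ratio(durations), is_skewed(durations))
```

A high ratio alongside uneven shuffle read sizes points to skew; a uniformly slow stage with high GC time points elsewhere, such as executor memory pressure.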

Medium · System Design

33 practiced

Design a Spark-based batch feature pipeline that turns 100 TB of raw logs into a 5 TB daily feature store consumed by ML training. Describe partitioning strategy, file format, incremental runs vs full rebuilds, handling small files, and how to minimize shuffle and execution time.
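Sizing arithmetic is a useful starting point for the partitioning and small-files parts of this design. The sketch below assumes binary units, 128 MiB per input split (the default of `spark.sql.files.maxPartitionBytes`), and a 512 MiB target output file size; the target size and the helper name `partition_count` are assumptions chosen for illustration.

```python
TIB = 1024 ** 4
MIB = 1024 ** 2


def partition_count(total_bytes, target_bytes):
    """Ceiling division: how many chunks of target_bytes cover total_bytes."""
    if target_bytes <= 0:
        raise ValueError("target_bytes must be positive")
    return -(-total_bytes // target_bytes)


# Input side: roughly how many read tasks a full scan of the raw logs creates.
input_partitions = partition_count(100 * TIB, 128 * MIB)

# Output side: how many files the daily feature store holds at the target size.
output_files = partition_count(5 * TIB, 512 * MIB)

print(input_partitions, output_files)
```

Numbers of this size suggest why incremental runs over only new log partitions, and compaction of output to a fixed file size, matter more than tuning any single job.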

Medium · Technical

27 practiced

Explain how the Catalyst optimizer and Tungsten execution engine improve DataFrame performance. Provide concrete examples of optimizations Catalyst performs (e.g., predicate pushdown, projection pruning, join reordering) and Tungsten's contribution (whole-stage codegen, off-heap memory). How do these affect ML pipelines?

Medium · Technical

25 practiced

Discuss how to achieve fault-tolerant exactly-once semantics for Structured Streaming when writing to sinks like Kafka and files. Explain the role of checkpointing, idempotent sinks, write-ahead logs (WAL), and common pitfalls when using custom sinks for ML inference or feature materialization.
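The idempotent-sink idea can be sketched without Spark. In Structured Streaming, `foreachBatch` passes a batch identifier to the user function, and a batch may be replayed after a failure, so a custom sink can record committed batch ids and skip replays. The class below is a toy in-memory stand-in written in plain Python; `IdempotentSink` and its method names are hypothetical, and a real sink would need to commit the data and the batch id in a single transaction.

```python
class IdempotentSink:
    """Toy sink that ignores batches it has already committed."""

    def __init__(self):
        self.rows = []
        self.committed_batches = set()

    def write_batch(self, batch_id, rows):
        """Write rows once per batch_id; return False when a replay is skipped."""
        if batch_id in self.committed_batches:
            return False
        # In a real store these two steps must be atomic, otherwise a crash
        # between them produces duplicates or lost data on replay.
        self.rows.extend(rows)
        self.committed_batches.add(batch_id)
        return True


sink = IdempotentSink()
first = sink.write_batch(0, ["a", "b"])
replay = sink.write_batch(0, ["a", "b"])  # simulated retry after a failure
second = sink.write_batch(1, ["c"])
print(first, replay, second, sink.rows)
```

The common pitfall with custom sinks is skipping this check: an inference or feature-materialization sink that appends blindly delivers at-least-once results, even when checkpointing is configured correctly.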
