InterviewStack.io

Apache Spark Architecture Questions

Covers core Apache Spark architecture and programming model, including the roles of the driver and executors, cluster manager options, resource allocation, executor memory and cores, partitions, tasks, stages, and the directed acyclic graph used for job execution. Explains lazy evaluation and the distinction between transformations and actions, fault tolerance mechanisms, caching and persistence strategies, partitioning and shuffle behavior, broadcast variables and accumulators, and techniques for performance tuning and handling data skew. Compares Resilient Distributed Datasets, DataFrames, and Datasets, describing when to use each API, the benefits of the DataFrame and Spark SQL APIs driven by the Catalyst optimizer and Tungsten execution engine, and considerations for user-defined functions, serialization, checkpointing, and common data sources and formats.

Medium · Technical
Explain how the Catalyst optimizer and Tungsten execution engine improve DataFrame performance. Provide concrete examples of optimizations Catalyst performs (e.g., predicate pushdown, projection pruning, join reordering) and Tungsten's contribution (whole-stage codegen, off-heap memory). How do these affect ML pipelines?
Hard · System Design
Design a multi-tenant Spark architecture on Kubernetes intended for a company running ML training and feature pipelines for multiple teams. Cover namespaces, resource quotas, scheduling (node selectors/taints), security (RBAC), autoscaling, isolation to prevent noisy neighbors, and how to handle shared data access.
Hard · Technical
A mission-critical Spark ML job intermittently runs 10x slower than normal and many executors show frequent GC pauses and spills to disk. Provide a systematic debugging and tuning plan listing specific commands, logs to collect (GC logs, executor heap histograms), configuration knobs to change (`spark.executor.memoryOverhead`, `spark.memory.fraction`, serializer changes), and short-term vs long-term fixes.
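
When reasoning about the memory knobs named in this question, it can help to estimate how an executor's heap is divided before changing anything. The following is a minimal sketch in plain Python, not Spark code. The function name `estimate_executor_memory` and the `ExecutorMemory` record are hypothetical, and the constants (a 300 MiB reserve, `spark.memory.fraction` of 0.6, `spark.memory.storageFraction` of 0.5, an overhead factor of 0.10 with a 384 MiB floor) are assumed to match the documented defaults of recent Spark releases, so check them against the version you run. The heap the JVM actually reports is usually a little below the configured value, so treat the results as approximations.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed defaults from the Spark configuration docs; verify for your version.
RESERVED_MIB = 300       # fixed reserve taken from the heap before any split
MIN_OVERHEAD_MIB = 384   # floor applied to the default memory overhead


@dataclass(frozen=True)
class ExecutorMemory:
    """Hypothetical record holding the estimated regions, all in MiB."""
    unified_mib: float            # execution + storage (spark.memory.fraction)
    storage_protected_mib: float  # storage share immune to eviction
    user_mib: float               # remainder of the heap for user objects
    overhead_mib: float           # off-heap allowance (memoryOverhead)
    container_mib: float          # heap + overhead requested from the cluster


def estimate_executor_memory(
    heap_mib: float,
    memory_fraction: float = 0.6,
    storage_fraction: float = 0.5,
    overhead_factor: float = 0.10,
    overhead_mib: Optional[float] = None,
) -> ExecutorMemory:
    """Estimate how an executor heap of `heap_mib` is divided."""
    if heap_mib <= RESERVED_MIB:
        raise ValueError("heap must be larger than the reserved region")
    usable = heap_mib - RESERVED_MIB
    unified = usable * memory_fraction
    if overhead_mib is None:
        # Default overhead: a fraction of the heap, but never below the floor.
        overhead_mib = max(MIN_OVERHEAD_MIB, heap_mib * overhead_factor)
    return ExecutorMemory(
        unified_mib=unified,
        storage_protected_mib=unified * storage_fraction,
        user_mib=usable - unified,
        overhead_mib=overhead_mib,
        container_mib=heap_mib + overhead_mib,
    )


if __name__ == "__main__":
    # Example: an 8 GiB executor heap with the assumed defaults.
    print(estimate_executor_memory(8192))
```

Under these assumptions, an 8 GiB heap leaves roughly 4.6 GiB for execution and storage combined. A job that spills heavily may therefore be short on execution memory even when the heap looks large, and raising `spark.memory.fraction` trades user memory for execution and storage memory rather than creating new capacity.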
Medium · System Design
Design an architecture to join a streaming Kafka events topic with a large static user profile table stored in Parquet (updated daily). Discuss approaches to perform the join in Structured Streaming (broadcasting static profiles, loading as a stream of updates, maintaining state), and trade-offs in memory and latency.
Easy · Technical
Explain the difference between Spark persistence (cache/persist) and checkpointing. In what scenarios would you use checkpointing for long lineage graphs or Structured Streaming state, and what impact does checkpointing have on job recovery and lineage truncation?
