InterviewStack.io

Apache Spark Architecture Questions

Covers core Apache Spark architecture and the programming model: the roles of the driver and executors, cluster manager options, resource allocation, executor memory and cores, partitions, tasks, stages, and the directed acyclic graph (DAG) used for job execution.

Explains lazy evaluation and the distinction between transformations and actions, fault-tolerance mechanisms, caching and persistence strategies, partitioning and shuffle behavior, broadcast variables and accumulators, and techniques for performance tuning and handling data skew.

Compares Resilient Distributed Datasets (RDDs), DataFrames, and Datasets, describing when to use each API; the benefits of the DataFrame and Spark SQL APIs driven by the Catalyst optimizer and Tungsten execution engine; and considerations for user-defined functions, serialization, checkpointing, and common data sources and formats.

Medium · Technical
22 practiced
Discuss the trade-offs between using plain Python UDFs, Pandas (vectorized) UDFs, and native Spark SQL functions for feature transformations in PySpark. When are Pandas UDFs a good fit, and how do you mitigate serialization and memory overhead when using them?
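The core of the trade-off can be shown without a running cluster. Below is a minimal sketch in plain pandas of the two function bodies you would wrap with `pyspark.sql.functions.udf` and `pandas_udf` respectively; the function names and the Fahrenheit-to-Celsius transformation are illustrative, not from any particular codebase.

```python
import pandas as pd

# Row-at-a-time logic, as you would wrap with pyspark.sql.functions.udf:
# each value is pickled across the JVM/Python boundary individually.
def scale_scalar(x: float) -> float:
    return (x - 32.0) * 5.0 / 9.0

# Vectorized logic, as you would wrap with pandas_udf: whole Arrow batches
# move between the JVM and Python, and the work runs on a pandas Series at once.
def scale_vectorized(s: pd.Series) -> pd.Series:
    return (s - 32.0) * 5.0 / 9.0

temps = pd.Series([32.0, 212.0, 98.6])
row_wise = temps.map(scale_scalar)   # simulates the per-row UDF path
batched = scale_vectorized(temps)    # simulates the Arrow/pandas-batch path
assert row_wise.equals(batched)      # same results; very different overheads at scale
```

When batches grow too large, Arrow batch size can be capped via the real config `spark.sql.execution.arrow.maxRecordsPerBatch`; and when an equivalent native Spark SQL function exists, it avoids the Python boundary entirely and stays visible to the Catalyst optimizer.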
Medium · Technical
24 practiced
You have a 500 GB feature extraction job and a cluster with nodes that have 64 GB RAM and 16 cores each. Describe how you would size executors (memory and cores) and decide the number of executors per node for a CPU-bound DataFrame pipeline. Include considerations for memory overhead, network, OS processes, and parallelism trade-offs.
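One common way to answer this is back-of-the-envelope arithmetic. The sketch below applies widely used heuristics (reserve one core and ~1 GB per node for the OS and daemons; about five cores per executor; carve out `spark.executor.memoryOverhead`, which defaults to the larger of 384 MB or 10% of executor memory) to the 64 GB / 16-core nodes in the question. The reserved amounts and cores-per-executor figure are heuristics to tune, not Spark defaults.

```python
# Back-of-the-envelope executor sizing for 64 GB / 16-core nodes.
node_mem_gb, node_cores = 64, 16

os_cores, os_mem_gb = 1, 1   # reserved for the OS and cluster-manager daemons
cores_per_executor = 5       # heuristic: good task parallelism without heavy
                             # GC and HDFS-client contention per JVM

executors_per_node = (node_cores - os_cores) // cores_per_executor    # 3
mem_per_executor_gb = (node_mem_gb - os_mem_gb) / executors_per_node  # 21.0

# spark.executor.memoryOverhead defaults to max(384 MB, 10% of executor
# memory); subtract it so the node is not oversubscribed.
overhead_gb = max(0.384, 0.10 * mem_per_executor_gb)
heap_gb = mem_per_executor_gb - overhead_gb   # ~18.9; round down when configuring

print(f"{executors_per_node} executors per node: "
      f"--executor-cores {cores_per_executor} --executor-memory {int(heap_gb)}g")
```

For a CPU-bound pipeline you might push cores per executor up slightly, since memory pressure is lower; the key check is that (executors × cores) per node never exceeds the physical cores left after the OS reservation.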
Medium · Technical
30 practiced
You launch an ML job and find it is slower than expected. Describe a systematic process to profile and debug the job: what logs, metrics, and experiments would you run to identify whether the cause is data skew, serialization overhead, small files, GC, shuffle I/O, or network contention?
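One concrete experiment in that process is checking partition balance. In PySpark the per-partition record counts can be obtained with `df.rdd.glom().map(len).collect()` or read off the Spark UI task table; the stdlib sketch below then flags skew from those counts. The `skew_ratio` helper and its threshold are illustrative assumptions, not Spark constants.

```python
# Quick skew check on per-partition record counts, e.g. gathered via
# df.rdd.glom().map(len).collect() or from the Spark UI stage/task table.
from statistics import median

def skew_ratio(partition_counts):
    """max/median partition size: ~1 means balanced, >> 1 means skew."""
    med = median(partition_counts)
    return max(partition_counts) / med if med else float("inf")

counts = [1_000, 1_050, 980, 40_000, 1_020]   # one hot partition
ratio = skew_ratio(counts)
if ratio > 5:   # illustrative threshold
    print(f"likely data skew: hottest partition is {ratio:.0f}x the median")
```

A ratio near 1 points the investigation elsewhere (GC logs, shuffle read/write metrics, task deserialization time in the UI); a large ratio suggests remedies such as salting hot keys or enabling adaptive query execution's skew-join handling.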
Easy · Technical
23 practiced
Explain the roles and responsibilities of the Spark driver and executors in a distributed Spark application used for ML model training. In your answer describe what runs on the driver versus executors, how tasks and stages are scheduled, how model checkpoints or aggregated metrics are coordinated, and common driver/executor bottlenecks you might encounter in production.
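The driver/executor split also shows up directly in how a job is configured. Below is an illustrative `spark-submit` for an ML training job; `train.py` and the sizes are placeholders, but the flags and the `spark.driver.maxResultSize` config are real Spark settings.

```shell
# The driver process runs train.py's top level: it builds the DAG, schedules
# stages and tasks, and receives aggregated metrics or collect() results.
# Executors run the tasks themselves and hold cached partitions.
spark-submit \
  --deploy-mode cluster \
  --driver-memory 8g \
  --driver-cores 2 \
  --executor-memory 18g \
  --executor-cores 5 \
  --conf spark.driver.maxResultSize=2g \
  train.py
# spark.driver.maxResultSize caps data pulled back to the driver; collecting
# large model states or evaluation sets past it is a classic driver bottleneck.
```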
Easy · Technical
48 practiced
Compare Spark cluster manager options (YARN, Kubernetes, Standalone, Mesos). For a cloud-native ML team that wants containerized, reproducible training and autoscaling, which manager would you recommend and why? Discuss deployment, isolation, and lifecycle differences important to ML workloads.
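For the containerized, autoscaling case the question describes, a Kubernetes submission is the natural sketch. The API-server address, image, and namespace below are placeholders for your environment; the `spark.kubernetes.*` and dynamic-allocation configs are real Spark settings (shuffle tracking is needed because Kubernetes has no external shuffle service).

```shell
# Sketch of a containerized, autoscaling submission to Kubernetes.
spark-submit \
  --master k8s://https://<api-server>:6443 \
  --deploy-mode cluster \
  --name feature-training \
  --conf spark.kubernetes.container.image=<registry>/spark-ml:latest \
  --conf spark.kubernetes.namespace=ml-training \
  --conf spark.executor.instances=10 \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.shuffleTracking.enabled=true \
  local:///opt/app/train.py
```

Pinning the container image gives the reproducibility the question asks about: the same image runs in CI, notebooks, and production training.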
