InterviewStack.io

Apache Spark Architecture Questions

Covers core Apache Spark architecture and the programming model: the roles of the driver and executors, cluster manager options, resource allocation, executor memory and cores, and how partitions, tasks, and stages are organized into the directed acyclic graph (DAG) used for job execution. Explains lazy evaluation and the distinction between transformations and actions, fault-tolerance mechanisms, caching and persistence strategies, partitioning and shuffle behavior, broadcast variables and accumulators, and techniques for performance tuning and handling data skew. Compares Resilient Distributed Datasets (RDDs), DataFrames, and Datasets, describing when to use each API, the benefits of the DataFrame and Spark SQL APIs driven by the Catalyst optimizer and Tungsten execution engine, and considerations for user-defined functions, serialization, checkpointing, and common data sources and formats.

Medium · Technical
In Scala, implement a custom Partitioner for RDD operations that partitions data by a composite key (user_id, date) in a way that all records for the same user and date go to the same partition. Show the Partitioner class and a short example of applying it to an RDD of ((user_id:String, date:String), value).
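As a starting point for this question, the core logic such a Partitioner must implement can be sketched language-neutrally (Python here for brevity; the question itself asks for Scala's `Partitioner`, which requires `numPartitions` and `getPartition`): a deterministic mapping from the composite key to a partition id.

```python
# Hypothetical sketch of the logic a custom Spark Partitioner encapsulates:
# map a composite key (user_id, date) deterministically to a partition id,
# so all records sharing the same key land in the same partition.
import zlib

def composite_key_partition(key, num_partitions):
    """Return a stable partition id in [0, num_partitions) for (user_id, date)."""
    user_id, date = key
    # Use a stable hash (zlib.crc32) rather than Python's per-process-salted
    # hash(), because the mapping must be identical on every executor.
    h = zlib.crc32(f"{user_id}|{date}".encode("utf-8"))
    return h % num_partitions

# All records for the same (user, date) map to one partition:
p1 = composite_key_partition(("u42", "2024-01-01"), 8)
p2 = composite_key_partition(("u42", "2024-01-01"), 8)
assert p1 == p2 and 0 <= p1 < 8
```

In the Scala version, the same hash-and-modulo computation goes inside `getPartition(key: Any): Int`, and the RDD is repartitioned with `partitionBy` using the custom partitioner.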
Hard · Technical
Provide a deep comparison of the internal representations and execution paths of RDDs, DataFrames, and Datasets. Explain how Catalyst transforms a logical plan into a physical plan, what the resulting physical operators might look like, and how Tungsten changes memory layout and code generation to improve CPU and cache efficiency.
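A strong answer might open by showing what rule-based plan rewriting looks like on a toy example. The sketch below is a hypothetical, greatly simplified model (nested tuples standing in for Catalyst's tree nodes, not Spark's actual code) of one pushdown-style rule reordering a logical plan:

```python
# Toy illustration of a Catalyst-style rewrite rule (NOT Spark's real
# implementation): push a Filter beneath a Project in a logical-plan tree,
# so rows are discarded before columns are materialized.

def push_filter_below_project(plan):
    """Apply one rewrite rule bottom-up over a plan of nested tuples."""
    if not isinstance(plan, tuple):
        return plan
    op, *args = plan
    args = [push_filter_below_project(a) for a in args]
    # Rule: Filter(pred, Project(cols, child)) -> Project(cols, Filter(pred, child))
    if op == "Filter" and isinstance(args[1], tuple) and args[1][0] == "Project":
        pred, (_, cols, child) = args
        return ("Project", cols, ("Filter", pred, child))
    return (op, *args)

logical = ("Filter", "age > 30", ("Project", ["name", "age"], ("Scan", "people")))
optimized = push_filter_below_project(logical)
assert optimized == ("Project", ["name", "age"],
                     ("Filter", "age > 30", ("Scan", "people")))
```

The answer can then relate this to the real pipeline: Catalyst applies many such rules to reach an optimized logical plan, selects physical operators (e.g. choosing a join strategy) to form the physical plan, and Tungsten's whole-stage code generation compiles chains of those operators into tight bytecode.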
Hard · Technical
You are migrating a multi-tenant Spark deployment from YARN to Kubernetes. Describe the key architectural changes and operational considerations such as pod resource isolation, executor memoryOverhead, dynamic allocation differences, logging/metrics collection, security (IAM, RBAC), and how to handle native dependencies and HDFS access from pods.
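An answer could anchor the operational points in concrete configuration. The fragment below is a hypothetical spark-defaults.conf sketch for a Kubernetes deployment; the API server host, namespace, image, and sizes are placeholders, and shuffle tracking stands in for the external shuffle service that dynamic allocation typically relied on under YARN:

```properties
# Hypothetical spark-defaults.conf sketch for Spark on Kubernetes
spark.master                      k8s://https://<api-server-host>:6443
spark.kubernetes.namespace        analytics
spark.kubernetes.container.image  registry.example.com/spark:3.5.0
# RBAC-scoped service account for the driver pod (placeholder name)
spark.kubernetes.authenticate.driver.serviceAccountName  spark-sa

# Pod limits are hard: size off-heap/native memory explicitly so executors
# are not OOM-killed by the kubelet.
spark.executor.memory             8g
spark.executor.memoryOverhead     2g

# No external shuffle service on Kubernetes: enable shuffle tracking so
# dynamic allocation can still release executors safely.
spark.dynamicAllocation.enabled                   true
spark.dynamicAllocation.shuffleTracking.enabled   true
```

From here the answer can branch into the remaining topics: pod templates for native dependencies and volume mounts, sidecar or DaemonSet log collection, and credentials (IAM roles or mounted secrets) for HDFS/object-store access from pods.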
Hard · Technical
Implement in PySpark a scalable technique for joining a large fact DataFrame with skewed (hot) keys to a dimension DataFrame using salting. Provide code to salt the keys on both sides, perform the join, and then merge the results to remove the salt; explain the trade-offs and how to choose the salt factor.
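The mechanics of salting can be sketched in plain Python so the bookkeeping is visible without a cluster; a real PySpark answer would express the same three steps with DataFrame operations (a random salt column on the fact side, an exploded salt range on the dimension side, a join on the salted key):

```python
# Pure-Python simulation of salting a skewed join (no Spark required).
import random

SALT_FACTOR = 4  # assumption: tune to roughly the skew ratio of the hottest key

fact = [("hot", 1), ("hot", 2), ("hot", 3), ("cold", 4)]  # (key, measure)
dim = {"hot": "H-attrs", "cold": "C-attrs"}               # key -> attributes

# 1. Salt the fact side: append a random suffix in [0, SALT_FACTOR),
#    spreading a hot key across SALT_FACTOR distinct join keys.
salted_fact = [((k, random.randrange(SALT_FACTOR)), v) for k, v in fact]

# 2. Explode the dimension side: one copy of each row per possible salt,
#    so every salted fact key finds its match.
salted_dim = {(k, s): attrs for k, attrs in dim.items() for s in range(SALT_FACTOR)}

# 3. Join on the salted key, then drop the salt from the output.
joined = [(k, v, salted_dim[(k, s)]) for (k, s), v in salted_fact]

assert sorted(joined) == [("cold", 4, "C-attrs"), ("hot", 1, "H-attrs"),
                          ("hot", 2, "H-attrs"), ("hot", 3, "H-attrs")]
```

The trade-off to discuss: the dimension side grows by the salt factor, so a larger factor spreads the hot key more evenly but inflates the replicated side; the factor should scale with the hottest key's share of the data, not be applied blindly to all keys.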
Easy · Technical
Describe the roles of the Catalyst optimizer and Tungsten execution engine in Spark SQL. Explain in plain terms what they optimize (logical rules, physical planning, code generation, memory layout) and how those optimizations lead to faster DataFrame queries compared to naive transformations.
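One way to make the Tungsten half of this answer concrete is a toy model of its memory-layout idea: instead of one boxed object per field per row, pack fixed-width fields into a single contiguous buffer (in the spirit of Spark's UnsafeRow binary format; the sketch below is plain Python with `struct`, not Spark code):

```python
# Toy sketch of the Tungsten memory-layout idea: fixed-width fields packed
# into one contiguous bytes buffer, accessed by offset arithmetic rather
# than per-row object allocation, which improves cache locality.
import struct

ROW_FMT = "<qd"                      # one int64 + one float64 per row
ROW_SIZE = struct.calcsize(ROW_FMT)  # 16 bytes

rows = [(1, 2.5), (2, 7.0), (3, -1.25)]
buf = b"".join(struct.pack(ROW_FMT, i, x) for i, x in rows)

def read_row(buf, idx):
    """Random access to row idx via offset arithmetic, no object per field."""
    return struct.unpack_from(ROW_FMT, buf, idx * ROW_SIZE)

assert read_row(buf, 1) == (2, 7.0)
assert len(buf) == len(rows) * ROW_SIZE
```

Catalyst's side of the answer is the planning half: rule-based logical optimization (predicate pushdown, constant folding), cost-informed physical planning (e.g. broadcast vs. sort-merge join), and whole-stage code generation that compiles operator chains into tight loops over buffers like the one above.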
