InterviewStack.io

Data Lake and Warehouse Architecture Questions

Designing scalable data platforms for analytical and reporting workloads, including data lakes, data warehouses, and lakehouse architectures. Key topics include storage formats and layout (columnar file formats such as Parquet and table formats such as Iceberg and Delta Lake), partitioning and compaction strategies, metadata management and cataloging, schema evolution and transactional guarantees for analytical data, and cost and performance trade-offs. Cover ingestion patterns for batch and streaming data, including change data capture; data transformation approaches and compute engines for analytical queries; partition pruning and predicate pushdown; query optimization and materialized views; data modeling for analytical workloads; retention and tiering; security and access control; data governance and lineage; and integration with business intelligence and real-time analytics. Also discuss operational concerns such as monitoring, vacuuming and compaction jobs, metadata scaling, and strategies for minimizing query latency while controlling storage cost.

Medium · Technical
77 practiced
Implement a simplified Python utility that, given a list of Parquet file footers (represented as dicts with 'min' and 'max' statistics for a column) and a predicate, returns the subset of files that must be scanned. Input example:
files = [{'path':'f1','min':1,'max':10},{'path':'f2','min':11,'max':20}]
predicate = ('col','>=',12)
Write select_files(files, predicate). Provide Python code.
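One possible answer sketch, assuming every footer describes the same column, the statistics are inclusive bounds, and a file with missing statistics must always be scanned. The operator table `_MAY_MATCH` and the choice of supported operators are illustrative assumptions, not part of the question.

```python
# Each operator maps to a test on a file's inclusive [min, max] range:
# "could at least one row in this file satisfy the predicate?"
_MAY_MATCH = {
    ">":  lambda lo, hi, v: hi > v,
    ">=": lambda lo, hi, v: hi >= v,
    "<":  lambda lo, hi, v: lo < v,
    "<=": lambda lo, hi, v: lo <= v,
    "==": lambda lo, hi, v: lo <= v <= hi,
    # Only a file whose every value equals v can be skipped for "!=".
    "!=": lambda lo, hi, v: not (lo == hi == v),
}


def select_files(files, predicate):
    """Return the files whose min/max statistics cannot rule out a match."""
    _column, op, value = predicate  # the column name is unused in this sketch
    if op not in _MAY_MATCH:
        raise ValueError(f"unsupported operator: {op!r}")
    may_match = _MAY_MATCH[op]

    selected = []
    for f in files:
        lo, hi = f.get("min"), f.get("max")
        if lo is None or hi is None:
            # Without statistics the file cannot be pruned safely.
            selected.append(f)
        elif may_match(lo, hi, value):
            selected.append(f)
    return selected


if __name__ == "__main__":
    files = [{'path': 'f1', 'min': 1, 'max': 10},
             {'path': 'f2', 'min': 11, 'max': 20}]
    print(select_files(files, ('col', '>=', 12)))
```

For the example input, only f2 is kept, because f1's maximum of 10 is below 12. Pruning of this kind is conservative: a selected file may still contain no matching rows, but a skipped file never does.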
Medium · Technical
62 practiced
A reporting team needs daily snapshots of a slowly changing dimension (SCD Type 2) in your warehouse. Explain how you'd implement SCD Type 2 using a data lake/lakehouse architecture. Discuss keys, effective date ranges, updates vs. inserts, and how to query the current and historical state efficiently.
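To make the update-versus-insert mechanics concrete, here is a minimal in-memory sketch, assuming half-open validity ranges [effective_from, effective_to) and an is_current flag. The row layout and the names `apply_scd2` and `as_of_view` are hypothetical; on a real table format the same logic would typically be expressed as a merge that closes the old row and inserts the new one.

```python
from datetime import date


def apply_scd2(dimension, incoming, as_of):
    """Apply one snapshot to an SCD Type 2 dimension and return the new rows.

    dimension: list of dicts with key, attrs, effective_from, effective_to, is_current
    incoming:  dict mapping business key -> attribute dict for this snapshot
    """
    result = [dict(row) for row in dimension]  # leave the input untouched
    current = {row["key"]: row for row in result if row["is_current"]}

    for key, attrs in incoming.items():
        existing = current.get(key)
        if existing is not None and existing["attrs"] == attrs:
            continue  # unchanged: no new version
        if existing is not None:
            # "Update": close the validity range of the old version.
            existing["effective_to"] = as_of
            existing["is_current"] = False
        # "Insert": open a new current version.
        result.append({"key": key, "attrs": dict(attrs),
                       "effective_from": as_of, "effective_to": None,
                       "is_current": True})
    return result


def as_of_view(dimension, when):
    """Return key -> attrs as they were valid on the given date."""
    return {r["key"]: r["attrs"] for r in dimension
            if r["effective_from"] <= when
            and (r["effective_to"] is None or when < r["effective_to"])}


if __name__ == "__main__":
    dim = apply_scd2([], {"c1": {"tier": "silver"}}, date(2024, 1, 1))
    dim = apply_scd2(dim, {"c1": {"tier": "gold"}}, date(2024, 2, 1))
    print(as_of_view(dim, date(2024, 1, 15)))
    print(as_of_view(dim, date(2024, 2, 1)))
```

Filtering on is_current serves current-state queries cheaply, while the date-range predicate in `as_of_view` reconstructs any historical snapshot.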
Medium · Technical
73 practiced
Design an experiment to quantify the impact of file size on query cost and latency for a representative analytic query against Parquet data on S3. Describe the variables to control (file size, row-group size, number of objects), metrics to measure, and how you'd interpret results to choose an optimal file size.
Hard · Technical
76 practiced
A data product requires exactly-once semantics for event ingestion into an analytical table. Explain the difference between at-least-once, at-most-once, and exactly-once delivery. Then propose how you would achieve exactly-once semantics end-to-end using Kafka, Debezium (CDC), and Iceberg or Delta Lake as the sink.
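One common building block for the sink side is an idempotent commit: the source offsets are recorded together with the data, so a replayed batch (at-least-once delivery) is detected and skipped. The sketch below models that idea in memory. `SketchTable` is a hypothetical stand-in, not the Iceberg or Delta Lake API; in a real table format the data files and the offset marker would have to go into one atomic commit.

```python
class SketchTable:
    """In-memory stand-in for a transactional analytical table."""

    def __init__(self):
        self.rows = []
        # partition -> highest source offset already committed
        self.committed_offsets = {}

    def commit(self, partition, batch):
        """Commit a batch of (offset, record) pairs given in offset order.

        Returns the number of records actually written. Records at or below
        the last committed offset are treated as replays and dropped.
        """
        last = self.committed_offsets.get(partition, -1)
        fresh = [(offset, record) for offset, record in batch if offset > last]
        if not fresh:
            return 0
        # Assumption: these two updates happen atomically. In this sketch
        # they are plain in-memory writes with no failure between them.
        self.rows.extend(record for _, record in fresh)
        self.committed_offsets[partition] = fresh[-1][0]
        return len(fresh)


if __name__ == "__main__":
    table = SketchTable()
    table.commit(0, [(0, "a"), (1, "b")])
    # The producer or consumer retries and redelivers offset 1.
    table.commit(0, [(1, "b"), (2, "c")])
    print(table.rows)
```

With at-least-once delivery upstream and this deduplicating commit downstream, each record affects the table once, which is the observable effect that exactly-once semantics asks for.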
Hard · Technical
59 practiced
Describe practical strategies to minimize the impact of metadata-heavy queries on the metadata service (e.g., Glue or Hive Metastore) including caching, query planning separation, and pre-warmed workers. How would you protect the metastore from high-concurrency query spikes?
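Two of the ideas named in the question, caching and protecting the metastore from concurrency spikes, can be sketched as a thin client wrapper: a TTL cache absorbs repeated lookups and a semaphore caps the number of in-flight calls. `CachedMetastoreClient` and its `fetch` callable are hypothetical; no Glue or Hive Metastore API is used or implied.

```python
import threading
import time


class CachedMetastoreClient:
    """Wrap a metadata lookup with a TTL cache and a concurrency cap."""

    def __init__(self, fetch, ttl_seconds=60.0, max_concurrent=4,
                 clock=time.monotonic):
        self._fetch = fetch            # callable: table name -> metadata
        self._ttl = ttl_seconds
        self._clock = clock            # injectable so tests can control time
        self._slots = threading.BoundedSemaphore(max_concurrent)
        self._lock = threading.Lock()
        self._cache = {}               # name -> (fetched_at, metadata)

    def get_table(self, name):
        now = self._clock()
        with self._lock:
            hit = self._cache.get(name)
            if hit is not None and now - hit[0] < self._ttl:
                return hit[1]          # fresh entry: no metastore call
        # At most max_concurrent callers reach the metastore at once;
        # the rest block here instead of piling onto the service.
        with self._slots:
            value = self._fetch(name)
        with self._lock:
            self._cache[name] = (self._clock(), value)
        return value


if __name__ == "__main__":
    calls = []

    def fake_fetch(name):
        calls.append(name)
        return {"table": name}

    client = CachedMetastoreClient(fake_fetch, ttl_seconds=10)
    client.get_table("events")
    client.get_table("events")
    print(len(calls))
```

This sketch does not coalesce concurrent misses for the same table, so a cold cache can still trigger up to max_concurrent duplicate fetches; a per-key lock or in-flight future map would address that.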
