Data Lake and Warehouse Architecture Questions

Designing scalable data platforms for analytical and reporting workloads, including data lakes, data warehouses, and lakehouse architectures. Key topics include storage formats and layout (columnar file formats such as Parquet and table formats such as Iceberg and Delta Lake), partitioning and compaction strategies, metadata management and cataloging, schema evolution and transactional guarantees for analytical data, and cost and performance trade-offs. Cover ingestion patterns for batch and streaming data, including change data capture; data transformation approaches and compute engines for analytical queries; partition pruning and predicate pushdown; query optimization and materialized views; data modeling for analytical workloads; retention and tiering; security and access control; data governance and lineage; and integration with business intelligence and real-time analytics. Also discuss operational concerns such as monitoring, vacuuming and compaction jobs, metadata scaling, and strategies for minimizing query latency while controlling storage cost.

Medium | Technical

Implement a simplified Python utility that, given a list of Parquet file footers (represented as dicts with 'path', 'min', and 'max' for a single column) and a predicate of the form (column, operator, value), returns the subset of files that must be scanned. Input example:

files = [{'path': 'f1', 'min': 1, 'max': 10}, {'path': 'f2', 'min': 11, 'max': 20}]
predicate = ('col', '>=', 12)

Write select_files(files, predicate). Provide Python code.
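One possible answer, offered as a minimal sketch and not a definitive implementation: a file can be skipped only when its [min, max] range proves that no row can satisfy the predicate. The set of supported operators, the ValueError on unknown operators, and the choice to scan files with missing statistics are assumptions that the question does not specify.

```python
def select_files(files, predicate):
    """Return the files whose min/max statistics cannot rule out the predicate.

    files: list of dicts with 'path', 'min', 'max' for one column.
    predicate: (column_name, operator, value); the column name is unused
    because each footer dict already describes a single column.
    """
    _column, op, value = predicate

    # For each operator: can any value in [lo, hi] satisfy the predicate?
    may_match = {
        ">=": lambda lo, hi: hi >= value,
        ">": lambda lo, hi: hi > value,
        "<=": lambda lo, hi: lo <= value,
        "<": lambda lo, hi: lo < value,
        "==": lambda lo, hi: lo <= value <= hi,
        # Only a file where every row equals `value` can be skipped.
        "!=": lambda lo, hi: not (lo == hi == value),
    }
    if op not in may_match:
        raise ValueError(f"unsupported operator: {op!r}")
    check = may_match[op]

    selected = []
    for f in files:
        lo, hi = f.get("min"), f.get("max")
        # Assumption: missing statistics mean the file must be scanned.
        if lo is None or hi is None or check(lo, hi):
            selected.append(f)
    return selected
```

With the example input, f1 is pruned because its max (10) is below 12, and only f2 is returned. Note that pruning is conservative: a selected file may still contain no matching rows, but a skipped file never contains one.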

Hard | System Design

Design a disaster recovery plan for a data lakehouse spanning multiple regions. Cover backup frequency, cross-region replication of metadata and data, RTO/RPO targets, and a tested failover procedure that ensures data consistency and minimal BI downtime.

Easy | Technical

What metrics would you present to leadership to justify investment in a new lakehouse feature (e.g., adoption of Iceberg or Delta)? Provide 6 KPIs (quantitative and qualitative) that show business and technical value.

Medium | System Design

Design a high-level architecture for ingesting both batch CSV files and real-time events into a unified analytics layer (lake or lakehouse). Include components for: ingestion, storage, metadata cataloging, transformation/compute, and how you'd support both ad-hoc BI and model training. Specify cloud services or OSS tools you would use (e.g., S3, Kafka, Spark, Flink, Iceberg, Glue).

Hard | Technical

Provide a step-by-step design to scale metadata for a table that currently has 10M files and is causing planning timeouts. Include ideas like manifest files, partition pruning, hierarchical partitioning, precomputed manifests per partition, and leveraging table-format metadata. Explain impact on job planning and S3 listing costs.
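To make the idea of precomputed manifests per partition concrete, here is a toy in-memory sketch. It is not the actual manifest format of Iceberg or Delta Lake; the function names build_partition_manifests and plan_scan and the 'partition' key on each file entry are hypothetical. The point it illustrates is that the planner reads only the manifests of the partitions a query touches, so planning cost scales with the files in those partitions and not with the full 10M-file table, and no object-store listing is needed at plan time.

```python
from collections import defaultdict


def build_partition_manifests(files):
    """Group file entries by partition value: one 'manifest' per partition.

    In a real system this would run at write or compaction time, not at
    query-planning time.
    """
    manifests = defaultdict(list)
    for entry in files:
        manifests[entry["partition"]].append(entry)
    return dict(manifests)


def plan_scan(manifests, partitions_wanted, lo, hi):
    """Plan a scan for rows with lo <= col <= hi in the given partitions.

    Returns (selected_paths, entries_read), where entries_read counts the
    manifest entries the planner had to inspect.
    """
    selected = []
    entries_read = 0
    for partition in partitions_wanted:
        # Partition pruning: manifests of other partitions are never opened.
        for entry in manifests.get(partition, []):
            entries_read += 1
            # File-level pruning on column min/max statistics.
            if entry["max"] >= lo and entry["min"] <= hi:
                selected.append(entry["path"])
    return selected, entries_read
```

Comparing entries_read against the total number of files in the table gives a rough proxy for how much planning work the partition-level manifests save.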
