InterviewStack.io

Data Architecture and Pipelines Questions

Designing data storage, integration, and processing architectures. Topics include relational and NoSQL database design, indexing and query optimization, replication and sharding strategies, data warehousing and dimensional modeling, ETL and ELT patterns, batch and streaming ingestion, processing frameworks, feature stores, archival and retention strategies, and trade-offs between scale and latency in large data systems.

Medium · Technical
Propose partitioning strategies for a time-series events table frequently queried by user_id and time range. Compare partition-by-date, partition-by-user-id, and hybrid strategies (date + hash on user), and discuss how each handles common queries, hotspot users, compaction, and retention.
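As one way to picture the hybrid strategy (date + hash on user), the following is a minimal sketch rather than a reference answer. The bucket count, the function names `partition_key` and `partitions_for_query`, and the choice of SHA-256 as the hash are all assumptions made for illustration; timestamps are assumed to be timezone-aware.

```python
import hashlib
from datetime import datetime, timedelta, timezone

NUM_USER_BUCKETS = 16  # hypothetical bucket count, tune to data volume


def partition_key(user_id: str, event_ts: datetime,
                  buckets: int = NUM_USER_BUCKETS) -> tuple[str, int]:
    """Map an event to a (UTC date, user-hash bucket) partition."""
    day = event_ts.astimezone(timezone.utc).date().isoformat()
    # Use a stable hash so the mapping is identical across processes and runs.
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % buckets
    return day, bucket


def partitions_for_query(user_id: str, start: datetime, end: datetime,
                         buckets: int = NUM_USER_BUCKETS) -> list[tuple[str, int]]:
    """List the partitions a (user_id, time range) query has to scan."""
    _, bucket = partition_key(user_id, start, buckets)
    day = start.astimezone(timezone.utc).date()
    last = end.astimezone(timezone.utc).date()
    partitions = []
    while day <= last:
        partitions.append((day.isoformat(), bucket))
        day += timedelta(days=1)
    return partitions
```

Under this layout a single-user query touches one bucket per day in the range, and retention can drop whole date partitions. A very active user still maps to a single bucket per day, so hotspot users may need extra handling such as salting the key.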
Medium · Technical
Queries on a frequently used fact table are slow despite partitioning. As a data scientist collaborating with engineers, outline a diagnostic and optimization plan: what metrics and metadata you would examine (table stats, partition pruning, execution plan), indexing or clustering options, rewriting queries, and when to introduce pre-aggregated tables or materialized views.
Hard · System Design
Design a cost-optimized architecture to run nightly ETL that processes 100 TB/day of raw logs into feature-ready tables. Include compute provisioning strategies (on-demand vs spot/preemptible), checkpointing, job scheduling, storage tiering, and methods to make cloud spend predictable while meeting freshness SLAs.
Medium · Technical
You need to compute sequence-based features for users (order matters) for both training and real-time serving. How would you model and store variable-length event sequences for batch training and for online inference? Describe storage formats, APIs to expose to model code, and windowing semantics to ensure consistency between train and serve.
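One common way to keep windowing semantics consistent is to route both the training pipeline and the online service through the same point-in-time function. The sketch below assumes a hypothetical event schema of `(epoch_seconds, event_type)` tuples already sorted by time; `sequence_as_of`, `max_len`, and `lookback_s` are illustrative names, not part of any particular feature store.

```python
from bisect import bisect_left
from typing import Sequence

Event = tuple[float, str]  # (epoch seconds, event type); hypothetical schema


def sequence_as_of(events: Sequence[Event], as_of: float,
                   max_len: int = 3, lookback_s: float = 3600.0) -> list[Event]:
    """Return up to max_len events strictly before as_of, within the lookback.

    events must be sorted by timestamp in ascending order.
    """
    timestamps = [ts for ts, _ in events]
    # Exclude events at or after as_of so training never sees the future.
    hi = bisect_left(timestamps, as_of)
    # Keep only events at or after the start of the lookback window.
    lo = bisect_left(timestamps, as_of - lookback_s)
    # Cap the sequence length by keeping the most recent events.
    lo = max(lo, hi - max_len)
    return list(events[lo:hi])
```

For training, `as_of` would be the label timestamp of each example; for serving, it would be the request time. Using one function for both paths is a design choice that reduces the chance of train/serve skew.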
Hard · Technical
A data lake has millions of small Parquet files created by many upstream jobs, causing expensive metadata operations and slow queries. Describe compaction strategies: batch compaction scheduling, target file sizing heuristics, safe in-place compaction vs atomic swap patterns, and metrics you would track to measure success without impacting ongoing reads and writes.
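A target file sizing heuristic can be sketched as a greedy grouping of small files within one partition. This is an illustrative planner only: the 256 MiB target, the 0.75 "small file" ratio, and the names `FileInfo` and `plan_compaction` are assumptions, and the actual rewrite and atomic swap steps are out of scope here.

```python
from dataclasses import dataclass

TARGET_BYTES = 256 * 1024 * 1024  # assumed target output file size


@dataclass(frozen=True)
class FileInfo:
    path: str
    size_bytes: int


def plan_compaction(files: list[FileInfo],
                    target_bytes: int = TARGET_BYTES,
                    small_ratio: float = 0.75) -> list[list[FileInfo]]:
    """Group small files into batches whose combined size stays within target_bytes.

    Files already at or above small_ratio * target_bytes are left untouched.
    """
    small = sorted(
        (f for f in files if f.size_bytes < target_bytes * small_ratio),
        key=lambda f: f.size_bytes,
        reverse=True,
    )
    groups: list[list[FileInfo]] = []
    current: list[FileInfo] = []
    current_size = 0
    for f in small:
        # Close the current batch when adding this file would exceed the target.
        if current and current_size + f.size_bytes > target_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(f)
        current_size += f.size_bytes
    if current:
        groups.append(current)
    # Rewriting a single file on its own saves nothing, so drop such batches.
    return [g for g in groups if len(g) > 1]
```

Each returned group would be rewritten into one new file and then swapped in atomically. The number of files per partition before and after a run is one simple metric for judging whether the plan is working.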
