InterviewStack.io

Data Architecture and Pipelines Questions

Designing data storage, integration, and processing architectures. Topics include relational and NoSQL database design, indexing and query optimization, replication and sharding strategies, data warehousing and dimensional modeling, ETL and ELT patterns, batch and streaming ingestion, processing frameworks, feature stores, archival and retention strategies, and trade-offs for scale and latency in large data systems.

Hard · Technical
46 practiced
Describe privacy-preserving techniques you would incorporate into ML data pipelines when working with PII: anonymization/pseudonymization, differential privacy for aggregate statistics, secure multi-party computation (SMPC), and federated learning. Discuss the trade-offs in utility, complexity, and deployment for each approach.
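Of the techniques listed, differential privacy is the easiest to illustrate concretely. A minimal sketch of the Laplace mechanism for an epsilon-DP count (the function names `laplace_noise` and `dp_count` are illustrative, not from any particular library):

```python
import random


def laplace_noise(scale: float) -> float:
    """Sample Laplace(0, scale) as the difference of two iid exponentials."""
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)


def dp_count(true_count: int, epsilon: float) -> float:
    """Release a count with epsilon-differential privacy.

    A counting query has L1 sensitivity 1, so the Laplace mechanism
    adds noise with scale 1/epsilon. Smaller epsilon = more noise,
    stronger privacy, lower utility.
    """
    return true_count + laplace_noise(1.0 / epsilon)
```

The utility trade-off is visible directly in the scale parameter: halving epsilon doubles the expected noise magnitude on the released statistic.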
Easy · Technical
42 practiced
What is a materialized view and how can materialized views be used to accelerate feature computation and analytics queries in a data warehouse? Provide examples of when to refresh materialized views incrementally versus full refresh and how to manage staleness trade-offs.
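The incremental-versus-full refresh distinction can be sketched with a toy in-memory view maintaining `SUM(amount) GROUP BY key` (the class `MaterializedSum` is illustrative, not a real warehouse API; real systems apply the same delta-application idea to change logs):

```python
class MaterializedSum:
    """Toy materialized view: SUM(amount) GROUP BY key.

    full_refresh rebuilds the view from the entire base table;
    incremental_refresh folds in only newly arrived rows, which is
    cheap but requires the aggregate to be mergeable (sums are).
    """

    def __init__(self):
        self.view: dict[str, float] = {}

    def full_refresh(self, base_rows):
        """Rebuild from scratch: correct but scans everything."""
        self.view = {}
        self.incremental_refresh(base_rows)

    def incremental_refresh(self, new_rows):
        """Apply a delta of (key, amount) rows to the existing view."""
        for key, amount in new_rows:
            self.view[key] = self.view.get(key, 0) + amount
```

Staleness is the gap between the base table and the last applied delta; incremental refresh keeps that gap small at low cost, while periodic full refresh guards against drift from non-mergeable corrections (e.g. deletes).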
Medium · Technical
82 practiced
Implement a Python async function that merges two timestamp-ordered async generators of events into a single time-ordered async generator. Ensure the implementation is memory-bounded (doesn't buffer entire streams) and handles cases where one stream lags or ends. Provide pseudocode that could be adapted to real async iterators used in streaming ingestion.
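One way this merge might be sketched, assuming events are `(timestamp, payload)` tuples (an assumption; adapt the comparison key to your event type). It buffers at most one event per input stream, so memory stays bounded regardless of stream length or lag:

```python
async def merge_ordered(a, b):
    """Merge two timestamp-ordered async iterators into one ordered stream.

    Holds at most one buffered event per input. When one stream ends,
    the other is drained; a lagging stream simply makes `await` block
    until its next event arrives.
    """

    async def next_or_none(it):
        try:
            return await it.__anext__()
        except StopAsyncIteration:
            return None

    ea, eb = await next_or_none(a), await next_or_none(b)
    while ea is not None and eb is not None:
        if ea[0] <= eb[0]:
            yield ea
            ea = await next_or_none(a)
        else:
            yield eb
            eb = await next_or_none(b)
    # One stream ended: drain the remainder of the other.
    while ea is not None:
        yield ea
        ea = await next_or_none(a)
    while eb is not None:
        yield eb
        eb = await next_or_none(b)
```

For k > 2 streams the same pattern generalizes with a heap keyed on each stream's buffered head event.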
Medium · Technical
41 practiced
Design an incremental and backfill-friendly ETL strategy for features computed over a 365-day sliding window, ensuring efficient reprocessing after late-arriving data or corrections. Describe how to store partial aggregates, how to merge updates, and how to minimize recomputation and storage while maintaining correctness.
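The core idea, storing one partial aggregate per day so that a late correction touches only that day, can be sketched as follows (the class `SlidingWindowFeature` and a plain sum aggregate are illustrative assumptions; any mergeable aggregate works the same way):

```python
class SlidingWindowFeature:
    """Sliding-window sum built from per-day partial aggregates.

    A late-arriving or corrected record requires recomputing only the
    affected day's partial, not the full window, and the window value
    is reassembled by merging the partials in range.
    """

    def __init__(self, window_days: int = 365):
        self.window_days = window_days
        self.daily: dict[int, float] = {}  # day index -> partial sum

    def upsert_day(self, day: int, partial_sum: float):
        """Backfill or correct a single day's partial aggregate."""
        self.daily[day] = partial_sum

    def window_value(self, as_of_day: int) -> float:
        """Merge partials for the window ending at as_of_day (inclusive)."""
        lo = as_of_day - self.window_days + 1
        return sum(v for d, v in self.daily.items() if lo <= d <= as_of_day)
```

In a real pipeline the daily partials live in a partitioned table and the final merge is a range scan; a segment-tree or coarser monthly tier of partials can reduce the 365-way merge if query latency matters.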
Medium · Technical
45 practiced
Implement pseudocode for a streaming deduplication operator that removes duplicate events based on event_id within a sliding time window. The operator should be memory-bounded by expiring old IDs and should handle watermark-based expiration. Provide state management and eviction logic suitable for Flink-style processing.
