InterviewStack.io

Data Architecture and Pipelines Questions

Designing data storage, integration, and processing architectures. Topics include relational and NoSQL database design, indexing and query optimization, replication and sharding strategies, data warehousing and dimensional modeling, ETL and ELT patterns, batch and streaming ingestion, processing frameworks, feature stores, archival and retention strategies, and trade-offs for scale and latency in large data systems.

Hard · Technical
46 practiced
Describe privacy-preserving techniques you would incorporate into ML data pipelines when working with PII: anonymization/pseudonymization, differential privacy for aggregate statistics, secure multi-party computation (SMPC), and federated learning. Discuss the trade-offs in utility, complexity, and deployment for each approach.
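Of the techniques listed, differential privacy is the easiest to illustrate concretely. A minimal sketch of the Laplace mechanism for an epsilon-DP count (the function names `laplace_noise` and `dp_count` are illustrative, not from any particular library):

```python
import random


def laplace_noise(scale: float) -> float:
    """Sample Laplace(0, scale) as the difference of two iid exponentials."""
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)


def dp_count(true_count: int, epsilon: float) -> float:
    """Release a count with epsilon-differential privacy.

    A counting query has L1 sensitivity 1, so the Laplace mechanism
    adds noise with scale 1/epsilon. Smaller epsilon = more noise,
    stronger privacy, lower utility.
    """
    return true_count + laplace_noise(1.0 / epsilon)
```

The utility trade-off is visible directly in the scale parameter: halving epsilon doubles the expected noise magnitude on the released statistic.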
Easy · Technical
42 practiced
What is a materialized view and how can materialized views be used to accelerate feature computation and analytics queries in a data warehouse? Provide examples of when to refresh materialized views incrementally versus full refresh and how to manage staleness trade-offs.
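The incremental-versus-full refresh distinction can be sketched with a toy in-memory view maintaining `SUM(amount) GROUP BY key` (the class `MaterializedSum` is illustrative, not a real warehouse API; real systems apply the same delta-application idea to change logs):

```python
class MaterializedSum:
    """Toy materialized view: SUM(amount) GROUP BY key.

    full_refresh rebuilds the view from the entire base table;
    incremental_refresh folds in only newly arrived rows, which is
    cheap but requires the aggregate to be mergeable (sums are).
    """

    def __init__(self):
        self.view: dict[str, float] = {}

    def full_refresh(self, base_rows):
        """Rebuild from scratch: correct but scans everything."""
        self.view = {}
        self.incremental_refresh(base_rows)

    def incremental_refresh(self, new_rows):
        """Apply a delta of (key, amount) rows to the existing view."""
        for key, amount in new_rows:
            self.view[key] = self.view.get(key, 0) + amount
```

Staleness is the gap between the base table and the last applied delta; incremental refresh keeps that gap small at low cost, while periodic full refresh guards against drift from non-mergeable corrections (e.g. deletes).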
Medium · Technical
82 practiced
Implement a Python async function that merges two timestamp-ordered async generators of events into a single time-ordered async generator. Ensure the implementation is memory-bounded (doesn't buffer entire streams) and handles cases where one stream lags or ends. Provide pseudocode that could be adapted to real async iterators used in streaming ingestion.
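One way this merge might be sketched, assuming events are `(timestamp, payload)` tuples (an assumption; adapt the comparison key to your event type). It buffers at most one event per input stream, so memory stays bounded regardless of stream length or lag:

```python
async def merge_ordered(a, b):
    """Merge two timestamp-ordered async iterators into one ordered stream.

    Holds at most one buffered event per input. When one stream ends,
    the other is drained; a lagging stream simply makes `await` block
    until its next event arrives.
    """

    async def next_or_none(it):
        try:
            return await it.__anext__()
        except StopAsyncIteration:
            return None

    ea, eb = await next_or_none(a), await next_or_none(b)
    while ea is not None and eb is not None:
        if ea[0] <= eb[0]:
            yield ea
            ea = await next_or_none(a)
        else:
            yield eb
            eb = await next_or_none(b)
    # One stream ended: drain the remainder of the other.
    while ea is not None:
        yield ea
        ea = await next_or_none(a)
    while eb is not None:
        yield eb
        eb = await next_or_none(b)
```

For k > 2 streams the same pattern generalizes with a heap keyed on each stream's buffered head event.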
Medium · Technical
41 practiced
Design an incremental and backfill-friendly ETL strategy for features computed over a 365-day sliding window, ensuring efficient reprocessing after late-arriving data or corrections. Describe how to store partial aggregates, how to merge updates, and how to minimize recomputation and storage while maintaining correctness.
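The core idea, storing one partial aggregate per day so that a late correction touches only that day, can be sketched as follows (the class `SlidingWindowFeature` and a plain sum aggregate are illustrative assumptions; any mergeable aggregate works the same way):

```python
class SlidingWindowFeature:
    """Sliding-window sum built from per-day partial aggregates.

    A late-arriving or corrected record requires recomputing only the
    affected day's partial, not the full window, and the window value
    is reassembled by merging the partials in range.
    """

    def __init__(self, window_days: int = 365):
        self.window_days = window_days
        self.daily: dict[int, float] = {}  # day index -> partial sum

    def upsert_day(self, day: int, partial_sum: float):
        """Backfill or correct a single day's partial aggregate."""
        self.daily[day] = partial_sum

    def window_value(self, as_of_day: int) -> float:
        """Merge partials for the window ending at as_of_day (inclusive)."""
        lo = as_of_day - self.window_days + 1
        return sum(v for d, v in self.daily.items() if lo <= d <= as_of_day)
```

In a real pipeline the daily partials live in a partitioned table and the final merge is a range scan; a segment-tree or coarser monthly tier of partials can reduce the 365-way merge if query latency matters.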
Medium · Technical
45 practiced
Implement pseudocode for a streaming deduplication operator that removes duplicate events based on event_id within a sliding time window. The operator should be memory-bounded by expiring old IDs and should handle watermark-based expiration. Provide state management and eviction logic suitable for Flink-style processing.
