InterviewStack.io

Data Architecture and Pipelines Questions

Designing data storage, integration, and processing architectures. Topics include relational and NoSQL database design, indexing and query optimization, replication and sharding strategies, data warehousing and dimensional modeling, ETL and ELT patterns, batch and streaming ingestion, processing frameworks, feature stores, archival and retention strategies, and trade-offs between scale and latency in large data systems.

Medium | System Design
43 practiced
Design an offline and online feature store architecture that supports both large-scale offline feature materialization for training and low-latency (<10ms) online lookups for serving. Describe storage technologies for offline and online, metadata/registry design, consistency and freshness guarantees, update patterns (batch vs streaming), and API design for training and serving clients.
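One piece of the consistency discussion this question invites is point-in-time correctness: a training row should only see feature values that existed at its label timestamp, while serving reads the latest value. The sketch below is a minimal in-memory illustration under stated assumptions, not a real feature store API. `FeatureLog`, `write`, `as_of`, and `latest` are hypothetical names, and timestamps are plain integers.

```python
from bisect import bisect_right
from collections import defaultdict
from typing import Any, Dict, List, Optional


class FeatureLog:
    """Hypothetical in-memory stand-in for a timestamped feature log."""

    def __init__(self) -> None:
        self._ts: Dict[str, List[int]] = defaultdict(list)
        self._vals: Dict[str, List[Any]] = defaultdict(list)

    def write(self, entity_id: str, ts: int, value: Any) -> None:
        # Insert in timestamp order so late-arriving writes land in the right slot.
        idx = bisect_right(self._ts[entity_id], ts)
        self._ts[entity_id].insert(idx, ts)
        self._vals[entity_id].insert(idx, value)

    def as_of(self, entity_id: str, ts: int) -> Optional[Any]:
        """Offline/training read: latest value with timestamp <= ts."""
        times = self._ts.get(entity_id, [])
        idx = bisect_right(times, ts)
        return self._vals[entity_id][idx - 1] if idx else None

    def latest(self, entity_id: str) -> Optional[Any]:
        """Online/serving read: most recent value, or None if unseen."""
        vals = self._vals.get(entity_id, [])
        return vals[-1] if vals else None
```

In a real design the `as_of` path would typically be a batch join over columnar files and the `latest` path a key-value lookup; the sketch only shows that both reads can be defined over the same timestamped log.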
Easy | Technical
51 practiced
Explain Change Data Capture (CDC) and how you would use CDC to keep a feature store synchronized with an OLTP database. Include components you would use (e.g., Debezium, Kafka), how to handle schema changes, idempotency, and strategies for dealing with out-of-order or late events when computing features.
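To make the idempotency and out-of-order parts of this question concrete, here is a minimal sketch of one common approach: track the highest log position applied per key and skip anything at or below it. It assumes each change event carries a monotonically increasing position from the source database (called `lsn` here). `ChangeEvent` and `FeatureTable` are hypothetical names for illustration, not Debezium or Kafka APIs.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ChangeEvent:
    key: str                      # primary key of the source row
    lsn: int                      # assumed: monotonically increasing log position
    op: str                       # "upsert" or "delete"
    value: Optional[dict] = None  # row payload for upserts


class FeatureTable:
    """Hypothetical in-memory stand-in for the synchronized feature table."""

    def __init__(self) -> None:
        self.rows: Dict[str, dict] = {}
        self.last_lsn: Dict[str, int] = {}  # highest position applied per key

    def apply(self, event: ChangeEvent) -> bool:
        """Apply an event; return False if it is a duplicate or stale."""
        if event.lsn <= self.last_lsn.get(event.key, -1):
            return False  # redelivered or older than what we already applied
        if event.op == "delete":
            self.rows.pop(event.key, None)
        else:
            self.rows[event.key] = dict(event.value or {})
        # Record the position even for deletes so a stale upsert
        # cannot resurrect a deleted row.
        self.last_lsn[event.key] = event.lsn
        return True
```

Because replaying the same event is a no-op, the consumer can use at-least-once delivery and simply reprocess after a crash. Late events for windowed aggregates need more than this (for example watermarks), which the sketch does not cover.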
Hard | Technical
50 practiced
Image reads from S3 are the bottleneck for your multi-GPU training jobs: GPUs idle waiting for IO. Propose concrete optimizations across storage, network, and the training pipeline to increase throughput and GPU utilization. Include changes like file format, caching, prefetch, local SSD staging, and data parallelism considerations.
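The prefetch idea mentioned in this question can be sketched with a bounded queue and a background thread, so loading overlaps with the consumer's compute. This is a simplified single-worker illustration using only the standard library. `prefetch` and its `load` callback are hypothetical names, and a real pipeline would more likely rely on the training framework's own data loader.

```python
import queue
import threading
from typing import Callable, Iterable, Iterator, TypeVar

K = TypeVar("K")
V = TypeVar("V")

_DONE = object()  # sentinel marking the end of the stream


def prefetch(keys: Iterable[K], load: Callable[[K], V], depth: int = 4) -> Iterator[V]:
    """Yield load(key) for each key, loading up to `depth` items ahead."""
    buf: "queue.Queue[object]" = queue.Queue(maxsize=depth)

    def worker() -> None:
        try:
            for key in keys:
                buf.put(load(key))  # blocks when the buffer is full
        finally:
            # Always signal completion. Note: if load() raises, the stream
            # ends early; this sketch does not propagate the error.
            buf.put(_DONE)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = buf.get()
        if item is _DONE:
            return
        yield item  # type: ignore[misc]
```

A single worker preserves input order; using several workers would raise throughput for high-latency object-store reads at the cost of ordering, which usually does not matter for shuffled training data.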
Hard | System Design
47 practiced
Architect a multi-tenant ML data platform that supports tenant isolation, per-tenant quotas on storage/compute, dataset-level access controls, and cost attribution. Describe how namespaces, authentication/authorization, RBAC, resource enforcement (e.g., Kubernetes quotas), and billing metering would be implemented. Include ideas for cross-tenant dataset sharing and auditing.
Medium | Technical
53 practiced
Create a pre-training dataset validation checklist for a dataset before it's allowed to be used to train a production model. Include steps for schema validation, label quality checks, distribution and missing value checks, bias and fairness tests, and required documentation/artifacts for sign-off.
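A few of the checklist items named in this question (schema validation, label checks, missing-value rates) can be automated. The sketch below is one possible shape for such a gate, assuming rows are plain dictionaries. `validate_rows`, its parameters, and the 5% default threshold are illustrative assumptions, not a standard.

```python
from typing import Any, Dict, List, Set


def validate_rows(
    rows: List[Dict[str, Any]],
    schema: Dict[str, type],
    label_column: str,
    allowed_labels: Set[Any],
    max_missing_rate: float = 0.05,  # assumed threshold, tune per dataset
) -> List[str]:
    """Return a list of human-readable problems; empty means the checks passed."""
    if not rows:
        return ["dataset is empty"]
    problems: List[str] = []
    missing = {col: 0 for col in schema}
    for i, row in enumerate(rows):
        # Schema check: every declared column is present with the expected type.
        for col, typ in schema.items():
            val = row.get(col)
            if val is None:
                missing[col] += 1
            elif not isinstance(val, typ):
                problems.append(
                    f"row {i}: {col} expected {typ.__name__}, got {type(val).__name__}"
                )
        # Label check: only known label values are allowed.
        label = row.get(label_column)
        if label is not None and label not in allowed_labels:
            problems.append(f"row {i}: unexpected label {label!r}")
    # Missing-value check: per-column rate against the threshold.
    for col, count in missing.items():
        rate = count / len(rows)
        if rate > max_missing_rate:
            problems.append(
                f"{col}: missing rate {rate:.2f} exceeds {max_missing_rate:.2f}"
            )
    return problems
```

Distribution drift, bias and fairness tests, and sign-off documentation are not covered here; they generally need reference data and human review rather than a per-row check.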
