Data Architecture and Pipelines Questions
These questions cover designing data storage, integration, and processing architectures. Topics include relational and NoSQL database design, indexing and query optimization, replication and sharding strategies, data warehousing and dimensional modeling, ETL and ELT patterns, batch and streaming ingestion, processing frameworks, feature stores, archival and retention strategies, and trade-offs between scale and latency in large data systems.
Medium | Technical
Write a Python script (pseudocode is fine) that reads daily CSV logs from s3://my-bucket/raw/YYYY-MM-DD/*.csv, infers a schema, writes compressed Parquet files partitioned by date (YYYY-MM-DD) into s3://my-bucket/processed/, and handles corrupted rows by logging them to an errors path. Use PyArrow or pandas, and explain how you would integrate schema checks and guard against concurrent uploads.
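One piece of this question, schema inference plus routing of corrupted rows, can be sketched with the standard library alone. This is a minimal sketch, not a full answer: the helper names (`infer_schema`, `split_rows`), the three-type schema, and the choice to infer types from the first few well-formed rows are all assumptions made for illustration. S3 access and the Parquet write are left out.

```python
import csv
import io

# Hypothetical, deliberately small type system for the sketch.
CASTS = {"int": int, "float": float, "str": str}


def _kind(value):
    """Return the narrowest type name that can parse the value."""
    for name in ("int", "float"):
        try:
            CASTS[name](value)
            return name
        except ValueError:
            pass
    return "str"


def infer_schema(header, sample_rows):
    """Infer one type per column from a sample of well-formed rows."""
    schema = {}
    for i, col in enumerate(header):
        kinds = {_kind(row[i]) for row in sample_rows}
        if kinds == {"int"}:
            schema[col] = "int"
        elif kinds and kinds <= {"int", "float"}:
            schema[col] = "float"
        else:
            schema[col] = "str"
    return schema


def split_rows(csv_text, sample_size=2):
    """Split CSV text into typed good rows and rejected rows with a reason."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    rows = list(reader)
    # Assumption: the first few rows with the right column count are clean.
    sample = [r for r in rows if len(r) == len(header)][:sample_size]
    schema = infer_schema(header, sample)
    good, bad = [], []
    # Row numbers start at 2 because the header is row 1.
    for row_no, row in enumerate(rows, start=2):
        try:
            if len(row) != len(header):
                raise ValueError("wrong column count")
            good.append({c: CASTS[schema[c]](v) for c, v in zip(header, row)})
        except ValueError as exc:
            bad.append({"row": row_no, "raw": row, "error": str(exc)})
    return schema, good, bad
```

In a fuller answer, the good rows would be converted to an Arrow table and written with something like `pyarrow.parquet.write_to_dataset(table, root_path, partition_cols=[...])`, and the rejected rows would be written under the errors path. For concurrent uploads, one common approach is to write to a temporary prefix and only then publish a marker or manifest, but that part is not shown here.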
Easy | Technical
What is a feature store and why is it important for reproducible machine learning? As a data scientist, describe the key responsibilities of a feature store (online vs offline store), what metadata you would expect, and how it supports consistent training and serving.
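The online versus offline distinction can be illustrated with a toy in-memory store. This is only a sketch under stated assumptions: `TinyFeatureStore` and its methods are hypothetical names, timestamps are plain integers, and a real feature store would add metadata such as owner, version, and freshness. The point it shows is that serving reads the latest value, while training reads the value as of each label's timestamp, which avoids leaking future data into training rows.

```python
from bisect import bisect_right


class TinyFeatureStore:
    """Toy store keeping the full history of each (entity, feature) pair."""

    def __init__(self):
        # (entity_id, feature) -> list of (timestamp, value), sorted by time
        self._history = {}

    def write(self, entity_id, feature, ts, value):
        rows = self._history.setdefault((entity_id, feature), [])
        rows.append((ts, value))
        rows.sort(key=lambda r: r[0])

    def get_online(self, entity_id, feature):
        """Serving path: the most recent value, or None if never written."""
        rows = self._history.get((entity_id, feature))
        return rows[-1][1] if rows else None

    def get_as_of(self, entity_id, feature, ts):
        """Training path: the latest value written at or before ts."""
        rows = self._history.get((entity_id, feature), [])
        idx = bisect_right([r[0] for r in rows], ts)
        return rows[idx - 1][1] if idx else None
```

A training job would call `get_as_of` with each example's event time, while the model server would call `get_online`. Both read the same feature definition, which is what keeps training and serving consistent.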
Medium | System Design
You must replicate an orders table from an OLTP RDBMS into the analytics warehouse in near real time for feature generation. Design a CDC-based pipeline: pick technologies, describe the initial snapshot versus incremental sync, and explain how you would handle DDL/schema changes, preserve ordering guarantees for multi-table transactions, and ensure idempotent writes in the destination.
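The idempotent-write part of this design can be sketched as a version-guarded upsert. This is a minimal sketch, assuming each change event carries a monotonically increasing log sequence number; the event shape (`op`, `id`, `status`, `lsn`), the table layout, and the use of SQLite as a stand-in destination are all assumptions for illustration. Replayed or out-of-order events are ignored because the update only applies when the incoming `lsn` is newer than the stored one.

```python
import sqlite3


def open_replica():
    """Create an in-memory stand-in for the warehouse copy of orders."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE orders (
               id INTEGER PRIMARY KEY,
               status TEXT,
               lsn INTEGER NOT NULL,
               deleted INTEGER NOT NULL DEFAULT 0
           )"""
    )
    return conn


def apply_event(conn, event):
    """Apply one change event; stale or duplicate events are no-ops."""
    deleted = 1 if event["op"] == "delete" else 0
    conn.execute(
        """INSERT INTO orders (id, status, lsn, deleted)
           VALUES (?, ?, ?, ?)
           ON CONFLICT(id) DO UPDATE SET
               status = excluded.status,
               lsn = excluded.lsn,
               deleted = excluded.deleted
           WHERE excluded.lsn > orders.lsn""",
        (event["id"], event.get("status"), event["lsn"], deleted),
    )
    conn.commit()
```

Deletes are kept as tombstone rows rather than removed, so a late-arriving older update cannot resurrect a deleted order. Most warehouses express the same idea with a `MERGE` statement, though the exact syntax differs by product.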
Medium | Technical
Queries on a frequently used fact table are slow despite partitioning. As a data scientist collaborating with engineers, outline a diagnostic and optimization plan: what metrics and metadata you would examine (table stats, partition pruning, execution plan), indexing or clustering options, rewriting queries, and when to introduce pre-aggregated tables or materialized views.
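Two steps of such a plan, reading the execution plan before and after adding an index, and checking a pre-aggregated table against the raw fact table, can be sketched with SQLite as a stand-in engine. The table, index, and helper names are hypothetical, and a real warehouse would expose its own plan format and clustering options, so this is an illustration of the workflow rather than of any particular system.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the engine's plan description for a query as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_date TEXT, store_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("2024-01-0%d" % (i % 9 + 1), i % 3, float(i)) for i in range(90)],
)

QUERY = "SELECT SUM(amount) FROM sales WHERE sale_date = ?"

# Step 1: inspect the plan; without an index the engine scans the table.
plan_before = query_plan(conn, QUERY, ("2024-01-05",))

# Step 2: add an index on the filter column and inspect the plan again.
conn.execute("CREATE INDEX idx_sales_date ON sales (sale_date)")
plan_after = query_plan(conn, QUERY, ("2024-01-05",))

# Step 3: build a pre-aggregated table and confirm it matches the raw data.
conn.execute(
    "CREATE TABLE daily_sales AS "
    "SELECT sale_date, SUM(amount) AS total FROM sales GROUP BY sale_date"
)
raw_total = conn.execute(QUERY, ("2024-01-05",)).fetchone()[0]
agg_total = conn.execute(
    "SELECT total FROM daily_sales WHERE sale_date = ?", ("2024-01-05",)
).fetchone()[0]
```

Comparing `plan_before` with `plan_after` shows whether the engine switched from a scan to an index search, and comparing `raw_total` with `agg_total` is the kind of correctness check worth running before pointing dashboards at a summary table.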
Hard | Technical
You must implement exactly-once semantics for a streaming aggregation pipeline that computes features and writes to an online store. Describe how you would achieve exactly-once with Apache Flink or Kafka Streams, and detail strategies for sinks that are not idempotent (e.g., external databases).
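For a sink that is not idempotent, one common strategy is to store the consumed offset in the same transaction as the result, so that a replay after a crash can be detected and skipped. The sketch below illustrates that idea with SQLite as a stand-in external database. It does not use the Flink or Kafka Streams APIs, and the table names, the `apply_batch` helper, and the record shape `(offset, user_id, delta)` are assumptions made for illustration.

```python
import sqlite3


def open_sink():
    """Create an in-memory stand-in for the external online store."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE feature_counts (user_id TEXT PRIMARY KEY, clicks INTEGER NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE consumer_offsets (part INTEGER PRIMARY KEY, last_offset INTEGER NOT NULL)"
    )
    return conn


def apply_batch(conn, partition, records):
    """Apply increments and advance the stored offset in one transaction.

    Returns the number of records actually applied; replayed records
    (offset <= stored offset) are skipped.
    """
    applied = 0
    with conn:  # commits on success, rolls back if an exception is raised
        row = conn.execute(
            "SELECT last_offset FROM consumer_offsets WHERE part = ?", (partition,)
        ).fetchone()
        last = row[0] if row else -1
        for offset, user_id, delta in records:
            if offset <= last:
                continue  # already reflected in feature_counts
            conn.execute(
                "INSERT INTO feature_counts VALUES (?, ?) "
                "ON CONFLICT(user_id) DO UPDATE SET clicks = clicks + excluded.clicks",
                (user_id, delta),
            )
            last = offset
            applied += 1
        if applied:
            conn.execute(
                "INSERT INTO consumer_offsets VALUES (?, ?) "
                "ON CONFLICT(part) DO UPDATE SET last_offset = excluded.last_offset",
                (partition, last),
            )
    return applied
```

Because the increment and the offset update commit together, a crash leaves either both or neither applied. This assumes a single writer per partition; frameworks achieve a similar effect through checkpoint-aligned two-phase commit or transactional producers.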