Batch and Stream Processing Questions

Covers the design and implementation of data processing using batch, stream, or hybrid approaches. Candidates should be able to explain when to choose batch versus streaming based on latency, throughput, cost, data volume, and business requirements, and to compare architectural patterns such as lambda and kappa. Core stream concepts include event time versus processing time; windowing strategies such as tumbling, sliding, and session windows; watermarks and late arrivals; event ordering and out-of-order data handling; stateful versus stateless processing; state management and checkpointing; and delivery semantics, including exactly-once and at-least-once. Also includes knowledge of streaming and batch engines and runtimes, connector patterns for sources and sinks, partitioning and scaling strategies, backpressure and flow control, idempotency and deduplication techniques, testing and replayability, monitoring and alerting, and integration with storage layers such as data lakes and data warehouses. The interview focus is on reasoning about correctness, latency, cost, and operational complexity, and on concrete architecture and tooling choices.
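
To make the event-time concepts above concrete, here is a minimal sketch of tumbling-window counting with a watermark and an allowed-lateness bound. It assumes the watermark is simply the largest event time seen so far, which is a simplification; the function name `tumbling_window_counts` and its parameters are illustrative and do not correspond to any particular engine's API.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_size, allowed_lateness=0):
    """Count events per (window_start, key) using event time.

    events: iterable of (event_time, key) tuples in arrival order.
    The watermark here is the maximum event time observed so far
    (an assumption made for this sketch; real engines usually subtract
    an out-of-orderness bound or use source-provided watermarks).
    """
    counts = defaultdict(int)
    late = []
    watermark = float("-inf")
    for event_time, key in events:
        window_start = (event_time // window_size) * window_size
        window_end = window_start + window_size
        # A window is closed once the watermark passes its end plus the lateness bound.
        if window_end + allowed_lateness <= watermark:
            late.append((event_time, key))  # too late: route to a side output
        else:
            counts[(window_start, key)] += 1
        watermark = max(watermark, event_time)
    return dict(counts), late


if __name__ == "__main__":
    arrivals = [(1, "a"), (12, "a"), (3, "a"), (25, "a"), (4, "a")]
    counts, late = tumbling_window_counts(arrivals, window_size=10, allowed_lateness=5)
    print(counts)  # {(0, 'a'): 2, (10, 'a'): 1, (20, 'a'): 1}
    print(late)    # [(4, 'a')]
```

The event at time 3 arrives out of order but is still accepted because its window has not yet closed; the event at time 4 arrives after the watermark has reached 25 and is treated as late.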

Medium · Technical

Design a monitoring and alerting strategy for a streaming ML inference pipeline. List key system and model metrics (e.g., processing latency, watermark lag, throughput, error rates, state size, model accuracy, prediction distribution), threshold policies, dashboards, and runbook actions for common failure modes such as backlog, increased latency, or model drift.
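
One way to reason about threshold policies is to require that a metric breach its threshold for several consecutive samples before an alert fires, which reduces noise from transient spikes. The sketch below assumes this "sustained breach" policy; the `Rule` class, the metric names, the thresholds, and the runbook actions are all hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    metric: str       # metric name as it appears in each sample
    threshold: float  # alert when the value is strictly above this
    sustained: int    # number of consecutive breaching samples required
    action: str       # first runbook step to suggest (illustrative text)


def evaluate(rules, samples):
    """Return (metric, action) for every rule breached by its most recent samples.

    samples: list of dicts mapping metric name to value, oldest first.
    A sample that lacks the metric counts as not breaching.
    """
    fired = []
    for rule in rules:
        recent = samples[-rule.sustained:]
        if len(recent) < rule.sustained:
            continue  # not enough history yet
        if all(s.get(rule.metric, float("-inf")) > rule.threshold for s in recent):
            fired.append((rule.metric, rule.action))
    return fired


if __name__ == "__main__":
    rules = [
        Rule("watermark_lag_s", 60, 3, "scale out consumers"),
        Rule("p99_latency_ms", 500, 2, "check downstream sink"),
    ]
    samples = [
        {"watermark_lag_s": 30, "p99_latency_ms": 600},
        {"watermark_lag_s": 90, "p99_latency_ms": 100},
        {"watermark_lag_s": 95, "p99_latency_ms": 100},
        {"watermark_lag_s": 120, "p99_latency_ms": 100},
    ]
    print(evaluate(rules, samples))  # [('watermark_lag_s', 'scale out consumers')]
```

Model metrics such as prediction distribution shift could be fed through the same rule shape once they are reduced to a scalar score per interval.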

Hard · Technical

Design a streaming deduplication approach for a pipeline where event IDs are not unique until an expensive enrichment step creates a deterministic ID. Enrichment is costly and events can be out-of-order. Propose a pipeline that minimizes redundant enrichment work while guaranteeing deduplication correctness, including state structures, probabilistic pre-filters, and trade-offs in false positives/negatives.
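
A possible starting point is a two-stage design: a cheap fingerprint of the raw payload guards the expensive enrichment step, and an exact set of deterministic IDs guarantees correctness afterwards. The sketch below makes several assumptions: `enrich` is a hypothetical deterministic callable, the fingerprint is a hash of the raw bytes, and in-memory sets stand in for a keyed state store (where the Bloom filter would save lookups against slower storage). State expiry (TTL) is omitted.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item: bytes):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: bytes):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item: bytes) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))


class Deduplicator:
    def __init__(self, enrich):
        self.enrich = enrich          # hypothetical expensive, deterministic step
        self.bloom = BloomFilter()    # cheap pre-filter over raw fingerprints
        self.fingerprints = set()     # exact fingerprints (stand-in for a state store)
        self.seen_ids = set()         # exact deterministic IDs already emitted
        self.enrich_calls = 0

    def process(self, raw: bytes):
        """Return the deterministic ID if the event is new, else None."""
        fp = hashlib.sha256(raw).digest()
        # A Bloom hit is confirmed against the exact set, so a false positive
        # costs one extra lookup but never drops a genuinely new event.
        if self.bloom.might_contain(fp) and fp in self.fingerprints:
            return None  # byte-identical replay: skip enrichment entirely
        self.bloom.add(fp)
        self.fingerprints.add(fp)
        self.enrich_calls += 1
        event_id = self.enrich(raw)
        if event_id in self.seen_ids:
            return None  # different payload, same deterministic ID
        self.seen_ids.add(event_id)
        return event_id


if __name__ == "__main__":
    dedup = Deduplicator(enrich=lambda raw: raw.strip().lower().decode())
    print([dedup.process(e) for e in (b"A", b"A", b" a ", b"b")])  # ['a', None, None, 'b']
    print(dedup.enrich_calls)  # 3
```

Because both checks are set-membership tests, the result does not depend on arrival order. The trade-off is that only byte-identical duplicates avoid enrichment; payloads that differ but map to the same deterministic ID still pay the enrichment cost once each.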

Hard · System Design

Design a stateful online feature store for multi-tenant ML workloads that supports per-entity TTL, cross-region replication, sub-15ms read latency, and optional strong consistency. Discuss storage backend choices (e.g., DynamoDB, Redis, RocksDB+local cache), indexing strategies, replication (sync vs async), conflict resolution, and hot/cold tiering.
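
Two of the mechanisms named above, per-entity TTL and conflict resolution under asynchronous replication, can be illustrated with a small in-memory sketch. It assumes last-write-wins ordering by write timestamp with the region name as a tie-breaker, which is only one of several possible policies; the `FeatureStore` and `Versioned` names are hypothetical, and a real backend would replace the dictionary.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class Versioned:
    value: object
    write_ts: float  # timestamp assigned by the writing region
    region: str      # used only to break timestamp ties deterministically
    ttl_s: float     # per-entity time to live, in seconds


class FeatureStore:
    def __init__(self, clock=time.time):
        self.clock = clock
        self.rows = {}  # (tenant, entity) -> Versioned

    def put(self, tenant, entity, value, ttl_s, region, write_ts=None):
        ts = self.clock() if write_ts is None else write_ts
        self.merge(tenant, entity, Versioned(value, ts, region, ttl_s))

    def merge(self, tenant, entity, incoming):
        """Apply a local or replicated write using last-write-wins."""
        key = (tenant, entity)
        current = self.rows.get(key)
        if current is None or (incoming.write_ts, incoming.region) > (
            current.write_ts,
            current.region,
        ):
            self.rows[key] = incoming

    def get(self, tenant, entity):
        row = self.rows.get((tenant, entity))
        if row is None or self.clock() >= row.write_ts + row.ttl_s:
            return None  # missing or expired
        return row.value


if __name__ == "__main__":
    now = [100.0]
    store = FeatureStore(clock=lambda: now[0])
    store.put("t1", "u1", 0.5, ttl_s=10, region="us", write_ts=100.0)
    store.merge("t1", "u1", Versioned(0.9, 99.0, "eu", 10))  # older replica write loses
    print(store.get("t1", "u1"))  # 0.5
    now[0] = 111.0
    print(store.get("t1", "u1"))  # None
```

Last-write-wins depends on reasonably synchronized clocks across regions; a design that needs the optional strong consistency mentioned in the question would route those reads and writes through a synchronous path instead.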

Medium · Technical

Describe connector strategies for writing streaming predictions or features to warehouses like BigQuery or Snowflake. Discuss batching windows, write amplification, idempotency concerns, cost-control techniques, and schema/partitioning considerations for efficient downstream analytics and training.
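
The batching and idempotency concerns can be sketched without any warehouse client: buffer rows until a size or age limit is reached, then hand the batch to a sink that upserts by a deterministic row ID so that retried batches do not create duplicates. Everything here is an assumption for illustration: the `MicroBatcher` class, the `row_id` field, and the dictionary that stands in for a warehouse table updated by a MERGE-style statement.

```python
class MicroBatcher:
    """Buffer rows and flush when the batch is large enough or old enough."""

    def __init__(self, sink, max_rows=500, max_age_s=60.0):
        self.sink = sink
        self.max_rows = max_rows
        self.max_age_s = max_age_s
        self.buffer = []
        self.opened_at = None
        self.flushes = 0

    def add(self, row, now):
        if not self.buffer:
            self.opened_at = now  # the batch window opens with its first row
        self.buffer.append(row)
        if len(self.buffer) >= self.max_rows or now - self.opened_at >= self.max_age_s:
            self.flush()

    def flush(self):
        if self.buffer:
            self.sink(list(self.buffer))
            self.buffer.clear()
            self.flushes += 1


def make_upsert_sink(table):
    """Sink that upserts by row_id, so replaying a batch is harmless."""

    def sink(rows):
        for row in rows:
            table[row["row_id"]] = row

    return sink


if __name__ == "__main__":
    table = {}
    batcher = MicroBatcher(make_upsert_sink(table), max_rows=2, max_age_s=60.0)
    rows = [
        {"row_id": "u1:0", "score": 0.1},
        {"row_id": "u2:0", "score": 0.7},
        {"row_id": "u1:1", "score": 0.3},
    ]
    for t, row in enumerate(rows):
        batcher.add(row, now=float(t))
    batcher.flush()
    make_upsert_sink(table)(rows[:2])  # simulate a retried batch
    print(len(table), batcher.flushes)  # 3 2
```

Larger batches reduce per-write overhead and cost but increase the delay before rows are queryable, so `max_rows` and `max_age_s` are the knobs that trade freshness against spend.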

Easy · Technical

Describe stateless versus stateful stream processing. Give an example of a stateful operator used for an online ML feature (such as a per-user rolling average) and explain how you would persist and checkpoint that state for fault tolerance and fast recovery.
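
As an illustration of the per-user rolling average mentioned above, the following sketch keeps a bounded buffer per user and snapshots it together with the last processed input offset, so that a restored operator can skip records it has already applied. It assumes a single partition with monotonically increasing offsets; the `RollingAverage` class and the JSON snapshot format are hypothetical, and a production engine would write snapshots to durable storage.

```python
import json
from collections import deque


class RollingAverage:
    """Stateful operator: average of the last `window` values per user."""

    def __init__(self, window=3):
        self.window = window
        self.state = {}   # user_id -> deque of recent values
        self.offset = -1  # last input offset reflected in the state

    def process(self, offset, user_id, value):
        if offset <= self.offset:
            return None  # already applied; skip when input is replayed
        buf = self.state.setdefault(user_id, deque(maxlen=self.window))
        buf.append(value)
        self.offset = offset
        return sum(buf) / len(buf)

    def snapshot(self) -> str:
        """Serialize state and offset together so they stay consistent."""
        return json.dumps(
            {
                "window": self.window,
                "offset": self.offset,
                "state": {u: list(b) for u, b in self.state.items()},
            }
        )

    @classmethod
    def restore(cls, blob: str) -> "RollingAverage":
        data = json.loads(blob)
        op = cls(window=data["window"])
        op.offset = data["offset"]
        op.state = {u: deque(v, maxlen=op.window) for u, v in data["state"].items()}
        return op


if __name__ == "__main__":
    op = RollingAverage(window=3)
    op.process(0, "u1", 1.0)
    op.process(1, "u1", 3.0)
    checkpoint = op.snapshot()

    # Simulate a crash: restore and replay from an offset before the checkpoint.
    recovered = RollingAverage.restore(checkpoint)
    print(recovered.process(1, "u1", 3.0))  # None
    print(recovered.process(2, "u1", 5.0))  # 3.0
```

Storing the offset inside the snapshot is what makes replay safe here; if state and offset were saved separately, a crash between the two writes could double-count or lose an update.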
