Batch and Stream Processing Questions

Covers the design and implementation of data processing using batch, stream, or hybrid approaches. Candidates should be able to explain when to choose batch versus streaming based on latency, throughput, cost, data volume, and business requirements, and to compare architectural patterns such as lambda and kappa. Core stream concepts include event time versus processing time; windowing strategies such as tumbling, sliding, and session windows; watermarks and late arrivals; event ordering and out-of-order data handling; stateful versus stateless processing; state management and checkpointing; and delivery semantics, including exactly-once and at-least-once. Also includes knowledge of streaming and batch engines and runtimes, connector patterns for sources and sinks, partitioning and scaling strategies, backpressure and flow control, idempotency and deduplication techniques, testing and replayability, monitoring and alerting, and integration with storage layers such as data lakes and data warehouses. The interview focus is on reasoning about correctness, latency, cost, and operational complexity, and on concrete architecture and tooling choices.

Easy | Technical

89 practiced

Explain idempotency and deduplication techniques for streaming data. Discuss when to use strict deduplication by event ID, sequence-number enforcement, or stateful TTL-based deduplication, and when approximate methods such as Bloom filters are acceptable, particularly in the context of training-data correctness for AI models.
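
As a starting point for the approximate case, the following is a minimal sketch of Bloom-filter deduplication in plain Python. The names `BloomFilter` and `dedup_approximate`, the `event_id` field, and the default sizes are assumptions chosen for illustration, not part of any particular engine. The trade-off it shows: a Bloom filter never misses a true duplicate, but it can occasionally report a new event as already seen, so some unique events may be dropped.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, tunable false positives."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 5) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: str):
        # Derive the bit positions from two halves of one digest (double hashing).
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


def dedup_approximate(events, bloom=None):
    """Yield events whose event_id has probably not been seen before."""
    if bloom is None:
        bloom = BloomFilter()
    for event in events:
        event_id = event["event_id"]
        if event_id in bloom:
            # A true duplicate, or (rarely) a false positive that drops a new event.
            continue
        bloom.add(event_id)
        yield event
```

Because false positives silently discard unique events, this approach suits cases where a small loss is tolerable; where every training example must be preserved, exact deduplication by event ID is the safer choice.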

Easy | Technical

92 practiced

Describe tumbling, sliding, and session windows in stream processing. For each type, explain typical AI/ML use cases (for example, rolling feature averages, periodic aggregations, and sessionization of user behavior), and explain how trigger policies and allowed lateness change the emitted results and state retention.
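
To make the three window types concrete, here is a small sketch of window assignment in plain Python. It assumes integer, non-negative event timestamps and half-open `[start, end)` windows; the function names are illustrative, and real engines add triggers, watermarks, and allowed lateness on top of this assignment logic.

```python
from __future__ import annotations


def tumbling_window(ts: int, size: int) -> tuple[int, int]:
    """The single fixed, non-overlapping window that contains ts."""
    start = ts - ts % size
    return (start, start + size)


def sliding_windows(ts: int, size: int, slide: int) -> list[tuple[int, int]]:
    """Every window of length `size`, starting each `slide`, that contains ts."""
    last_start = ts - ts % slide
    # Walk backwards while the window still covers ts (start + size > ts).
    starts = range(last_start, ts - size, -slide)
    return sorted((s, s + size) for s in starts)


def session_windows(timestamps: list[int], gap: int) -> list[tuple[int, int]]:
    """Merge events separated by less than `gap` into one session per burst."""
    sessions: list[tuple[int, int]] = []
    for ts in sorted(timestamps):
        if sessions and ts < sessions[-1][1]:
            # Event falls inside the open session, so extend its end.
            sessions[-1] = (sessions[-1][0], ts + gap)
        else:
            sessions.append((ts, ts + gap))
    return sessions
```

With `size=10` and `slide=5`, each event lands in two sliding windows, which is why sliding windows hold roughly `size / slide` times the state of tumbling windows of the same length.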

Hard | Technical

132 practiced

Your streaming job maintains per-user feature state for 500M users (state size in hundreds of GBs to TBs). Explain choices for state backend (in-memory, RocksDB local, external store), checkpointing strategy (incremental, frequency), compaction, TTL, and how to rescale or migrate state with minimal downtime. Include operational trade-offs.

Medium | Technical

88 practiced

Implement an efficient deduplication strategy in Python pseudocode using Apache Beam (or equivalent) that removes duplicate events based on event_id within a 24-hour sliding window while remaining memory efficient. Describe how you would use keyed state and timers, and how you would handle scale and restarts.
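
One way to reason about this before writing engine-specific code is a plain-Python model of keyed state plus expiry timers. The sketch below does not use the Apache Beam API: `KeyedDeduplicator`, `process`, `snapshot`, and `restore` are hypothetical names, and the watermark is passed in explicitly. In a real pipeline the per-key state and the timers would be managed, partitioned, and checkpointed by the runner.

```python
import heapq

DEDUP_TTL_SECONDS = 24 * 60 * 60  # 24-hour retention for seen event IDs


class KeyedDeduplicator:
    """Models per-event_id state with an expiry timer set at first sighting."""

    def __init__(self, ttl: int = DEDUP_TTL_SECONDS) -> None:
        self.ttl = ttl
        self._seen = {}    # event_id -> expiry time (stands in for keyed state)
        self._timers = []  # min-heap of (expiry, event_id) (stands in for timers)

    def _fire_timers(self, watermark: int) -> None:
        # Clear state for every ID whose expiry is at or before the watermark.
        while self._timers and self._timers[0][0] <= watermark:
            expiry, event_id = heapq.heappop(self._timers)
            if self._seen.get(event_id) == expiry:
                del self._seen[event_id]

    def process(self, event_id: str, event_time: int, watermark: int) -> bool:
        """Return True if the event should be emitted, False if it is a duplicate."""
        self._fire_timers(watermark)
        if event_id in self._seen:
            return False
        expiry = event_time + self.ttl
        self._seen[event_id] = expiry
        heapq.heappush(self._timers, (expiry, event_id))
        return True

    def snapshot(self) -> dict:
        """Checkpoint: only the ID-to-expiry map needs to be persisted."""
        return dict(self._seen)

    @classmethod
    def restore(cls, snapshot: dict, ttl: int = DEDUP_TTL_SECONDS) -> "KeyedDeduplicator":
        """Rebuild state and timers from a checkpoint after a restart."""
        dedup = cls(ttl)
        dedup._seen = dict(snapshot)
        dedup._timers = [(expiry, eid) for eid, expiry in snapshot.items()]
        heapq.heapify(dedup._timers)
        return dedup
```

Memory stays bounded because each ID holds a single expiry value and is removed once the watermark passes it. Scaling out amounts to partitioning by `event_id` so that each worker owns a disjoint slice of this map.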

Easy | Technical

80 practiced

Explain the lambda and kappa architectures. For an AI platform that needs both low-latency features and accurate historical aggregates for model training, which architecture would you recommend and why? Discuss trade-offs in operational complexity and reprocessing.
