InterviewStack.io

Batch and Stream Processing Questions

Covers the design and implementation of data processing using batch, stream, or hybrid approaches. Candidates should be able to explain when to choose batch versus streaming based on latency, throughput, cost, data volume, and business requirements, and to compare architectural patterns such as lambda and kappa. Core stream concepts include event time versus processing time; windowing strategies such as tumbling, sliding, and session windows; watermarks and late arrivals; event ordering and out-of-order data handling; stateful versus stateless processing; state management and checkpointing; and delivery semantics, including exactly-once and at-least-once. The topic also includes knowledge of streaming and batch engines and runtimes, connector patterns for sources and sinks, partitioning and scaling strategies, backpressure and flow control, idempotency and deduplication techniques, testing and replayability, monitoring and alerting, and integration with storage layers such as data lakes and data warehouses. The interview focus is on reasoning about correctness, latency, cost, and operational complexity, and on concrete architecture and tooling choices.

Easy · Technical
89 practiced
Explain idempotency and deduplication techniques for streaming data. Discuss when to use strict deduplication by event ID, sequence-number enforcement, and stateful TTL-based dedup, and when approximate methods such as Bloom filters are acceptable, particularly in the context of training data correctness for AI models.
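As one way to ground the exact-versus-approximate trade-off this question asks about, the sketch below contrasts set-based deduplication with a small hand-rolled Bloom filter. It is a minimal illustration, not a production design: the `BloomFilter` class, its sizing parameters, and the helper names `dedup_exact` and `dedup_approx` are all assumptions made for this example.

```python
import hashlib
from typing import Iterable, Iterator, List


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, some false positives."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: str) -> Iterator[int]:
        # Double hashing: derive k bit positions from two halves of one digest.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


def dedup_exact(event_ids: Iterable[str]) -> List[str]:
    """Strict dedup: memory grows with the number of distinct IDs."""
    seen = set()
    out = []
    for event_id in event_ids:
        if event_id not in seen:
            seen.add(event_id)
            out.append(event_id)
    return out


def dedup_approx(event_ids: Iterable[str], bloom: BloomFilter) -> List[str]:
    """Approximate dedup: bounded memory, but a false positive drops a
    genuinely new event, so the output can be missing unique records."""
    out = []
    for event_id in event_ids:
        if not bloom.might_contain(event_id):
            bloom.add(event_id)
            out.append(event_id)
    return out
```

The asymmetry matters for training data: the approximate path never emits a duplicate, but it can silently discard a unique event, which may be acceptable for click-stream metrics and less so for labels or rare-event examples.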
Easy · Technical
92 practiced
Describe tumbling, sliding, and session windows in stream processing. For each type, explain typical AI/ML use-cases (for example: rolling feature averages, periodic aggregations, sessionization for user behavior) and explain how trigger policies and allowed lateness change emitted results and state retention.
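The three window types named in this question can be illustrated with plain window-assignment functions. This is a sketch under simplifying assumptions: timestamps are integers, windows are half-open intervals `[start, end)`, and the function names are hypothetical rather than taken from any engine's API. Triggers and allowed lateness are not modeled.

```python
from typing import Iterable, List, Tuple

Window = Tuple[int, int]  # half-open interval [start, end)


def tumbling_window(ts: int, size: int) -> Window:
    """Each timestamp belongs to exactly one fixed-size window."""
    start = ts - ts % size
    return (start, start + size)


def sliding_windows(ts: int, size: int, slide: int) -> List[Window]:
    """Each timestamp belongs to every window whose start is a multiple of
    `slide` and whose interval contains it (size / slide windows when size
    is a multiple of slide)."""
    last_start = ts - ts % slide
    starts = range(last_start, ts - size, -slide)
    return [(start, start + size) for start in sorted(starts)]


def session_windows(timestamps: Iterable[int], gap: int) -> List[Window]:
    """Group timestamps into sessions closed by `gap` units of inactivity."""
    sessions: List[Window] = []
    for ts in sorted(timestamps):
        if sessions and ts < sessions[-1][1]:
            # Still inside the current session: extend its end.
            sessions[-1] = (sessions[-1][0], ts + gap)
        else:
            sessions.append((ts, ts + gap))
    return sessions
```

For example, with `size=10` and `slide=5`, timestamp 7 falls into the windows `(0, 10)` and `(5, 15)`, which is why sliding windows multiply state and output volume compared with tumbling windows.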
Hard · Technical
132 practiced
Your streaming job maintains per-user feature state for 500M users (state size in hundreds of GBs to TBs). Explain choices for state backend (in-memory, RocksDB local, external store), checkpointing strategy (incremental, frequency), compaction, TTL, and how to rescale or migrate state with minimal downtime. Include operational trade-offs.
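A back-of-the-envelope calculation can help frame the state backend and checkpointing discussion. In the sketch below, only the 500M user count comes from the question; the per-user state size, parallelism, and upload bandwidth are hypothetical values chosen for illustration.

```python
def total_state_bytes(num_keys: int, bytes_per_key: int) -> int:
    """Total keyed-state size, ignoring backend overhead and compression."""
    return num_keys * bytes_per_key


def per_worker_bytes(total_bytes: int, parallelism: int) -> float:
    """State held per parallel task, assuming keys are evenly distributed."""
    return total_bytes / parallelism


def full_checkpoint_seconds(total_bytes: int, upload_bytes_per_sec: int) -> float:
    """Time to upload a full (non-incremental) snapshot at a given
    aggregate bandwidth."""
    return total_bytes / upload_bytes_per_sec


NUM_USERS = 500_000_000          # from the question
BYTES_PER_USER = 2_000           # assumption: about 2 KB of features per user
PARALLELISM = 200                # assumption
UPLOAD_BYTES_PER_SEC = 2 * 10**9 # assumption: 2 GB/s aggregate to remote storage

total = total_state_bytes(NUM_USERS, BYTES_PER_USER)            # 10**12 bytes (1 TB)
per_worker = per_worker_bytes(total, PARALLELISM)               # 5e9 bytes (5 GB)
full_snapshot = full_checkpoint_seconds(total, UPLOAD_BYTES_PER_SEC)  # 500.0 seconds
```

Under these assumed numbers a full snapshot takes minutes, which is the kind of estimate that motivates incremental checkpoints and a disk-backed local store over purely in-memory state.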
Medium · Technical
88 practiced
Implement an efficient deduplication strategy in Python pseudocode using Apache Beam (or equivalent) that removes duplicate events based on event_id within a 24-hour sliding window while remaining memory efficient. Describe how you would use keyed state and timers, and how you would handle scale and restarts.
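The keyed-state-and-timers pattern this question refers to can be sketched without any framework. The class below is a plain-Python simulation, not Apache Beam code: the dictionary stands in for per-key state, the heap stands in for expiry timers, and the incoming event time is used as the clock where a real engine would fire timers from the watermark. All names are hypothetical, and events are assumed to arrive roughly in order.

```python
import heapq
from typing import Dict, List, Tuple

DEDUP_TTL_SECONDS = 24 * 60 * 60  # the 24-hour retention from the question


class TtlDeduplicator:
    """Emit an event_id once, then forget it after `ttl_seconds`."""

    def __init__(self, ttl_seconds: int = DEDUP_TTL_SECONDS) -> None:
        self.ttl = ttl_seconds
        self.seen: Dict[str, int] = {}            # event_id -> expiry time
        self.timers: List[Tuple[int, str]] = []   # min-heap of (expiry, event_id)

    def _fire_timers(self, now: int) -> None:
        # Clear state for every ID whose expiry timer is due, so memory is
        # bounded by the number of distinct IDs seen within one TTL.
        while self.timers and self.timers[0][0] <= now:
            expiry, event_id = heapq.heappop(self.timers)
            if self.seen.get(event_id) == expiry:
                del self.seen[event_id]

    def process(self, event_id: str, event_time: int) -> bool:
        """Return True if the event should be emitted, False if a duplicate."""
        self._fire_timers(event_time)
        if event_id in self.seen:
            return False
        expiry = event_time + self.ttl
        self.seen[event_id] = expiry
        heapq.heappush(self.timers, (expiry, event_id))
        return True
```

In a distributed runner, the `seen` map would be partitioned by `event_id` across workers and included in checkpoints, so that a restart restores both the remembered IDs and their pending timers instead of re-emitting duplicates.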
Easy · Technical
80 practiced
Explain the lambda and kappa architectures. For an AI platform that needs both low-latency features and accurate historical aggregates for model training, which architecture would you recommend and why? Discuss trade-offs in operational complexity and reprocessing.
