InterviewStack.io

Batch and Stream Processing Questions

Covers design and implementation of data processing using batch, stream, or hybrid approaches. Candidates should be able to explain when to choose batch versus streaming based on latency, throughput, cost, data volume, and business requirements, and compare architectural patterns such as lambda and kappa. Core stream concepts include event time versus processing time; windowing strategies such as tumbling, sliding, and session windows; watermarks and late arrivals; event ordering and out-of-order data handling; stateful versus stateless processing; state management and checkpointing; and delivery semantics including exactly-once and at-least-once. Also includes knowledge of streaming and batch engines and runtimes, connector patterns for sources and sinks, partitioning and scaling strategies, backpressure and flow control, idempotency and deduplication techniques, testing and replayability, monitoring and alerting, and integration with storage layers such as data lakes and data warehouses. Interview focus is on reasoning about correctness, latency, cost, and operational complexity, and on concrete architecture and tooling choices.

Medium · Technical
88 practiced
Implement an efficient deduplication strategy in Python pseudocode using Apache Beam (or equivalent) that removes duplicate events based on event_id within a 24-hour sliding window while remaining memory efficient. Describe how you would use keyed state and timers, and how you would handle scale and restarts.
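By way of illustration, here is a minimal pure-Python sketch of the keyed-state idea behind this question. The `KeyedDeduper` class and its sweep loop are assumptions standing in for a real runner's per-key state and watermark timers (in Beam's Python SDK, roughly `ReadModifyWriteStateSpec` plus a `TimerSpec` that clears state when the 24-hour horizon passes); a production version would shard state by key across workers and rely on checkpointed state to survive restarts.

```python
from collections import OrderedDict

WINDOW_SECONDS = 24 * 60 * 60  # 24-hour dedup horizon

class KeyedDeduper:
    """Sketch of keyed-state deduplication with timer-style expiry.

    Assumes event times are roughly monotonic per worker, so insertion
    order in the OrderedDict approximates first-seen order; a stream
    runner would instead fire one expiry timer per key at the watermark.
    """

    def __init__(self, window=WINDOW_SECONDS):
        self.window = window
        self.seen = OrderedDict()  # event_id -> first-seen event time

    def process(self, event_id, event_time):
        # Expire state older than the window (the "timer" firing):
        # this bounds memory to ids seen within the last `window` seconds.
        while self.seen:
            oldest_id, ts = next(iter(self.seen.items()))
            if event_time - ts > self.window:
                self.seen.pop(oldest_id)
            else:
                break
        if event_id in self.seen:
            return None  # duplicate within the window: drop
        self.seen[event_id] = event_time
        return event_id  # first occurrence: emit
```

The key design point the interviewer is probing: state must be keyed by `event_id` (so lookups are O(1) per element) and must expire, otherwise memory grows without bound over a 24-hour high-volume stream.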
Hard · Technical
90 practiced
Design a robust sessionization pipeline where sessions are defined by 30 minutes of inactivity. Events can arrive out-of-order and be delayed by several hours. Describe watermark strategy, allowed lateness, state retention, approach to emitting provisional results and final corrections, and how to notify downstream consumers of late corrections.
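As a starting point for the batch-equivalent logic, the gap-based session assignment can be sketched as below. The function name and shape are illustrative, not a reference answer: it assumes events have already been buffered and sorted up to the watermark, which is exactly what a streaming version cannot assume, hence the question's emphasis on allowed lateness, provisional emits, and downstream correction notices.

```python
SESSION_GAP = 30 * 60  # 30 minutes of inactivity closes a session

def sessionize(events):
    """Assign per-user session numbers to (user_id, event_ts) pairs.

    Sketch only: assumes all events up to the watermark are available
    and sortable. A streaming pipeline would hold per-key session state,
    extend allowed lateness for multi-hour delays, emit provisional
    sessions on watermark advance, and retract/re-emit when a late
    event merges or splits previously emitted sessions.
    """
    last_ts = {}     # user -> timestamp of most recent event
    session_no = {}  # user -> current session number
    out = []
    for user, ts in sorted(events, key=lambda e: (e[0], e[1])):
        if user not in last_ts or ts - last_ts[user] > SESSION_GAP:
            session_no[user] = session_no.get(user, -1) + 1
        last_ts[user] = ts
        out.append((user, ts, session_no[user]))
    return out
```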
Medium · Technical
64 practiced
Compare Spark Structured Streaming (micro-batch) and Flink (true streaming) with respect to event-time semantics, latency, state management, and their suitability for low-latency AI feature serving. Provide scenarios where each framework would be preferable.
Easy · Technical
78 practiced
What strategies would you use to test and enable replayability for streaming pipelines? Cover unit testing of transforms, integration testing with simulated event-time, end-to-end replay from raw event store for reprocessing training data, and validation of recomputed outputs.
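One concrete pattern worth having ready for this question: keep transforms pure so the same function can be unit-tested in isolation and replayed deterministically over the raw event store. The `enrich` and `replay` names below are hypothetical, a minimal sketch of that separation rather than any particular framework's API.

```python
def enrich(event):
    """A pure, deterministic transform: no wall-clock reads, no
    external I/O, so unit tests and replays give identical results."""
    return {**event, "value_doubled": event["value"] * 2}

def replay(raw_events, transform):
    """Reprocess the raw event store through a (possibly patched)
    transform. The recomputed outputs can be diffed against the
    previously materialized outputs to validate a reprocessing run,
    e.g. before regenerating training data."""
    return [transform(e) for e in raw_events]
```

Integration tests with simulated event time then reduce to feeding a fixture event log (with contrived out-of-order timestamps) through the same code path and asserting on windowed outputs.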
Hard · Technical
92 practiced
Compare cloud-managed streaming services (for example: Pub/Sub+Dataflow, Kinesis+Kinesis Data Analytics, Managed Kafka+Managed Flink) versus self-managing Kafka + Flink on VMs. Evaluate in terms of operational complexity, ability to customize for performance (including GPU support), cost, and long-term maintainability for an AI platform.