InterviewStack.io

Real Time and Batch Ingestion Questions

Focuses on choosing between batch ingestion and real-time streaming for moving data from sources to storage and downstream systems. Topics include latency and throughput requirements, cost and operational complexity, consistency and delivery semantics (at-least-once and exactly-once), idempotency and deduplication strategies, schema evolution, connector and source considerations, backpressure and buffering, checkpointing and state management, and tooling choices for streaming and batch. Candidates should be able to design hybrid architectures that combine streaming for low-latency needs with batch pipelines for large backfills or heavy aggregations, and explain operational trade-offs such as monitoring, scaling, failure recovery, and debugging.
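The at-least-once semantics and deduplication ideas above can be sketched in a few lines: under at-least-once delivery the same event may arrive more than once, so the consumer makes its effect idempotent by keying on a stable event id. The `Event` class, `process_batch` function, and in-memory `seen_ids` set are illustrative assumptions, not any library's API (a real system would persist the seen-id state or use an idempotent sink).

```python
# Minimal sketch: at-least-once delivery can redeliver an event, so the
# consumer deduplicates on a stable key assigned at the source.
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    event_id: str   # stable key assigned by the producer
    payload: str

def process_batch(events, seen_ids, sink):
    """Apply each event at most once; redeliveries with a known id are skipped."""
    for ev in events:
        if ev.event_id in seen_ids:
            continue             # duplicate from a retry -> drop
        sink.append(ev.payload)  # the effect happens once per event_id
        seen_ids.add(ev.event_id)

# A retry redelivers event "a"; the sink still sees it exactly once.
seen, sink = set(), []
process_batch([Event("a", "login"), Event("b", "click")], seen, sink)
process_batch([Event("a", "login")], seen, sink)  # redelivery
print(sink)  # ['login', 'click']
```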

Hard · System Design
135 practiced
Design observability and alerting for a mission-critical ingestion stream that feeds real-time fraud detection. Define the SLOs, key observability signals (both infra and data correctness), alert priority levels, and what automated remediation (if any) you would allow.
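One piece of an answer here is mapping observability signals to alert priorities. A minimal sketch, with entirely illustrative thresholds and a hypothetical `alert_priority` function: consumer lag (freshness of the fraud signal) and data-error rate feed a simple policy that decides whether to page.

```python
# Minimal sketch: map consumer lag and error rate to alert priorities
# for a fraud-detection ingestion stream. Thresholds are illustrative.
def alert_priority(lag_seconds, error_rate):
    if lag_seconds > 300 or error_rate > 0.05:
        return "P1"   # page on-call: fraud decisions are going stale
    if lag_seconds > 60 or error_rate > 0.01:
        return "P2"   # file a ticket: investigate soon
    return "OK"

print(alert_priority(400, 0.0))    # P1: lag breaches the freshness SLO
print(alert_priority(90, 0.0))     # P2: degraded but within hard limits
print(alert_priority(5, 0.001))    # OK
```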
Medium · Technical
90 practiced
A pipeline writes raw events into S3, then Spark jobs convert them to Parquet. Storage costs are ballooning. Propose cost-optimization strategies across ingestion, storage formats, retention policies, and compute choices while retaining the ability to reprocess and meeting the SLA for recent analytics.
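A retention policy is often the cheapest lever in this scenario. A minimal sketch of the idea, using a plain dict in place of an S3 listing and illustrative path prefixes (`raw/`, `parquet/`): raw events older than a cutoff are dropped once their compacted Parquet copies exist, which preserves reprocessing ability while capping raw storage.

```python
# Minimal sketch of a retention sweep: raw JSON objects older than the
# cutoff are dropped; compacted Parquet copies are kept for reprocessing.
from datetime import date, timedelta

today = date(2024, 6, 30)
objects = {
    "raw/2024-01-01.json": date(2024, 1, 1),
    "raw/2024-06-25.json": date(2024, 6, 25),
    "parquet/2024-01-01.parquet": date(2024, 1, 1),
}
cutoff = today - timedelta(days=90)   # keep 90 days of raw events
keep = {k: d for k, d in objects.items()
        if not (k.startswith("raw/") and d < cutoff)}
print(sorted(keep))  # ['parquet/2024-01-01.parquet', 'raw/2024-06-25.json']
```

In production the same rule would usually be an S3 lifecycle policy rather than custom code, possibly transitioning to a colder storage class before deletion.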
Hard · Technical
88 practiced
You are asked to migrate a legacy nightly batch ingestion for financial transactions to a streaming-first pipeline while preserving strong auditing, traceability, and the ability to replay historical data. Outline a phased migration plan including backward compatibility, dual-writing, validation checkpoints, stakeholders to involve, and rollback criteria.
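The validation-checkpoint step of a dual-write migration can be sketched simply: both the legacy batch path and the new streaming path produce keyed results, and a reconciliation job reports keys where they disagree before cutover is allowed. The `reconcile` function and the transaction ids are hypothetical; a real job would run this comparison at scale (e.g. as a join) and over checksums or aggregates.

```python
# Minimal sketch of a dual-write validation checkpoint: compare the
# legacy batch output and the new stream output per transaction key.
def reconcile(batch_rows, stream_rows):
    """Return the keys whose values disagree between the two pipelines."""
    mismatches = {k for k in batch_rows.keys() | stream_rows.keys()
                  if batch_rows.get(k) != stream_rows.get(k)}
    return sorted(mismatches)

batch  = {"txn1": 100, "txn2": 250, "txn3": 75}
stream = {"txn1": 100, "txn2": 255}            # value drift + missing txn3
print(reconcile(batch, stream))  # ['txn2', 'txn3']
```

A sustained non-empty mismatch list is a natural rollback criterion: keep the batch path authoritative until reconciliation runs clean for an agreed window.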
Medium · Technical
102 practiced
Explain the concept of backpressure in streaming systems. Describe concrete buffering strategies and queueing mechanisms to handle producer bursts, and how a streaming engine should react to slow downstream sinks. Mention tools or primitives available in Kafka, Flink, and Spark to manage backpressure.
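The core mechanism behind backpressure can be shown with a bounded buffer: the producer blocks when the buffer is full, so a slow consumer throttles the producer instead of causing unbounded memory growth. This is a deliberately simplified stand-in (Python's `queue.Queue` with `maxsize`) for what Kafka's bounded producer buffer or Flink's bounded network buffers provide; the thread setup is illustrative.

```python
# Minimal sketch of backpressure: a bounded queue between stages makes
# the producer block when the slow consumer falls behind, bounding memory.
import queue
import threading
import time

buf = queue.Queue(maxsize=4)   # bounded buffer between the two stages
produced, consumed = [], []

def producer():
    for i in range(20):
        buf.put(i)             # blocks while the buffer is full
        produced.append(i)
    buf.put(None)              # sentinel: end of stream

def consumer():
    while True:
        item = buf.get()
        if item is None:
            break
        time.sleep(0.001)      # simulate a slow downstream sink
        consumed.append(item)

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start(); t1.join(); t2.join()
print(consumed == list(range(20)))  # True: nothing lost despite the slow sink
```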
Hard · Technical
130 practiced
Describe end-to-end design approaches to achieve exactly-once semantics when producing events to Kafka, processing them with Spark Structured Streaming, and writing results to an external sink like a relational database or S3. Explain limitations and practical workarounds where end-to-end transactions are not available.
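A common workaround when the external sink cannot join a Kafka transaction is to make the write idempotent and commit the result together with the source offset in one sink-side transaction, so replays after a failure become no-ops. A minimal sketch under that assumption, with SQLite standing in for the relational sink and the column/function names purely illustrative:

```python
# Minimal sketch: achieve exactly-once *effects* by keying results on the
# source offset and committing result + offset atomically in the sink.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE results (src_offset INTEGER PRIMARY KEY, value TEXT)")

def write_exactly_once(src_offset, value):
    """Insert the result keyed by source offset; replays are ignored."""
    with db:  # one sink transaction: result and offset land together
        db.execute(
            "INSERT OR IGNORE INTO results (src_offset, value) VALUES (?, ?)",
            (src_offset, value),
        )

# Offset 1 is replayed after a simulated failure; the sink sees it once.
for off, val in [(0, "a"), (1, "b"), (1, "b"), (2, "c")]:
    write_exactly_once(off, val)

rows = db.execute(
    "SELECT src_offset, value FROM results ORDER BY src_offset").fetchall()
print(rows)  # [(0, 'a'), (1, 'b'), (2, 'c')]
```

With S3 as the sink the analogous trick is deterministic object keys derived from the batch or offset range, so a retried write overwrites the same object rather than duplicating data.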
