InterviewStack.io

Data Pipeline Scalability and Performance Questions

Design data pipelines that meet throughput and latency targets at large scale. Topics include capacity planning, partitioning and sharding strategies, parallelism and concurrency, batching and windowing trade-offs, network and I/O bottlenecks, replication and load balancing, resource isolation, autoscaling patterns, and techniques for maintaining performance as data volume grows by orders of magnitude. Include approaches for benchmarking, backpressure management, cost versus performance trade-offs, and strategies to avoid hot spots.

Easy · Technical
61 practiced
Explain backpressure in streaming systems. Describe at least three mechanisms frameworks or architectures use (for example credit-based flow control, bounded buffers with drop policies, and upstream rate-limiting), when to apply each, and how backpressure impacts upstream latency, throughput, and system stability.
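
One of the mechanisms named above, a bounded buffer with a drop policy, can be sketched as follows. This is a minimal single-threaded illustration, not how any particular framework implements it; the class name `BoundedBuffer`, the policy names `"reject"` and `"drop_oldest"`, and the `offer`/`poll` methods are all hypothetical.

```python
from collections import deque


class BoundedBuffer:
    """Fixed-capacity buffer with a configurable overflow policy (hypothetical API)."""

    def __init__(self, capacity: int, policy: str = "reject"):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        if policy not in ("reject", "drop_oldest"):
            raise ValueError(f"unknown policy: {policy}")
        self.capacity = capacity
        self.policy = policy
        self.items = deque()
        self.dropped = 0  # number of items discarded by the drop policy

    def offer(self, item) -> bool:
        """Try to enqueue; False tells the producer to slow down or retry."""
        if len(self.items) < self.capacity:
            self.items.append(item)
            return True
        if self.policy == "drop_oldest":
            # Shed load: keep the newest data, lose the oldest.
            self.items.popleft()
            self.dropped += 1
            self.items.append(item)
            return True
        # "reject": propagate backpressure to the producer.
        return False

    def poll(self):
        """Dequeue the oldest item, or None when the buffer is empty."""
        return self.items.popleft() if self.items else None
```

With `"reject"`, a full buffer pushes the slowdown upstream, which raises upstream latency but loses nothing. With `"drop_oldest"`, the producer never stalls, at the cost of data loss that the `dropped` counter makes visible.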
Medium · Technical
29 practiced
Write a SQL query (Postgres or BigQuery) to compute user sessions from a table events(user_id INT, event_time TIMESTAMP). Define a session as consecutive events for a user where the gap between events is <= 30 minutes. The output should be (session_id, user_id, session_start, session_end, event_count). Provide the SQL and explain the window functions used and how this scales to billions of rows.
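
The session rule above can be illustrated outside SQL. The sketch below is a Python rendering of the same logic (order each user's events, start a new session whenever the gap exceeds 30 minutes), not the SQL answer itself. It assumes timestamps are epoch seconds rather than TIMESTAMP values, and the `"<user_id>-<n>"` session_id format is a hypothetical choice.

```python
from itertools import groupby
from operator import itemgetter

SESSION_GAP_SECONDS = 30 * 60  # a gap of exactly 30 minutes stays in the same session


def sessionize(events):
    """Group (user_id, event_time) pairs into sessions.

    Returns a list of (session_id, user_id, session_start, session_end, event_count).
    """
    sessions = []
    # Sorting by (user_id, event_time) plays the role of
    # PARTITION BY user_id ORDER BY event_time in a window function.
    for user_id, group in groupby(sorted(events), key=itemgetter(0)):
        index, start, end, count = 0, None, None, 0
        for _, ts in group:
            if end is not None and ts - end > SESSION_GAP_SECONDS:
                # Gap too large: close the current session and open a new one.
                sessions.append((f"{user_id}-{index}", user_id, start, end, count))
                index, start, count = index + 1, ts, 0
            if start is None:
                start = ts
            end = ts
            count += 1
        sessions.append((f"{user_id}-{index}", user_id, start, end, count))
    return sessions
```

In SQL the same steps are usually expressed with LAG to find each event's gap from the previous one, a flag marking session boundaries, and a running SUM of that flag to number the sessions.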
Easy · Technical
31 practiced
Implement a sliding-window event counter in Python. You will receive a stream of events: (user_id: int, ts: int) where ts is epoch seconds. Implement class SlidingWindowCounter(window_seconds: int) with methods add_event(user_id, ts) and query(user_id, ts) that returns the number of events for that user in the inclusive interval [ts - window_seconds + 1, ts]. Assume single-process memory and aim for amortized O(1) per operation; provide working code and explain complexity.
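
One possible shape for this class is sketched below, using a per-user deque. It relies on an assumption the question does not state: timestamps passed to `add_event` and `query` are non-decreasing per user. Under that assumption each timestamp is appended once and evicted at most once, which gives amortized O(1) per operation.

```python
from collections import defaultdict, deque


class SlidingWindowCounter:
    """Per-user event counter over the interval [ts - window_seconds + 1, ts].

    Assumes timestamps arrive in non-decreasing order for each user.
    """

    def __init__(self, window_seconds: int):
        if window_seconds <= 0:
            raise ValueError("window_seconds must be positive")
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # user_id -> timestamps, oldest first

    def _evict(self, user_id: int, ts: int) -> None:
        queue = self._events.get(user_id)  # .get avoids creating empty deques
        if queue is None:
            return
        cutoff = ts - self.window_seconds + 1  # oldest timestamp still in the window
        while queue and queue[0] < cutoff:
            queue.popleft()
        if not queue:
            del self._events[user_id]  # free memory for inactive users

    def add_event(self, user_id: int, ts: int) -> None:
        self._evict(user_id, ts)
        self._events[user_id].append(ts)

    def query(self, user_id: int, ts: int) -> int:
        self._evict(user_id, ts)
        queue = self._events.get(user_id)
        return len(queue) if queue else 0
```

Memory is O(number of events currently inside some user's window). If events can arrive out of order, the deque no longer stays sorted and a different structure (for example, per-second buckets) would be needed.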
Medium · Technical
31 practiced
Design a simple watermarking strategy in pseudocode to handle out-of-order events arriving up to 2 minutes late. Specify how you compute and emit watermarks, how you handle late events (e.g., side outputs or updates), and how watermark policy affects state eviction for windowed aggregations.
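
A bounded-out-of-orderness watermark is one common way to approach this. The sketch below is written in Python rather than pseudocode and makes several assumptions of its own: event times are epoch seconds, the aggregation is a count over 60-second tumbling windows, the watermark is the maximum event time seen minus 120 seconds, and a window is finalized once the watermark reaches its end. The class and attribute names are hypothetical.

```python
from collections import defaultdict


class WatermarkedWindowCounter:
    """Tumbling-window counter driven by a max-event-time watermark (sketch)."""

    def __init__(self, window_seconds: int = 60, max_lateness_seconds: int = 120):
        self.window_seconds = window_seconds
        self.max_lateness_seconds = max_lateness_seconds
        self.max_event_time = None
        self.open_windows = defaultdict(int)  # window_start -> event count
        self.late_events = []  # side output for events whose window already closed

    @property
    def watermark(self):
        """Largest event time seen so far, minus the allowed lateness."""
        if self.max_event_time is None:
            return float("-inf")
        return self.max_event_time - self.max_lateness_seconds

    def on_event(self, event_time: int):
        """Process one event; return the (window_start, count) pairs it finalizes."""
        window_start = event_time - event_time % self.window_seconds
        if window_start + self.window_seconds <= self.watermark:
            # The window was already emitted and its state evicted.
            self.late_events.append(event_time)
            return []
        self.open_windows[window_start] += 1
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        # State eviction: any window that ends at or before the watermark is final.
        closed = sorted(
            start for start in self.open_windows
            if start + self.window_seconds <= self.watermark
        )
        return [(start, self.open_windows.pop(start)) for start in closed]
```

A larger lateness bound keeps more windows open, so state grows and results are delayed. A smaller bound emits sooner but routes more events to the late side output.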
Hard · Technical
37 practiced
Design a consumer-group partition-assignment algorithm for a streaming framework that adapts to heterogeneous consumer resources (CPU/memory) and minimizes rebalance churn. Describe the assignment scoring function, data structures, how to handle joins/leaves of consumers incrementally, and complexity analysis.
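
One way to start on this design is a capacity-proportional quota combined with sticky retention. The sketch below illustrates only that idea and is not a complete answer: it assumes each consumer reports a single positive integer capacity (how CPU and memory are folded into that number is left open), and the function name `assign` and its parameters are hypothetical.

```python
def assign(partitions, capacities, previous=None):
    """Assign partitions in proportion to capacity, keeping prior owners where possible.

    partitions: list of partition ids
    capacities: dict consumer -> positive integer capacity units
    previous:   dict consumer -> list of partitions from the last assignment
    """
    if not capacities:
        raise ValueError("at least one consumer is required")
    previous = previous or {}
    total = sum(capacities.values())
    n = len(partitions)

    # Largest-remainder quotas: floor of the proportional share, then hand the
    # leftover partitions to the consumers with the largest remainders.
    quota, remainder = {}, {}
    for consumer, weight in capacities.items():
        quota[consumer], remainder[consumer] = divmod(n * weight, total)
    leftover = n - sum(quota.values())
    for consumer in sorted(capacities, key=lambda c: (-remainder[c], c))[:leftover]:
        quota[consumer] += 1

    # Sticky pass: each consumer keeps previously owned partitions up to its quota.
    valid, kept = set(partitions), set()
    assignment = {consumer: [] for consumer in capacities}
    for consumer in sorted(capacities):
        for p in previous.get(consumer, []):
            if p in valid and p not in kept and len(assignment[consumer]) < quota[consumer]:
                assignment[consumer].append(p)
                kept.add(p)

    # Fill pass: orphaned partitions go to consumers still below quota.
    orphans = iter(p for p in partitions if p not in kept)
    for consumer in sorted(capacities):
        while len(assignment[consumer]) < quota[consumer]:
            assignment[consumer].append(next(orphans))
    return assignment
```

For example, with six partitions and capacities `{"a": 2, "b": 1}`, consumer `a` receives four partitions and `b` two. If a consumer `c` with capacity 1 then joins and the earlier result is passed as `previous`, only one partition changes owner. The whole computation is O(P + C log C) for P partitions and C consumers.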
