InterviewStack.io

Data Pipeline Scalability and Performance Questions

Design data pipelines that meet throughput and latency targets at large scale. Topics include capacity planning, partitioning and sharding strategies, parallelism and concurrency, batching and windowing trade-offs, network and I/O bottlenecks, replication and load balancing, resource isolation, autoscaling patterns, and techniques for maintaining performance as data volume grows by orders of magnitude. Also covered are approaches to benchmarking, backpressure management, cost versus performance trade-offs, and strategies for avoiding hot spots.

Easy · Technical
28 practiced
Implement a small Python function that computes how many single-threaded worker processes are required to meet a desired throughput. Inputs: processing_time_ms (average time to process one record in ms), desired_throughput_rps, safety_factor (e.g., 1.2). Assume each worker processes records serially and there is no I/O wait. Provide sample input: processing_time_ms=5, desired_throughput_rps=10000, safety_factor=1.2 and show expected output.
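
One possible answer, as a minimal sketch: a single worker that handles records serially sustains 1000 / processing_time_ms records per second, so the worker count is the padded target throughput divided by that rate, rounded up. The function name `required_workers` and the input validation are assumptions added for illustration.

```python
import math


def required_workers(processing_time_ms: float,
                     desired_throughput_rps: float,
                     safety_factor: float = 1.0) -> int:
    """Return the number of serial, single-threaded workers needed.

    Assumes each worker processes one record at a time with no I/O wait,
    so one worker sustains 1000 / processing_time_ms records per second.
    """
    if processing_time_ms <= 0:
        raise ValueError("processing_time_ms must be positive")
    if desired_throughput_rps < 0:
        raise ValueError("desired_throughput_rps must be non-negative")
    if safety_factor <= 0:
        raise ValueError("safety_factor must be positive")

    # Worker-seconds of processing required per second of wall-clock time.
    raw = desired_throughput_rps * safety_factor * processing_time_ms / 1000.0
    # Round away floating-point noise before taking the ceiling.
    return math.ceil(round(raw, 9))


if __name__ == "__main__":
    # Sample input from the question: 5 ms per record, 10,000 rps, 1.2x margin.
    print(required_workers(5, 10000, 1.2))  # 60
```

With the sample input, one worker handles 200 records per second, the padded target is 12,000 records per second, and 12,000 / 200 = 60 workers.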
Medium · Technical
40 practiced
Implement a Python retry decorator/function that applies exponential backoff with configurable jitter to retry transient errors when sending batches to an upstream service. Parameters: max_retries, base_delay_seconds, max_delay_seconds, jitter_mode ('full' or 'equal'). Provide sample code usage and briefly explain why jitter is important.
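
A hedged sketch of one way to answer this. The exception type `TransientError`, the helper `backoff_delay`, the `retry_on` and `sleep` parameters, and the simulated `send_batch` are all hypothetical names introduced here. "Full" jitter is taken to mean a uniform delay between 0 and the capped exponential delay, and "equal" jitter to mean half the capped delay plus a uniform value up to the other half.

```python
import functools
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def backoff_delay(attempt, base_delay_seconds, max_delay_seconds, jitter_mode):
    """Delay before the retry that follows failed attempt number `attempt` (0-based)."""
    capped = min(max_delay_seconds, base_delay_seconds * (2 ** attempt))
    if jitter_mode == "full":
        return random.uniform(0, capped)
    if jitter_mode == "equal":
        return capped / 2 + random.uniform(0, capped / 2)
    raise ValueError("jitter_mode must be 'full' or 'equal'")


def retry_with_backoff(max_retries, base_delay_seconds, max_delay_seconds,
                       jitter_mode="full", retry_on=(TransientError,),
                       sleep=time.sleep):
    """Decorator that retries the wrapped call on the exceptions in `retry_on`."""
    if jitter_mode not in ("full", "equal"):
        raise ValueError("jitter_mode must be 'full' or 'equal'")

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except retry_on:
                    if attempt == max_retries:
                        raise  # retries exhausted: surface the last error
                    sleep(backoff_delay(attempt, base_delay_seconds,
                                        max_delay_seconds, jitter_mode))
        return wrapper
    return decorator


_attempts = {"count": 0}


@retry_with_backoff(max_retries=4, base_delay_seconds=0.01,
                    max_delay_seconds=0.1, jitter_mode="full")
def send_batch(batch):
    """Hypothetical sender that fails twice, then succeeds."""
    _attempts["count"] += 1
    if _attempts["count"] <= 2:
        raise TransientError("upstream unavailable")
    return len(batch)


if __name__ == "__main__":
    print(send_batch([1, 2, 3]))
```

Jitter matters because clients that fail at the same moment would otherwise retry at the same moments too, hitting the recovering service with synchronized bursts. Randomizing each delay spreads those retries over time.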
Hard · Technical
30 practiced
Data volume is forecast to grow 100x within 12 months. Create a capacity planning and technical roadmap covering immediate scaling steps, medium-term architecture changes (partitioning, compaction, tiering), cost forecasting, KPIs that trigger refactor events, and a risk-based migration timeline to avoid emergencies.
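
As a starting point for the forecasting part, the sketch below turns the 100x-in-12-months figure into a monthly growth factor and estimates how long current capacity will last. It assumes smooth exponential growth, which a real forecast may not follow. The function names and the `headroom` parameter (the fraction of capacity treated as the trigger point) are hypothetical.

```python
import math


def monthly_growth_factor(total_growth: float, months: int) -> float:
    """Constant month-over-month multiplier that compounds to total_growth."""
    return total_growth ** (1 / months)


def months_until_exceeded(current_load: float, capacity: float,
                          total_growth: float = 100.0, months: int = 12,
                          headroom: float = 0.7) -> float:
    """Months until load reaches headroom * capacity under exponential growth."""
    threshold = capacity * headroom
    if current_load >= threshold:
        return 0.0
    rate = math.log(total_growth) / months  # continuous monthly growth rate
    return math.log(threshold / current_load) / rate


if __name__ == "__main__":
    print(round(monthly_growth_factor(100, 12), 3))       # about 1.468 per month
    print(round(months_until_exceeded(1.0, 10.0, headroom=1.0), 2))  # 6.0
```

Under this assumption, volume grows by roughly 47% each month and doubles about every 1.8 months, so a system with 10x spare capacity today runs out about halfway through the year. That kind of runway figure is one candidate KPI for triggering the next refactor.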
Easy · Technical
32 practiced
What is backpressure in streaming systems, and why is it important? Give two concrete mechanisms used by different frameworks (for example, Kafka consumer lag throttling, Akka Streams reactive streams, Flink's network buffers) to propagate or enforce backpressure, and explain how backpressure prevents cascading failures.
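
The core idea can be illustrated with a bounded buffer between a fast producer and a slow consumer. This is a generic sketch using Python's standard `queue.Queue`, not the mechanism of any framework named in the question. The function `run_pipeline` and its parameters are hypothetical.

```python
import queue
import threading
import time


def run_pipeline(num_records, buffer_size, consume_delay_s=0.001):
    """Run a fast producer against a slow consumer through a bounded buffer.

    Returns the consumed records and the deepest buffer level the producer saw.
    """
    buffer = queue.Queue(maxsize=buffer_size)
    sentinel = object()
    consumed = []
    max_depth = 0

    def producer():
        nonlocal max_depth
        for record in range(num_records):
            # put() blocks while the buffer is full, so the producer is
            # slowed to the consumer's pace instead of piling up work.
            buffer.put(record)
            max_depth = max(max_depth, buffer.qsize())
        buffer.put(sentinel)

    def consumer():
        while True:
            record = buffer.get()
            if record is sentinel:
                break
            time.sleep(consume_delay_s)  # simulate slow downstream work
            consumed.append(record)

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return consumed, max_depth


if __name__ == "__main__":
    records, depth = run_pipeline(num_records=50, buffer_size=4)
    print(len(records), depth)
```

Because the buffer never holds more than `buffer_size` records, memory use stays bounded no matter how far the producer could otherwise run ahead. With an unbounded buffer, a slow stage lets work pile up until memory or latency limits are hit, and the failure spreads to neighboring stages.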
Easy · Technical
33 practiced
You operate a streaming cluster that sees predictable traffic spikes every morning between 08:00-09:00. Describe a simple autoscaling strategy (scheduled, reactive, or hybrid) to handle these spikes that balances cost and performance. Specify which metrics you would scale on, cooldown periods, and any pre-warming or schedule-based optimizations.
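
A hybrid policy could be sketched as a pure decision function: a scheduled replica floor covers the known spike (starting early to pre-warm), a reactive term follows consumer lag, and a cooldown delays scale-down. Every number and name below (`Policy`, `desired_replicas`, the 07:45 to 09:15 window, the lag target, the 10-minute cooldown) is an illustrative assumption, not a recommendation for any particular platform.

```python
import math
from dataclasses import dataclass
from datetime import datetime, time, timedelta


@dataclass
class Policy:
    min_replicas: int = 2
    max_replicas: int = 20
    scheduled_replicas: int = 10          # floor during the morning window
    window_start: time = time(7, 45)      # pre-warm 15 minutes before 08:00
    window_end: time = time(9, 15)        # hold briefly after 09:00
    target_lag_per_replica: int = 10_000  # records of lag one replica may carry
    scale_down_cooldown: timedelta = timedelta(minutes=10)


def desired_replicas(now: datetime, current: int, consumer_lag: int,
                     last_scale_time: datetime, policy: Policy) -> int:
    """Return the replica count the cluster should move to."""
    # Reactive part: enough replicas to keep per-replica lag at the target.
    reactive = math.ceil(consumer_lag / policy.target_lag_per_replica)

    # Scheduled part: a higher floor inside the known spike window.
    in_window = policy.window_start <= now.time() < policy.window_end
    floor = policy.scheduled_replicas if in_window else policy.min_replicas

    target = max(floor, reactive)
    target = min(max(target, policy.min_replicas), policy.max_replicas)

    # Scale up immediately, but wait out the cooldown before scaling down.
    if target < current and now - last_scale_time < policy.scale_down_cooldown:
        return current
    return target


if __name__ == "__main__":
    policy = Policy()
    long_ago = datetime(2024, 1, 1, 0, 0)
    print(desired_replicas(datetime(2024, 1, 1, 7, 50), 2, 0, long_ago, policy))  # 10
```

Scaling up without delay while damping scale-down keeps latency protected during the spike and avoids flapping afterwards. The scheduled floor means the reactive path only has to absorb the part of the spike that exceeds the forecast.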
