Data Pipeline Scalability and Performance Questions
Design data pipelines that meet throughput and latency targets at large scale. Topics include capacity planning, partitioning and sharding strategies, parallelism and concurrency, batching and windowing trade-offs, network and I/O bottlenecks, replication and load balancing, resource isolation, autoscaling patterns, and techniques for maintaining performance as data volume grows by orders of magnitude. Also covered are approaches to benchmarking, backpressure management, cost-versus-performance trade-offs, and strategies for avoiding hot spots.
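One of the topics above, backpressure, can be sketched in a few lines: a bounded buffer between stages makes a fast producer block instead of overrunning a slow consumer. This is a minimal in-process illustration; the queue size and stage logic are placeholder assumptions, and real pipelines enforce the same idea via broker quotas or reactive-streams semantics.

```python
# Minimal backpressure sketch: a bounded queue between two pipeline
# stages. When the consumer lags, buf.put() blocks the producer
# instead of letting memory grow without bound.
import queue
import threading

buf = queue.Queue(maxsize=100)  # bounded buffer caps in-flight items

def producer(n):
    for i in range(n):
        buf.put(i)      # blocks whenever the buffer is full
    buf.put(None)       # sentinel: signal end of stream

def consumer(results):
    while True:
        item = buf.get()
        if item is None:
            break
        results.append(item * 2)  # stand-in for real processing work

results = []
t = threading.Thread(target=consumer, args=(results,))
t.start()
producer(1000)
t.join()
```

Despite the producer emitting 1,000 items against a 100-slot buffer, nothing is dropped: the blocking `put` is the backpressure signal.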
Hard · Technical
You observe excessive skew: 1% of users produce 60% of events, causing hot Kafka partitions and downstream CPU/memory contention. Propose immediate mitigation steps to reduce the impact in production, and long-term architectural changes to prevent recurrence. Consider topic redesign, key salting, tiered processing, dedicated pipelines for heavy users, and ordering requirements.
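Key salting, one of the mitigations this question names, can be sketched as follows. The hot-key set, salt bucket count, and partition count are illustrative assumptions; in practice the hot keys come from monitoring and the salting lives in a custom partitioner.

```python
# Hedged sketch of key salting: append a random salt to known hot keys
# so one heavy user's events fan out across several partitions instead
# of overloading one. HOT_KEYS and SALT_BUCKETS are assumptions here.
import hashlib
import random

NUM_PARTITIONS = 12
HOT_KEYS = {"user_42"}   # identified operationally, e.g. the top 1% of producers
SALT_BUCKETS = 4         # each hot key spreads over this many sub-keys

def partition_for(key: str) -> int:
    if key in HOT_KEYS:
        key = f"{key}#{random.randrange(SALT_BUCKETS)}"  # salted sub-key
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

# The hot key now lands on up to SALT_BUCKETS distinct partitions.
parts = {partition_for("user_42") for _ in range(1000)}
```

The trade-off the question hints at: salting breaks per-key ordering, and consumers must re-aggregate across the salted sub-keys downstream, so it only suits keys whose processing is commutative or can tolerate a merge step.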
Easy · Technical
Explain resource isolation techniques and why they matter for ML pipelines that mix CPU-heavy transformations, GPU training jobs, and I/O-heavy ingestion. Include approaches such as node pools, cgroups, Kubernetes resource limits and requests, GPU isolation, burstable nodes, and quality-of-service classes.
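The principle behind cgroups and Kubernetes limits, namely that a workload exceeding its resource cap is denied rather than allowed to starve its neighbors, can be shown at the single-process level with POSIX rlimits (Unix-only). The 256 MB cap and 1 GB allocation are arbitrary illustration values, not a recommendation.

```python
# Process-level analogue of a memory limit: cap the child's address
# space with RLIMIT_AS, then watch an oversized allocation fail fast
# with MemoryError instead of pressuring the rest of the node.
import subprocess
import sys
import textwrap

child_code = textwrap.dedent("""
    import resource
    # Cap address space at ~256 MB, then try to allocate 1 GB.
    resource.setrlimit(resource.RLIMIT_AS, (256 * 2**20, 256 * 2**20))
    try:
        buf = bytearray(2**30)
        print("allocated")
    except MemoryError:
        print("denied")
""")

out = subprocess.run(
    [sys.executable, "-c", child_code],
    capture_output=True, text=True,
).stdout.strip()
```

Cgroups (and therefore Kubernetes limits) enforce the same contract kernel-wide and add CPU shares, I/O throttling, and OOM-kill semantics on top; a strong answer maps each mixed workload class to its own enforcement mechanism.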
Medium · Technical
Design a streaming join between click events and user profile updates where click events can arrive late and profile updates can be retracted (deleted or corrected). Describe how you'd handle event-time semantics, watermarks, allowed lateness, retractions, and how to ensure the joined features remain correct for both online and offline training pipelines.
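The moving parts this question asks about (watermarks, allowed lateness, retractions) fit in a toy in-memory join. Everything here is an illustrative assumption: the slack and lateness values, the single-latest-profile model, and the signed-output retraction scheme; engines like Flink or Beam manage this state and the event-time clock for you.

```python
# Toy event-time click/profile join with a watermark, allowed lateness,
# and retractions. Output rows carry a sign: +1 = emit, -1 = retract.
from collections import defaultdict

class StreamingJoin:
    def __init__(self, out_of_orderness=5, allowed_lateness=10):
        self.watermark = float("-inf")
        self.slack = out_of_orderness
        self.lateness = allowed_lateness
        self.profiles = {}                  # user -> latest profile
        self.pending = defaultdict(list)    # user -> clicks awaiting a profile
        self.output = []                    # (sign, user, click, profile)

    def _advance(self, ts):
        # Watermark = max event time seen, minus an out-of-orderness bound.
        self.watermark = max(self.watermark, ts - self.slack)

    def on_click(self, ts, user, click):
        self._advance(ts)
        if ts < self.watermark - self.lateness:
            return "dropped_late"           # beyond allowed lateness
        if user in self.profiles:
            self.output.append((+1, user, click, self.profiles[user]))
        else:
            self.pending[user].append(click)
        return "ok"

    def on_profile(self, ts, user, profile):
        self._advance(ts)
        self.profiles[user] = profile
        for click in self.pending.pop(user, []):
            self.output.append((+1, user, click, profile))

    def on_retract(self, user):
        # Profile deleted/corrected upstream: retract joins built on it,
        # so downstream feature stores can undo the stale rows.
        old = self.profiles.pop(user, None)
        if old is not None:
            for sign, u, click, prof in list(self.output):
                if sign == +1 and u == user and prof == old:
                    self.output.append((-1, u, click, prof))

j = StreamingJoin()
j.on_profile(100, "u1", {"tier": "gold"})
status_ok = j.on_click(103, "u1", "c1")        # joins immediately
j.on_click(200, "u2", "cX")                    # advances watermark to 195
status_late = j.on_click(180, "u1", "c2")      # 180 < 195 - 10: dropped
j.on_retract("u1")                             # retract the u1 join
```

The retraction log is what keeps online serving and offline training consistent: both can replay the signed stream to the same net state.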
Hard · System Design
Design a cross-region streaming replication solution that enables low-latency regional reads and supports global fault tolerance. Requirements: replication lag typically <5s, support regional read locality, and tolerate regional failures. Discuss active-passive vs active-active topologies, conflict resolution, metadata propagation, and tools such as MirrorMaker, Confluent Replicator, or custom replication.
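For the active-active branch of this question, the simplest conflict-resolution policy is last-writer-wins (LWW). A minimal sketch, assuming records are tagged `(value, timestamp, region)`, with the region name breaking timestamp ties deterministically:

```python
# Last-writer-wins merge for active-active replicas. The record shape
# (value, ts, region) and the region names are illustrative assumptions.

def lww_merge(local, remote):
    """Merge two replica states; the higher (ts, region) pair wins."""
    merged = dict(local)
    for key, (value, ts, region) in remote.items():
        if key not in merged or (ts, region) > (merged[key][1], merged[key][2]):
            merged[key] = (value, ts, region)
    return merged

us_east = {"user:1": ("alice@old.com", 100, "us-east")}
eu_west = {"user:1": ("alice@new.com", 105, "eu-west")}
state = lww_merge(us_east, eu_west)
```

Because both replicas converge to the same winner regardless of merge order, reads stay region-local while remaining eventually consistent. A strong answer also names the costs: LWW silently drops the losing write and depends on clock synchronization, which is why CRDTs or application-level merge functions are preferred for non-idempotent data.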
Medium · System Design
Design an offline ETL pipeline that ingests 5 TB/day of raw events into a training dataset. Requirements: support daily reprocessing, deterministic outputs for reproducibility, deduplication, strong data quality checks, and efficient storage for analytical queries. Describe components, orchestration, file formats, partitioning scheme, and strategies for handling retries and backfills.
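Two of this question's requirements, deduplication and deterministic outputs, can be sketched together: sort the batch on a stable key before deduplicating and partitioning, so a backfill over the same input yields byte-identical files. Field names and the dedup rule (keep the latest record per `event_id` by `ingest_ts`) are assumptions for illustration.

```python
# Deterministic dedup + date partitioning for a daily batch. Sorting on
# stable keys first makes reruns reproducible regardless of input order.
from itertools import groupby

def dedupe_and_partition(events):
    # Stable total order: output no longer depends on arrival order.
    events = sorted(events, key=lambda e: (e["event_id"], e["ingest_ts"]))
    # Keep the last (highest ingest_ts) record per event_id.
    deduped = [list(g)[-1] for _, g in groupby(events, key=lambda e: e["event_id"])]
    # Group by event date, mirroring a dt=YYYY-MM-DD partition layout.
    partitions = {}
    for e in sorted(deduped, key=lambda e: (e["date"], e["event_id"])):
        partitions.setdefault(e["date"], []).append(e)
    return partitions

batch = [
    {"event_id": "a", "ingest_ts": 1, "date": "2024-01-01", "v": 1},
    {"event_id": "a", "ingest_ts": 2, "date": "2024-01-01", "v": 2},  # duplicate, newer
    {"event_id": "b", "ingest_ts": 1, "date": "2024-01-02", "v": 3},
]
out = dedupe_and_partition(batch)
```

The same property is what makes daily reprocessing safe: rerunning a partition overwrites it with identical content, so retries and backfills are idempotent.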