Covers designing and operating machine learning systems to handle growth in data volume, model complexity, and traffic. Topics include distributed training strategies such as data parallelism, model parallelism, and pipeline parallelism; coordination and orchestration approaches like parameter servers, gradient aggregation, and framework tooling such as PyTorch Distributed, Horovod, and TensorFlow distribution strategies; data pipeline and I/O considerations including sharding, efficient file formats, preprocessing bottlenecks, and streaming versus batch ingestion; and serving and inference scaling including model sharding, batching for throughput, autoscaling, request routing, caching, and latency versus throughput tradeoffs. Also includes monitoring, profiling, checkpointing and recovery, reproducibility, cost and resource optimization, and common bottleneck analysis across network, storage, CPU preprocessing, and accelerator utilization.
Medium · System Design
37 practiced
Propose a model sharding strategy to train a transformer with 300B parameters across hundreds of GPUs. Discuss tensor parallelism, pipeline parallelism, ZeRO optimizer partitioning stages, activation checkpointing, micro-batching, and the communication patterns needed to keep devices utilized.
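One quantitative handle on the micro-batching part of this question is the pipeline "bubble": with a GPipe-style schedule, p pipeline stages and m micro-batches leave each stage idle for a fraction (p − 1)/(m + p − 1) of the step. A minimal sketch (the function name is illustrative, not from any framework):

```python
def pipeline_bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Idle ("bubble") fraction of a GPipe-style pipeline schedule.

    With p stages and m micro-batches, each stage is busy for m slots
    out of m + p - 1 total slots, so the idle fraction is
    (p - 1) / (m + p - 1). Raising m shrinks the bubble but also
    shrinks per-micro-batch size, stressing kernel efficiency.
    """
    p, m = num_stages, num_microbatches
    return (p - 1) / (m + p - 1)

# With 8 pipeline stages and 64 micro-batches the bubble is about 9.9%.
```

This is why answers to this question typically pair deep pipelines with many micro-batches, then layer tensor parallelism within a stage and ZeRO partitioning across data-parallel replicas.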
Medium · Technical
48 practiced
Compare Horovod and PyTorch DistributedDataParallel (DDP) for large-scale multi-node training. Discuss communication patterns, ease of integration with existing training code, fault tolerance and elastic training support, performance characteristics for small versus large models, and how each integrates with job schedulers.
Easy · Technical
33 practiced
Explain trade-offs when choosing checkpoint frequency for large-scale training jobs. Discuss storage cost, recovery time after preemption, wasted compute when resuming, contention on shared object stores, and how to combine incremental checkpoints with full checkpoints. Recommend a practical policy for 24-hour jobs running on preemptible instances.
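A common starting point for the recommended policy is the Young/Daly approximation, which balances checkpoint overhead against expected recompute after a failure: the interval is about sqrt(2 · C · MTBF), where C is the cost of writing one checkpoint. A minimal sketch (names are illustrative):

```python
import math

def optimal_checkpoint_interval_s(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young/Daly approximation for the checkpoint interval.

    checkpoint_cost_s: wall-clock seconds to write one checkpoint.
    mtbf_s: mean time between failures/preemptions, in seconds.
    Returns the interval (seconds) that roughly minimizes total
    overhead = checkpoint time + expected recompute after a failure.
    """
    return math.sqrt(2 * checkpoint_cost_s * mtbf_s)
```

With a 120 s checkpoint and a 4-hour MTBF this gives an interval of about 31 minutes; on preemptible instances with shorter MTBF the formula pushes checkpoints more frequently, which is where incremental checkpoints layered over periodic full ones help control storage cost and object-store contention.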
Easy · Technical
30 practiced
Describe the role of a parameter server in distributed ML training. Explain synchronous versus asynchronous updates, staleness issues, gradient aggregation, model sharding across parameter servers, and how this design compares to collective all-reduce approaches in terms of consistency and scalability.
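The synchronous-update case in this question can be sketched as a toy in-memory parameter server: workers push gradients, the server buffers them, and it applies an averaged SGD step only once every worker has reported. This is a teaching sketch (class and method names are invented for illustration), not a production design:

```python
from typing import Dict, List

class ToyParameterServer:
    """Minimal synchronous parameter server for a single shard.

    Buffers one gradient per worker per key; once all workers have
    pushed, it applies an averaged SGD step. A real deployment would
    shard keys across many servers and add staleness bounds for the
    asynchronous variant.
    """

    def __init__(self, params: Dict[str, float], num_workers: int, lr: float = 0.1):
        self.params = dict(params)
        self.num_workers = num_workers
        self.lr = lr
        self._pending: Dict[str, List[float]] = {k: [] for k in params}

    def push(self, worker_grads: Dict[str, float]) -> None:
        # Buffer this worker's gradients; update once all workers arrive.
        for k, g in worker_grads.items():
            self._pending[k].append(g)
        if all(len(v) == self.num_workers for v in self._pending.values()):
            for k, grads in self._pending.items():
                self.params[k] -= self.lr * sum(grads) / self.num_workers
                self._pending[k] = []

    def pull(self) -> Dict[str, float]:
        # Workers fetch the current parameters before computing gradients.
        return dict(self.params)
```

The asynchronous variant would apply each push immediately, trading the barrier for gradient staleness, which is the consistency-versus-scalability contrast with all-reduce that the question asks for.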
Hard · Technical
29 practiced
A training pipeline occasionally produces subtly corrupted models due to rare bit-flips in storage or network. Describe detection and mitigation strategies including end-to-end checksums, storage tier choices, model self-validation tests before promotion, defensive training-data validation, and an incident response plan to prevent silent corruption reaching production.
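The end-to-end checksum piece of this question can be sketched with the standard library: record a SHA-256 digest when the checkpoint is written, then refuse to promote any artifact whose recomputed digest disagrees. A minimal sketch using Python's `hashlib` (function names are illustrative):

```python
import hashlib

def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large checkpoints are never
    loaded into memory whole."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_checkpoint(path: str, expected_digest: str) -> bool:
    """Compare against the digest recorded at write time; a mismatch
    means the bytes changed in transit or at rest, so the model must
    not be promoted."""
    return sha256_of_file(path) == expected_digest
```

In a full answer this gate sits alongside model self-validation (loading the checkpoint and scoring a golden evaluation set) so that both byte-level and semantic corruption are caught before promotion.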