Covers designing and operating machine learning systems to handle growth in data volume, model complexity, and traffic. Topics include: distributed training strategies such as data parallelism, model parallelism, and pipeline parallelism; coordination and orchestration approaches such as parameter servers, gradient aggregation, and framework tooling like PyTorch Distributed, Horovod, and TensorFlow distribution strategies; data pipeline and I/O considerations, including sharding, efficient file formats, preprocessing bottlenecks, and streaming and batch ingestion; and serving and inference scaling, including model sharding, batching for throughput, autoscaling, request routing, caching, and latency-versus-throughput tradeoffs. Also covered: monitoring, profiling, checkpointing and recovery, reproducibility, cost and resource optimization, and bottleneck analysis across network, storage, CPU preprocessing, and accelerator utilization.
Hard · System Design
Design monitoring, SLOs, and alerting for an ML platform that serves both training and inference workloads. Define 5 SLOs (with targets) for training pipelines and 5 SLOs for inference serving, explain the metrics to measure them, and describe escalation/runbook steps when an SLO is breached.
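A strong answer usually grounds alerting in error budgets rather than raw failure counts. The following is a minimal, illustrative sketch (function name and thresholds are assumptions, not a prescribed design) of computing an availability SLO's error-budget consumption and burn rate, which is what multi-window burn-rate alerts page on:

```python
# Hypothetical sketch: observed availability, error-budget consumption, and
# burn rate from raw request counts. Names and thresholds are illustrative.

def slo_status(total_requests: int, failed_requests: int,
               slo_target: float = 0.999, window_fraction: float = 1.0):
    """Return (availability, fraction of error budget consumed, burn rate).

    window_fraction: fraction of the SLO period this window covers
    (e.g. 1 day of a 30-day SLO period is 1/30).
    """
    availability = 1 - failed_requests / total_requests
    error_budget = 1 - slo_target                     # allowed failure ratio
    consumed = (failed_requests / total_requests) / error_budget
    burn_rate = consumed / window_fraction            # >1 means burning too fast
    return availability, consumed, burn_rate

avail, consumed, burn = slo_status(1_000_000, 500, slo_target=0.999,
                                   window_fraction=1 / 30)
# Half the 30-day budget consumed in one day: burn rate 15 -> page on-call.
```

A runbook step would then key escalation severity off the burn rate (e.g. page for fast burn, ticket for slow burn) rather than any single failed-request threshold.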
Hard · System Design
You must serve an ensemble of 5 large models for each request under a 100ms tail-latency requirement. Propose an inference architecture (parallel vs sequential execution, caching, result aggregation), estimate resource needs, and describe fallback strategies if one model is slow or fails.
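The usual shape of an answer is parallel fan-out with a shared deadline, substituting a fallback (cached or default prediction) for any model that misses it. A hedged sketch, with thread-based stubs standing in for real inference calls (all names are illustrative):

```python
# Fan out to all ensemble members concurrently under one deadline; any model
# that misses the deadline contributes a fallback value instead, bounding
# tail latency at roughly the deadline rather than the slowest model.
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutTimeout
import time

def make_model(name, latency_s, output):
    def predict(x):
        time.sleep(latency_s)   # stand-in for a real inference call
        return output
    predict.__name__ = name
    return predict

def ensemble_predict(models, x, deadline_s=0.1, fallback=0.0):
    """Average all model outputs, using `fallback` for late models."""
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        futures = [pool.submit(m, x) for m in models]
        start = time.monotonic()
        results = []
        for f in futures:
            remaining = deadline_s - (time.monotonic() - start)
            try:
                results.append(f.result(timeout=max(remaining, 0)))
            except FutTimeout:
                results.append(fallback)   # degrade gracefully
    return sum(results) / len(results)

models = [make_model("m_fast", 0.01, 1.0), make_model("m_slow", 0.5, 1.0)]
score = ensemble_predict(models, x=None, deadline_s=0.1, fallback=0.0)
# Slow model times out -> (1.0 + 0.0) / 2 = 0.5
```

In a real system the executor would be replaced by async RPCs to separate model replicas, and cancellation of late requests matters for resource reclamation; the deadline arithmetic and fallback substitution carry over unchanged.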
Medium · System Design
Design a metadata and lineage tracking system for datasets and model artifacts that integrates with CI/CD. Describe the core data model (datasets, versions, transformations, model artifacts), APIs for registering artifacts, and how you enable traceability from serving predictions back to the raw data and code version.
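The core data model tends to reduce to versioned artifacts linked by transformations that record a code reference. A minimal sketch under assumed names (this is not an existing API; real systems such as ML metadata stores are far richer):

```python
# Illustrative lineage model: immutable artifact versions, transformations
# linking inputs to an output plus the producing code's commit, and a store
# that can walk from any artifact back to its upstream artifacts and code.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class ArtifactVersion:
    name: str      # e.g. "events_clean" or "ranker_model"
    version: str   # content hash or semantic version
    kind: str      # "dataset" | "model"

@dataclass
class Transformation:
    code_ref: str                              # git commit SHA of producing code
    inputs: list = field(default_factory=list) # list[ArtifactVersion]
    output: ArtifactVersion = None

class LineageStore:
    def __init__(self):
        self._produced_by = {}                 # ArtifactVersion -> Transformation

    def register(self, t: Transformation):
        self._produced_by[t.output] = t

    def trace(self, artifact: ArtifactVersion):
        """Return [(artifact, code_ref), ...] walking upstream from `artifact`."""
        out, stack = [], [artifact]
        while stack:
            a = stack.pop()
            t = self._produced_by.get(a)
            if t:
                out.append((a, t.code_ref))
                stack.extend(t.inputs)
        return out

store = LineageStore()
raw = ArtifactVersion("raw_events", "v1", "dataset")
clean = ArtifactVersion("events_clean", "v2", "dataset")
model = ArtifactVersion("ranker", "v7", "model")
store.register(Transformation("abc123", [raw], clean))
store.register(Transformation("def456", [clean], model))
lineage = store.trace(model)   # ranker -> events_clean -> raw_events
```

Serving-side traceability then only requires logging the `(name, version)` pair of the served model with each prediction; everything upstream is recoverable from the store.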
Medium · Technical
Write a PySpark transformation that computes per-user daily aggregates from an events dataset while minimizing shuffle. Assume events (partitioned by event_date) with columns (user_id, event_type, value). Show how you would use partitioning and map-side combiners to reduce shuffle volume.
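In PySpark itself, `events.groupBy("user_id", "event_date").agg(F.sum("value"))` already performs partial hash aggregation on each partition before the shuffle. The framework-agnostic sketch below (plain Python, no Spark dependency) illustrates that map-side combine idea: pre-aggregating within each partition so only one record per key leaves it.

```python
# Map-side combining: aggregate locally within each partition before the
# shuffle, so at most one record per (user_id, event_date) key crosses the
# network per partition. Spark's groupBy/agg applies this automatically.
from collections import defaultdict

def map_side_combine(partition):
    """Combine (user_id, event_date, value) rows locally."""
    acc = defaultdict(float)
    for user_id, event_date, value in partition:
        acc[(user_id, event_date)] += value
    return list(acc.items())          # the records actually shuffled

def reduce_side_merge(shuffled_partitions):
    """Merge per-partition partial sums into final per-key totals."""
    final = defaultdict(float)
    for combined in shuffled_partitions:
        for key, partial_sum in combined:
            final[key] += partial_sum
    return dict(final)

partitions = [
    [("u1", "2024-01-01", 1.0), ("u1", "2024-01-01", 2.0),
     ("u2", "2024-01-01", 5.0)],
    [("u1", "2024-01-01", 4.0)],
]
combined = [map_side_combine(p) for p in partitions]
shuffle_records = sum(len(c) for c in combined)   # 3 records instead of 4 rows
totals = reduce_side_merge(combined)
# totals[("u1", "2024-01-01")] == 7.0
```

Because the input is already partitioned by event_date, a strong answer would also note that including event_date in the grouping key lets each day be aggregated without mixing partitions, keeping the shuffle per-day rather than global.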
Hard · System Design
Design an autoscaling architecture to support mixed workloads on a shared cluster: (A) large synchronous training jobs that require N GPUs simultaneously, and (B) low-latency inference services with unpredictable spiky traffic. Explain scheduler decisions, preemption strategies, resource pools, and how to meet both cost and latency objectives.
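One scheduler decision at the heart of this design is priority-based preemption: letting a high-priority inference spike reclaim GPUs from lower-priority training jobs. A toy sketch under assumed names and an intentionally simple "cheapest-to-kill first" policy (real schedulers weigh checkpoint cost, gang-scheduling constraints, and fairness):

```python
# Toy admission decision for a shared GPU pool: a request may preempt
# strictly lower-priority running jobs, smallest first, until it fits.
def schedule(total_gpus, running, request):
    """running: list of {name, gpus, priority} dicts (higher = more important).
    request: {name, gpus, priority}. Returns (admitted, preempted_names)."""
    free = total_gpus - sum(j["gpus"] for j in running)
    preempted = []
    candidates = sorted(
        (j for j in running if j["priority"] < request["priority"]),
        key=lambda j: j["gpus"],      # evict the cheapest jobs first
    )
    for j in candidates:
        if free >= request["gpus"]:
            break
        free += j["gpus"]
        preempted.append(j["name"])
    if free >= request["gpus"]:
        return True, preempted
    return False, []                   # cannot fit even with preemption

running = [{"name": "train_big", "gpus": 6, "priority": 1},
           {"name": "train_small", "gpus": 1, "priority": 1}]
admitted, preempted = schedule(
    8, running, {"name": "infer_spike", "gpus": 2, "priority": 9})
# free=1; evicting train_small frees enough -> admitted, only it is preempted
```

A fuller answer would pair this with separate resource pools (a protected inference floor plus a preemptible training pool) so large synchronous jobs checkpoint and requeue rather than being silently killed.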