InterviewStack.io

Capacity Planning and Resource Optimization Questions

Covers forecasting, provisioning, and operating compute, memory, storage, and network resources efficiently to meet demand and service-level objectives. Key skills include monitoring resource utilization metrics such as CPU usage, memory consumption, storage I/O, and network throughput; analyzing historical trends and workload patterns to predict future demand; and planning capacity additions, safety margins, and buffer sizing.

Candidates should understand vertical versus horizontal scaling, autoscaling policy design and cooldowns, right-sizing instances or containers, workload placement and isolation, load-balancing algorithms, and the use of spot or preemptible capacity for interruptible workloads. Practical topics include storage planning and archival strategies, database memory tuning and buffer sizing, batching and off-peak processing, model compression and inference optimization for machine learning workloads, alerts and dashboards, stress and validation testing of planned changes, and methods to verify that capacity decisions meet both performance and cost objectives.

Easy · Technical
25 practiced
Describe common autoscaling strategies for ML serving: metric-based (CPU/memory), request-based (QPS), custom metrics (GPU utilization, queue length), and predictive autoscaling. Explain how cooldowns and stabilization windows influence behavior and give one example scenario where you would increase cooldowns to prevent thrashing.
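As a starting point for the metric-based strategy, the Kubernetes Horizontal Pod Autoscaler computes its target as desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), clamped to the configured bounds. A minimal sketch of that formula (function name and parameters are illustrative, not a real API):

```python
import math

def hpa_desired_replicas(current_replicas: int, current_metric: float,
                         target_metric: float,
                         min_replicas: int, max_replicas: int) -> int:
    """Reactive, metric-based scaling in the style of the Kubernetes HPA:
    desired = ceil(current * currentMetric / targetMetric),
    clamped to [min_replicas, max_replicas]."""
    desired = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, desired))
```

For example, 4 replicas at 90% CPU against a 60% target scale to ceil(4 × 1.5) = 6. Cooldowns and stabilization windows then gate how often this formula's output is acted on, which is what prevents thrashing when the metric oscillates around the target.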
Hard · Technical
28 practiced
Implement a Python simulator that models autoscaler behavior over time. Input: list of request rates per second (time-series), replica startup time S seconds, per-replica capacity R requests/sec, cooldown period C seconds, min_replicas, and max_replicas. Output: time-series of replica counts over the same time steps. Focus on correctness and clear handling of startup delays and cooldowns; performance is secondary.
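One possible sketch of such a simulator, under simplifying assumptions (scale-down is instantaneous, one scaling action per cooldown window, and the desired count is simply ceil(rate / capacity)):

```python
import math

def simulate_autoscaler(rates, startup_s, capacity_r, cooldown_c,
                        min_replicas, max_replicas):
    """Return the number of *ready* replicas at each time step.

    rates      : list of request rates (req/s), one entry per second
    startup_s  : seconds a new replica takes to become ready
    capacity_r : requests/sec one replica can serve
    cooldown_c : seconds that must pass between scaling actions
    """
    ready = min_replicas           # replicas currently serving traffic
    pending = []                   # (ready_time, count) for booting replicas
    last_action = -cooldown_c      # so an action is allowed at t = 0
    history = []
    for t, rate in enumerate(rates):
        # Booting replicas become ready once the startup delay elapses.
        ready += sum(c for rt, c in pending if rt <= t)
        pending = [(rt, c) for rt, c in pending if rt > t]

        desired = max(min_replicas,
                      min(max_replicas, math.ceil(rate / capacity_r)))
        provisioned = ready + sum(c for _, c in pending)

        if t - last_action >= cooldown_c:
            if desired > provisioned:
                # Scale up: new replicas are not ready until t + startup_s.
                pending.append((t + startup_s, desired - provisioned))
                last_action = t
            elif desired < ready:
                # Scale down: assumed instantaneous in this sketch.
                ready = desired
                last_action = t
        history.append(ready)
    return history
```

Counting pending replicas in `provisioned` is the key correctness detail: without it, the simulator keeps requesting capacity that is already booting and overshoots during the startup delay.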
Medium · System Design
28 practiced
Design an autoscaling policy for a Kubernetes-based inference service that must keep p95 latency under 200ms, sustain 500 req/s on average, and tolerate spikes up to 1000 req/s for short periods. Specify metrics, target thresholds, replica startup time assumptions, cooldowns, max/min replicas, and how to prevent oscillation during bursts.
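A useful first step in this design is the sizing arithmetic. The per-replica throughput and headroom below are assumed numbers (the question does not give them), purely to illustrate how min/max replica bounds might be derived:

```python
import math

# Assumed, not given in the question: measured per-replica throughput at
# which p95 stays under 200 ms, and spare headroom kept so bursts do not
# saturate replicas before the autoscaler reacts.
per_replica_rps = 50
headroom = 0.30

def replicas_for(load_rps):
    """Replicas needed to serve load_rps while reserving headroom."""
    return math.ceil(load_rps / (per_replica_rps * (1 - headroom)))

baseline = replicas_for(500)    # steady-state target -> candidate min_replicas
peak = replicas_for(1000)       # burst ceiling       -> candidate max_replicas
```

With these assumptions the baseline is 15 replicas and the burst ceiling 29; replica startup time then determines whether the policy must pre-provision toward the ceiling or can scale reactively during a spike.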
Hard · Technical
26 practiced
Design an experiment and KPI framework to evaluate moving a model's inference from CPU to GPU. Define sample size, statistical power considerations, metrics to track (latency percentiles, throughput, cost per request, error rate), and a safe rollout strategy to ensure performance and cost improvements hold in production.
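For the sample-size part of this question, one standard approximation is the two-sample z-test formula n = 2(z_α/2 + z_β)²σ²/δ² per group. A minimal sketch (defaults correspond to α = 0.05 two-sided and 80% power; latency percentiles are not normally distributed, so treat this as a rough lower bound):

```python
import math

def sample_size_two_sample(sigma, delta, z_alpha=1.96, z_beta=0.84):
    """Per-group sample size for detecting a mean difference of `delta`
    with standard deviation `sigma`:
        n = 2 * (z_alpha + z_beta)^2 * sigma^2 / delta^2
    Defaults give ~80% power at two-sided alpha = 0.05."""
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2)
```

For instance, detecting a 5 ms mean-latency improvement with σ = 20 ms needs roughly 251 requests per arm; halving the detectable effect quadruples the requirement.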
Hard · System Design
25 practiced
Design a monitoring and alerting scheme to detect degraded ML inference performance caused by resource starvation (CPU/GPU/memory) versus model drift or data skew. List signals to capture, how to create composite alerts to reduce noise, suggested thresholds, and runbook actions for each alert type.
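The core of the starvation-versus-drift distinction is a composite condition: resource starvation shows high latency *together with* saturated CPU/GPU/memory, while drift shows prediction-quality degradation with healthy resources. A toy classifier sketch (all thresholds and the drift score are illustrative assumptions):

```python
def classify_degradation(p95_latency_ms, cpu_util, gpu_util, mem_util,
                         prediction_drift_score,
                         latency_slo_ms=200.0, util_threshold=0.90,
                         drift_threshold=0.3):
    """Toy composite-alert logic; thresholds are illustrative.
    high latency + saturated resources  -> resource starvation
    drifting predictions, healthy nodes -> model drift / data skew"""
    if (p95_latency_ms <= latency_slo_ms
            and prediction_drift_score <= drift_threshold):
        return "ok"
    saturated = max(cpu_util, gpu_util, mem_util) >= util_threshold
    if p95_latency_ms > latency_slo_ms and saturated:
        return "resource-starvation"
    if prediction_drift_score > drift_threshold and not saturated:
        return "model-drift-or-data-skew"
    return "needs-investigation"
```

Requiring two correlated signals before paging is what keeps the alert noise down; the residual "needs-investigation" bucket maps to a runbook step rather than an automated diagnosis.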
