InterviewStack.io

Capacity Planning and Resource Optimization Questions

Covers forecasting, provisioning, and operating compute, memory, storage, and network resources efficiently to meet demand and service-level objectives. Key skills include monitoring resource utilization metrics such as CPU usage, memory consumption, storage I/O, and network throughput; analyzing historical trends and workload patterns to predict future demand; and planning capacity additions, safety margins, and buffer sizing.

Candidates should understand vertical versus horizontal scaling, autoscaling policy design and cooldowns, right-sizing instances or containers, workload placement and isolation, load-balancing algorithms, and the use of spot or preemptible capacity for interruptible workloads. Practical topics include storage planning and archival strategies, database memory tuning and buffer sizing, batching and off-peak processing, model compression and inference optimization for machine-learning workloads, alerts and dashboards, stress and validation testing of planned changes, and methods to verify that capacity decisions meet both performance and cost objectives.

Hard · System Design
Propose an architecture for cost-aware autoscaling that optimizes dollars per inference while respecting latency SLOs. Include how to measure per-replica cost, incorporate spot pricing and expected preemption, predict near-term demand, and implement the decision engine (rule-based vs RL). Discuss trade-offs and a rollout plan.
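One way a candidate might begin the "measure per-replica cost" part of this question is sketched below. This is a minimal illustration, not a reference answer: the function name, the simple preemption model (each preemption costs a fixed number of minutes of lost serving per hour), and all parameters are assumptions introduced here.

```python
def cost_per_inference(hourly_price: float,
                       req_per_s: float,
                       preemption_rate_per_hour: float = 0.0,
                       recovery_minutes: float = 0.0) -> float:
    """Estimated dollars per inference for a single replica.

    Hypothetical model: the useful fraction of each hour is reduced
    by expected preemptions, each costing `recovery_minutes` of
    lost serving time while the replica restarts and warms up.
    """
    lost_fraction = min(1.0, preemption_rate_per_hour * recovery_minutes / 60.0)
    effective_req_per_hour = req_per_s * 3600.0 * (1.0 - lost_fraction)
    if effective_req_per_hour <= 0:
        # A replica that serves nothing has unbounded cost per inference.
        return float("inf")
    return hourly_price / effective_req_per_hour
```

Under this model, a spot replica at 30% of the on-demand price with a modest preemption rate can still come out far cheaper per inference, which is the comparison the decision engine would run per replica pool.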
Easy · Technical
What is the purpose of stress and validation testing before changing capacity? Draft a minimal stress test plan to validate scaling a model-serving deployment from 10 to 100 replicas, including ramp patterns, metrics to collect, and exit criteria to consider the test successful.
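A ramp pattern for such a plan could be expressed as a simple step schedule. The sketch below is one illustrative way to generate the stages of a linear step ramp; the function name and shape are assumptions, not part of the question:

```python
def ramp_schedule(start_rps: float, peak_rps: float,
                  steps: int, hold_s: int):
    """Yield (target_rps, hold_seconds) stages for a linear step ramp.

    With steps >= 2 the first stage is start_rps and the last is
    peak_rps; each stage holds the target rate for hold_s seconds
    so metrics can stabilize before the next step.
    """
    if steps < 1:
        raise ValueError("steps must be >= 1")
    for i in range(steps):
        rps = start_rps + (peak_rps - start_rps) * i / max(1, steps - 1)
        yield (round(rps, 1), hold_s)
```

For example, ramping a deployment in four stages from its current load toward peak, holding each stage for five minutes, gives the autoscaler time to react and makes per-stage latency and error metrics comparable.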
Medium · Technical
Write a Python function that estimates the number of replicas required for an inference service given desired throughput T (req/s), per-replica max_concurrency C (requests served concurrently), and a safety buffer B (expressed as a decimal, e.g., 0.2 for 20%). Assume linear scaling and no queuing delays. Function signature: def estimate_replicas(T, C, B) -> int. Include handling of edge cases and a brief explanation of assumptions.
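A minimal sketch of one possible answer follows. It assumes, as a simplification the no-queuing premise allows, that C can be treated as the request rate one replica sustains; the validation choices are one reasonable interpretation of "edge cases", not the only one:

```python
import math

def estimate_replicas(T: float, C: float, B: float) -> int:
    """Estimate replicas needed to serve T req/s with safety buffer B.

    Assumptions: linear scaling, no queuing delay, and C interpreted
    as the request rate a single replica can sustain.
    """
    if T < 0 or B < 0:
        raise ValueError("T and B must be non-negative")
    if C <= 0:
        raise ValueError("C must be positive")
    if T == 0:
        return 0
    # Inflate demand by the buffer, then round up to whole replicas.
    return math.ceil(T * (1 + B) / C)
```

For instance, T=100 req/s, C=10, and B=0.2 yields ceil(120 / 10) = 12 replicas.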
Easy · Technical
What database metrics are most relevant for an online feature store used by low-latency inference (e.g., buffer cache hit ratio, IOPS, read/write latency, replication lag)? Explain how each metric influences capacity planning and what threshold changes might trigger scaling or caching interventions.
Medium · System Design
You operate a mixed CPU/GPU cluster. Describe a scheduling and workload placement strategy to maximize GPU utilization while ensuring CPU-only tasks do not preempt or fragment GPU capacity. Include Kubernetes primitives (taints/tolerations, node-affinity), bin-packing approaches, and how to handle GPU fragmentation and hot-standby resources.
