InterviewStack.io

Capacity Planning and Resource Optimization Questions

Covers forecasting, provisioning, and operating compute, memory, storage, and network resources efficiently to meet demand and service-level objectives. Key skills include monitoring resource utilization metrics such as CPU usage, memory consumption, storage I/O, and network throughput; analyzing historical trends and workload patterns to predict future demand; and planning capacity additions, safety margins, and buffer sizing.

Candidates should understand vertical versus horizontal scaling, autoscaling policy design and cooldowns, right-sizing instances or containers, workload placement and isolation, load-balancing algorithms, and the use of spot or preemptible capacity for interruptible workloads. Practical topics include storage planning and archival strategies, database memory tuning and buffer sizing, batching and off-peak processing, model compression and inference optimization for machine learning workloads, alerts and dashboards, stress and validation testing of planned changes, and methods to verify that capacity decisions meet both performance and cost objectives.

Easy · Technical
25 practiced
Describe common autoscaling strategies for ML serving: metric-based (CPU/memory), request-based (QPS), custom metrics (GPU utilization, queue length), and predictive autoscaling. Explain how cooldowns and stabilization windows influence behavior and give one example scenario where you would increase cooldowns to prevent thrashing.
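As a starting point for the metric-based strategy, the Kubernetes Horizontal Pod Autoscaler computes its target as desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), clamped to the configured bounds. A minimal sketch of that formula (function name and parameters are illustrative, not a real API):

```python
import math

def hpa_desired_replicas(current_replicas: int, current_metric: float,
                         target_metric: float,
                         min_replicas: int, max_replicas: int) -> int:
    """Reactive, metric-based scaling in the style of the Kubernetes HPA:
    desired = ceil(current * currentMetric / targetMetric),
    clamped to [min_replicas, max_replicas]."""
    desired = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, desired))
```

For example, 4 replicas at 90% CPU against a 60% target scale to ceil(4 × 1.5) = 6. Cooldowns and stabilization windows then gate how often this formula's output is acted on, which is what prevents thrashing when the metric oscillates around the target.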
Hard · Technical
28 practiced
Implement a Python simulator that models autoscaler behavior over time. Input: list of request rates per second (time-series), replica startup time S seconds, per-replica capacity R requests/sec, cooldown period C seconds, min_replicas, and max_replicas. Output: time-series of replica counts over the same time steps. Focus on correctness and clear handling of startup delays and cooldowns; performance is secondary.
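One possible sketch of such a simulator, under simplifying assumptions (scale-down is instantaneous, one scaling action per cooldown window, and the desired count is simply ceil(rate / capacity)):

```python
import math

def simulate_autoscaler(rates, startup_s, capacity_r, cooldown_c,
                        min_replicas, max_replicas):
    """Return the number of *ready* replicas at each time step.

    rates      : list of request rates (req/s), one entry per second
    startup_s  : seconds a new replica takes to become ready
    capacity_r : requests/sec one replica can serve
    cooldown_c : seconds that must pass between scaling actions
    """
    ready = min_replicas           # replicas currently serving traffic
    pending = []                   # (ready_time, count) for booting replicas
    last_action = -cooldown_c      # so an action is allowed at t = 0
    history = []
    for t, rate in enumerate(rates):
        # Booting replicas become ready once the startup delay elapses.
        ready += sum(c for rt, c in pending if rt <= t)
        pending = [(rt, c) for rt, c in pending if rt > t]

        desired = max(min_replicas,
                      min(max_replicas, math.ceil(rate / capacity_r)))
        provisioned = ready + sum(c for _, c in pending)

        if t - last_action >= cooldown_c:
            if desired > provisioned:
                # Scale up: new replicas are not ready until t + startup_s.
                pending.append((t + startup_s, desired - provisioned))
                last_action = t
            elif desired < ready:
                # Scale down: assumed instantaneous in this sketch.
                ready = desired
                last_action = t
        history.append(ready)
    return history
```

Counting pending replicas in `provisioned` is the key correctness detail: without it, the simulator keeps requesting capacity that is already booting and overshoots during the startup delay.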
Medium · System Design
28 practiced
Design an autoscaling policy for a Kubernetes-based inference service that must keep p95 latency under 200ms, sustain 500 req/s on average, and tolerate spikes up to 1000 req/s for short periods. Specify metrics, target thresholds, replica startup time assumptions, cooldowns, max/min replicas, and how to prevent oscillation during bursts.
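A useful first step in this design is the sizing arithmetic. The per-replica throughput and headroom below are assumed numbers (the question does not give them), purely to illustrate how min/max replica bounds might be derived:

```python
import math

# Assumed, not given in the question: measured per-replica throughput at
# which p95 stays under 200 ms, and spare headroom kept so bursts do not
# saturate replicas before the autoscaler reacts.
per_replica_rps = 50
headroom = 0.30

def replicas_for(load_rps):
    """Replicas needed to serve load_rps while reserving headroom."""
    return math.ceil(load_rps / (per_replica_rps * (1 - headroom)))

baseline = replicas_for(500)    # steady-state target -> candidate min_replicas
peak = replicas_for(1000)       # burst ceiling       -> candidate max_replicas
```

With these assumptions the baseline is 15 replicas and the burst ceiling 29; replica startup time then determines whether the policy must pre-provision toward the ceiling or can scale reactively during a spike.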
Hard · Technical
26 practiced
Design an experiment and KPI framework to evaluate moving a model's inference from CPU to GPU. Define sample size, statistical power considerations, metrics to track (latency percentiles, throughput, cost per request, error rate), and a safe rollout strategy to ensure performance and cost improvements hold in production.
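For the sample-size part of this question, one standard approximation is the two-sample z-test formula n = 2(z_α/2 + z_β)²σ²/δ² per group. A minimal sketch (defaults correspond to α = 0.05 two-sided and 80% power; latency percentiles are not normally distributed, so treat this as a rough lower bound):

```python
import math

def sample_size_two_sample(sigma, delta, z_alpha=1.96, z_beta=0.84):
    """Per-group sample size for detecting a mean difference of `delta`
    with standard deviation `sigma`:
        n = 2 * (z_alpha + z_beta)^2 * sigma^2 / delta^2
    Defaults give ~80% power at two-sided alpha = 0.05."""
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2)
```

For instance, detecting a 5 ms mean-latency improvement with σ = 20 ms needs roughly 251 requests per arm; halving the detectable effect quadruples the requirement.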
Hard · System Design
25 practiced
Design a monitoring and alerting scheme to detect degraded ML inference performance caused by resource starvation (CPU/GPU/memory) versus model drift or data skew. List signals to capture, how to create composite alerts to reduce noise, suggested thresholds, and runbook actions for each alert type.
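The core of the starvation-versus-drift distinction is a composite condition: resource starvation shows high latency *together with* saturated CPU/GPU/memory, while drift shows prediction-quality degradation with healthy resources. A toy classifier sketch (all thresholds and the drift score are illustrative assumptions):

```python
def classify_degradation(p95_latency_ms, cpu_util, gpu_util, mem_util,
                         prediction_drift_score,
                         latency_slo_ms=200.0, util_threshold=0.90,
                         drift_threshold=0.3):
    """Toy composite-alert logic; thresholds are illustrative.
    high latency + saturated resources  -> resource starvation
    drifting predictions, healthy nodes -> model drift / data skew"""
    if (p95_latency_ms <= latency_slo_ms
            and prediction_drift_score <= drift_threshold):
        return "ok"
    saturated = max(cpu_util, gpu_util, mem_util) >= util_threshold
    if p95_latency_ms > latency_slo_ms and saturated:
        return "resource-starvation"
    if prediction_drift_score > drift_threshold and not saturated:
        return "model-drift-or-data-skew"
    return "needs-investigation"
```

Requiring two correlated signals before paging is what keeps the alert noise down; the residual "needs-investigation" bucket maps to a runbook step rather than an automated diagnosis.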
