InterviewStack.io

Infrastructure Scaling and Capacity Planning Questions

Operational and infrastructure-level planning to ensure systems meet current demand and projected growth. Topics include demand forecasting, headroom planning, and three-to-five-year capacity roadmaps; autoscaling policies and metrics-driven scaling using CPU, memory, and custom application metrics; load testing, benchmarking, and performance validation methodologies; cost modeling and right-sizing in cloud environments, and trade-offs between managed services and self-hosted solutions; designing non-disruptive upgrade and migration strategies; multi-region and availability-zone deployment strategies and their implications for data placement and latency; instrumentation and observability for capacity metrics; and mapping business growth projections into infrastructure acquisition and scaling decisions. Candidates should demonstrate how to translate requirements into capacity plans and how to validate assumptions with experiments and measurements.

Hard · System Design
Design an autoscaling control loop that accepts SLOs (e.g., 99th percentile latency target) and budget constraints as inputs, and makes scaling decisions to meet SLOs while minimizing cost. Describe the control strategy (feedback control, rules, or ML), safety limits, how you detect runaway cost conditions, and failure modes to handle.
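One possible starting point for an answer is a proportional feedback controller with safety clamps. The sketch below is illustrative only: `ScalerConfig`, `desired_replicas`, and every field name are hypothetical, and it assumes (crudely) that p99 latency scales inversely with replica count and that each replica has a flat hourly price.

```python
import math
from dataclasses import dataclass


@dataclass
class ScalerConfig:
    slo_p99_ms: float          # latency target (the SLO input)
    hourly_budget: float       # spend ceiling per hour (the budget input)
    cost_per_replica: float    # assumed flat hourly price per replica
    min_replicas: int = 2      # safety floor; wins over the budget cap if they conflict
    max_step: int = 4          # max replicas added or removed per tick
    deadband: float = 0.1      # ignore errors within +/-10% of the SLO


def desired_replicas(current: int, observed_p99_ms: float, cfg: ScalerConfig) -> int:
    """One tick of a proportional controller with rate limiting and a budget cap."""
    ratio = observed_p99_ms / cfg.slo_p99_ms
    if abs(ratio - 1.0) <= cfg.deadband:
        target = current                      # inside the deadband: hold steady
    else:
        target = math.ceil(current * ratio)   # scale in proportion to the error
    # Rate-limit the change to damp oscillation and runaway scale-up.
    target = max(current - cfg.max_step, min(current + cfg.max_step, target))
    # Budget cap: never request more replicas than the hourly budget pays for.
    budget_cap = int(cfg.hourly_budget // cfg.cost_per_replica)
    return max(cfg.min_replicas, min(budget_cap, target))
```

A fuller answer would also cover what happens when the budget cap binds while the SLO is still violated (alerting, load shedding), since this sketch simply stops scaling at that point.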
Hard · Technical
Discuss trade-offs of using spot/preemptible instances for batch and fault-tolerant workloads versus on-demand instances for latency-sensitive services. Propose fallback strategies for spot interruptions such as checkpointing, mixed instance groups, or fallback to on-demand, and how to quantify risk vs cost savings.
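To quantify risk versus savings, one rough approach is a first-order expected-cost model. The sketch below assumes a checkpointed batch job where each interruption loses, on average, half a checkpoint interval of work; it ignores interruptions during rework and restart overhead. The function names, prices, and interruption rate are hypothetical inputs, not real provider figures.

```python
def expected_spot_cost(job_hours: float, spot_price: float,
                       interrupts_per_hour: float, checkpoint_interval_hours: float) -> float:
    """Expected cost of a checkpointed job on spot capacity (first-order estimate)."""
    expected_interruptions = job_hours * interrupts_per_hour
    # Assumption: work since the last checkpoint is lost, half an interval on average.
    rework_hours = expected_interruptions * checkpoint_interval_hours / 2
    return (job_hours + rework_hours) * spot_price


def savings_fraction(job_hours: float, on_demand_price: float, spot_cost: float) -> float:
    """Fraction of the on-demand bill saved by running on spot instead."""
    return 1 - spot_cost / (job_hours * on_demand_price)


# Hypothetical numbers: 100-hour job, hourly checkpoints, 5% interruption chance per hour.
cost = expected_spot_cost(100, spot_price=0.30, interrupts_per_hour=0.05,
                          checkpoint_interval_hours=1.0)
saved = savings_fraction(100, on_demand_price=1.00, spot_cost=cost)
```

Varying the checkpoint interval in this model shows the trade-off directly: shorter intervals reduce rework but add checkpointing overhead, which the sketch does not model.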
Medium · Technical
Given table 'http_requests' with schema: id BIGINT, service_name TEXT, latency_ms INT, occurred_at TIMESTAMP, write a PostgreSQL query that computes per-service p95 (95th percentile) latency for the last 24 hours. Provide a query using percentile_cont or percentile_disc and explain indexes or partitioning you would add for performance on large tables.
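As a hedged sketch of one answer, the query below uses `percentile_cont` as an ordered-set aggregate; it is held in a Python string so the difference between the two percentile functions can be shown with small pure-Python reference implementations (`percentile_cont` and `percentile_disc` here are illustrative re-implementations, not PostgreSQL code). The index suggestion in the comment is an assumption about the workload, not a measured recommendation.

```python
import math

# Per-service p95 over the last 24 hours (PostgreSQL).
# A possible supporting index: CREATE INDEX ON http_requests (occurred_at, service_name);
# for very large tables, range partitioning on occurred_at is another option.
P95_QUERY = """
SELECT service_name,
       percentile_cont(0.95) WITHIN GROUP (ORDER BY latency_ms) AS p95_latency_ms
FROM http_requests
WHERE occurred_at >= now() - interval '24 hours'
GROUP BY service_name;
"""


def percentile_cont(values, p):
    """Continuous percentile: linearly interpolates between neighbouring values."""
    xs = sorted(values)
    pos = p * (len(xs) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])


def percentile_disc(values, p):
    """Discrete percentile: first actual value whose rank fraction reaches p."""
    xs = sorted(values)
    idx = max(math.ceil(p * len(xs)) - 1, 0)
    return xs[idx]
```

For example, over the latencies 10, 20, 30, 40 at `p = 0.5`, the continuous version interpolates to 25.0 while the discrete version returns the observed value 20.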
Hard · System Design
Define a non-disruptive rolling upgrade strategy for a stateful service that runs leader election and has a strict p99 latency SLA. Describe the sequence of steps, leader handover process, validation checks during upgrade, and how you would test the strategy before production rollout.
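One common ordering for such an upgrade is followers first, then a planned leader handover, then the old leader last. The sketch below illustrates only that sequencing; `rolling_upgrade` and its `upgrade`, `is_healthy`, and `hand_over` callbacks are hypothetical placeholders for whatever the real orchestration and health checks (for example, a p99 latency gate) would be.

```python
from typing import Callable, List


def rolling_upgrade(nodes: List[str], leader: str,
                    upgrade: Callable[[str], None],
                    is_healthy: Callable[[str], bool],
                    hand_over: Callable[[str, str], None]) -> List[str]:
    """Upgrade followers one at a time, then hand over leadership and upgrade the old leader.

    Returns the nodes that were upgraded and passed validation; the rollout
    halts at the first node that fails its health check.
    """
    done: List[str] = []
    followers = [n for n in nodes if n != leader]
    for node in followers:
        upgrade(node)
        if not is_healthy(node):
            return done                      # halt the rollout; leader untouched
        done.append(node)
    if followers:
        hand_over(leader, followers[0])      # planned handover to an upgraded follower
    upgrade(leader)
    if is_healthy(leader):
        done.append(leader)
    return done
```

Upgrading the leader last means a single planned leadership change instead of repeated elections, which is one way to protect a strict p99 target during the rollout.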
Hard · Technical
You want to compute a custom application-level metric (e.g., tail latency per user action) from distributed tracing data and use it for autoscaling decisions. Describe a reliable, low-latency pipeline to compute, aggregate, and ship this metric to your autoscaler: choices for collection, aggregation windows, handling dropped traces, smoothing, and how to prevent the metric pipeline from causing noisy scaling.
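For the smoothing and noise-prevention part, one simple option is an exponentially weighted moving average combined with separate scale-up and scale-down thresholds (hysteresis). The sketch below is a minimal illustration; `SmoothedSignal`, its thresholds, and the convention that a window with too few traces is reported as `None` are all assumptions made for the example.

```python
from typing import Optional


class SmoothedSignal:
    """EWMA-smoothed metric with hysteresis, to avoid flapping scale decisions."""

    def __init__(self, alpha: float, scale_up_at: float, scale_down_at: float):
        self.alpha = alpha                  # weight given to the newest sample
        self.scale_up_at = scale_up_at      # upper threshold
        self.scale_down_at = scale_down_at  # lower threshold; the gap is the hysteresis band
        self.value: Optional[float] = None

    def update(self, sample: Optional[float]) -> str:
        """Feed one aggregation window; returns 'scale_up', 'scale_down', or 'hold'."""
        if sample is None:
            return "hold"                   # missing or dropped data: never act on it
        if self.value is None:
            self.value = sample             # first sample seeds the average
        else:
            self.value = self.alpha * sample + (1 - self.alpha) * self.value
        if self.value > self.scale_up_at:
            return "scale_up"
        if self.value < self.scale_down_at:
            return "scale_down"
        return "hold"
```

Holding on missing windows keeps a broken trace pipeline from triggering scaling by itself, though a real design would also alert when the metric stays absent for too long.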
