InterviewStack.io

Infrastructure Scaling and Capacity Planning Questions

Operational and infrastructure-level planning to ensure systems meet current demand and projected growth. Topics include: forecasting demand, headroom planning, and three-to-five-year capacity roadmaps; autoscaling policies and metrics-driven scaling using CPU, memory, and custom application metrics; load testing, benchmarking, and performance-validation methodologies; cost modeling and right-sizing in cloud environments, including the trade-offs between managed services and self-hosted solutions; designing non-disruptive upgrade and migration strategies; multi-region and availability-zone deployment strategies and their implications for data placement and latency; instrumentation and observability for capacity metrics; and mapping business growth projections onto infrastructure acquisition and scaling decisions. Candidates should demonstrate how to translate requirements into capacity plans and how to validate assumptions with experiments and measurements.
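As a taste of the forecasting topics above, here is a minimal sketch of mapping a growth projection onto a node-count roadmap. Every figure below (baseline load, per-node capacity, growth rate, headroom target) is an assumption chosen for illustration:

```python
# Hypothetical capacity-headroom arithmetic: project node counts from a
# baseline throughput, an assumed growth rate, and a target headroom.
import math

PEAK_RPS = 12_000        # measured peak requests/sec (assumed baseline)
RPS_PER_NODE = 900       # per-node capacity from load tests (assumed)
ANNUAL_GROWTH = 0.35     # 35% YoY traffic growth (business projection)
HEADROOM = 0.30          # keep 30% spare capacity for spikes and failover

def nodes_needed(rps: float) -> int:
    """Nodes required so utilization stays below (1 - HEADROOM)."""
    return math.ceil(rps / (RPS_PER_NODE * (1 - HEADROOM)))

for year in range(4):  # a three-to-five-year roadmap would extend this range
    projected = PEAK_RPS * (1 + ANNUAL_GROWTH) ** year
    print(f"year {year}: ~{projected:,.0f} rps -> {nodes_needed(projected)} nodes")
```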

Hard · System Design
Plan a rolling, non-disruptive upgrade strategy for a large Spark fleet moving across major versions (e.g., Spark 2.x → 3.x) while it continues to run dozens of ETL jobs. Address compatibility testing, the ability to run mixed-version clusters, shuffle file format changes, Hive Metastore compatibility, routing jobs to specific versions, smoke/validation suites, and rollback procedures.
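One way into the routing and validation parts of this question is a sketch like the following, assuming a hypothetical job registry and fleet layout: (1) pin each job to a Spark version until it passes validation, and (2) run an output-parity check as part of the smoke suite. The `spark-submit` paths, job names, and output locations are all invented:

```python
# Hedged sketch: version routing plus an output-parity check for a
# Spark 2.x -> 3.x migration. Paths and the registry are hypothetical.
import subprocess

VALIDATED_ON_3X = {"daily_sales_etl", "sessionize_users"}  # assumed registry

def submit(job_name: str, script: str) -> None:
    """Route each job to the fleet matching its validation status."""
    spark_home = "/opt/spark-3.x" if job_name in VALIDATED_ON_3X else "/opt/spark-2.x"
    subprocess.run([f"{spark_home}/bin/spark-submit", script], check=True)

def outputs_match(path_2x: str, path_3x: str, spark) -> bool:
    """Parity check: same row count and an order-insensitive checksum.
    Runs on a Spark 3.x session (xxhash64 requires Spark 3.0+); it can
    still read Parquet written by the 2.x fleet."""
    from pyspark.sql import functions as F
    a, b = spark.read.parquet(path_2x), spark.read.parquet(path_3x)
    if a.count() != b.count():
        return False
    def checksum(df):
        # Hash every row, then sum the hashes so row order does not matter.
        return df.select(F.sum(F.xxhash64(*df.columns)).alias("cs")).first()["cs"]
    return checksum(a) == checksum(b)
```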
Hard · System Design
Design an observability pipeline and architecture capable of ingesting, storing, and querying high-cardinality capacity and application metrics at scale to support long-term capacity planning. Discuss ingestion models (agent-based pull vs. direct push), TSDB vs. data lake for metrics storage, downsampling/rollups and when to apply them, label-cardinality controls, the query patterns capacity planners require, retention tiers, cost implications, and how to keep metrics available during incidents.
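For the cardinality-control piece, here is a hedged sketch of an ingest-side guard that caps distinct label sets per metric and strips high-churn labels once the budget is exhausted. The budget, the label names, and the in-memory bookkeeping are simplifications; real systems enforce this in the ingestion gateway or the TSDB itself:

```python
# Sketch of a per-metric label-cardinality budget at ingest time.
# All limits and label names are hypothetical.
from collections import defaultdict

CARDINALITY_BUDGET = 10_000            # max distinct label sets per metric
seen: dict[str, set] = defaultdict(set)

def admit(metric: str, labels: dict[str, str]) -> dict[str, str]:
    """Admit a series if under budget; otherwise strip labels that look
    like per-pod or per-request identifiers and aggregate into the rest."""
    key = tuple(sorted(labels.items()))
    if key in seen[metric] or len(seen[metric]) < CARDINALITY_BUDGET:
        seen[metric].add(key)
        return labels
    reduced = {k: v for k, v in labels.items() if k not in {"pod", "request_id"}}
    seen[metric].add(tuple(sorted(reduced.items())))
    return reduced
```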
Hard · System Design
Design an Infrastructure as Code and CI/CD pipeline that enables safe, auditable capacity changes (vertical and horizontal) to production clusters. Describe pipeline stages (plan, validation, canary infra, capacity smoke tests), gating and approval for risky changes, automated rollback, drift detection, and audit logging. Explain how to run capacity-affecting changes safely in production and how to verify actual impact.
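As one concrete gating stage, here is a sketch (the resource field names and the 20% threshold are assumptions) that inspects a `terraform show -json` plan and forces manual approval when a change shrinks capacity or resizes it sharply in a single step:

```python
# Hedged sketch of a plan-time capacity gate for a CI/CD pipeline.
# Parses Terraform's JSON plan output; field names below are assumed
# examples of capacity-bearing attributes, not an exhaustive list.
import json
import sys

RISKY_FIELDS = {"desired_capacity", "node_count", "instance_count"}

def needs_approval(plan_path: str) -> bool:
    with open(plan_path) as f:
        plan = json.load(f)
    for rc in plan.get("resource_changes", []):
        before = rc.get("change", {}).get("before") or {}
        after = rc.get("change", {}).get("after") or {}
        for field in RISKY_FIELDS & before.keys() & after.keys():
            old, new = before[field], after[field]
            # Flag any shrink, or any single-step resize beyond 20%.
            if new < old or (old and abs(new - old) / old > 0.20):
                return True
    return False

if __name__ == "__main__":
    # Nonzero exit blocks the pipeline stage until a human approves.
    sys.exit(1 if needs_approval(sys.argv[1]) else 0)
```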
Medium · System Design
You must validate cluster sizing for Spark jobs that process 10 TB of daily data and should complete within 2 hours. Design a load-testing and benchmarking plan listing representative job variants, dataset sampling strategy, executor/core/memory configs to test, shuffle/partition tuning to vary, and key metrics to collect (shuffle I/O, GC, CPU, task skew). Explain how to convert test results into a production cluster sizing and scheduling plan that supports concurrency.
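A worked example of the final step, converting benchmark results into a sizing. All measured figures below are hypothetical outputs of the load tests the question asks for:

```python
# Back-of-the-envelope Spark sizing from (assumed) benchmark numbers:
# scale per-executor throughput up to the 10 TB / 2 h target, then pad
# for skew and concurrent jobs.
import math

DATA_TB = 10.0
DEADLINE_H = 2.0
MEASURED_GB_PER_EXEC_HOUR = 55.0   # from a benchmark on a sampled dataset (assumed)
SKEW_FACTOR = 1.3                  # slowest-task penalty observed in tests (assumed)
CONCURRENCY = 2                    # ETL jobs expected to overlap (assumed)

required_gb_per_hour = DATA_TB * 1024 / DEADLINE_H
executors_per_job = math.ceil(required_gb_per_hour / MEASURED_GB_PER_EXEC_HOUR * SKEW_FACTOR)
cluster_executors = executors_per_job * CONCURRENCY
print(f"{executors_per_job} executors per job, {cluster_executors} cluster-wide")
# With an assumed 4 cores / 16 GB per executor, map this onto instance
# types and leave 20-30% headroom before declaring the sizing validated.
```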
Hard · System Design
Architect a cross-region data placement and replication strategy that provides low-latency reads for EU and US customers while meeting data residency (GDPR-like) constraints and minimizing cross-region egress costs. Discuss strategies for selective replication, partitioning, encryption and KMS key separation, access control, and logging/audit to prove compliance.
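One building block for the selective-replication and key-separation requirements: a hedged sketch of a placement policy resolved per record class rather than per cluster. The region names, record classes, and key identifiers are illustrative only:

```python
# Sketch of residency-aware placement: each record class declares where it
# may be written, where it may be replicated, and which regional KMS key
# encrypts it. All rules below are hypothetical examples.
RESIDENCY = {
    "eu_customer":    {"home": "eu-west-1", "replicas": ["eu-central-1"]},  # stays in EU
    "us_customer":    {"home": "us-east-1", "replicas": ["us-west-2"]},
    "shared_catalog": {"home": "us-east-1", "replicas": ["eu-west-1"]},     # no PII, free to copy
}

def placement(record_class: str, kms_keys: dict[str, str]) -> dict:
    """Resolve where a record may live and which regional key encrypts it."""
    rule = RESIDENCY[record_class]
    return {
        "write_region": rule["home"],
        "replicate_to": rule["replicas"],   # selective, same-jurisdiction only
        "kms_key": kms_keys[rule["home"]],  # per-region key separation
    }

keys = {"eu-west-1": "kms-key-eu-example", "us-east-1": "kms-key-us-example"}
print(placement("eu_customer", keys))
```

The design point worth stating in an answer: putting the rule on the record class (not the cluster) is what lets a single logical dataset replicate selectively while leaving an auditable trail of why each copy exists.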
