InterviewStack.io

High Availability and Disaster Recovery Questions

This topic covers designing systems that remain available and recoverable in the face of infrastructure failures, outages, and disasters. Candidates should be able to define and reason about Recovery Time Objective (RTO) and Recovery Point Objective (RPO) targets and translate service level agreement (SLA) goals, from 99.9 percent to 99.999 percent availability, into architecture choices. Core topics include redundancy strategies such as N+1 and N+2, active-active and active-passive deployment patterns, multi-availability-zone and multi-region topologies, and the trade-offs between same-region high availability and cross-region disaster recovery. Discuss load balancing and traffic shaping, redundant load balancer design, and algorithms such as round robin, least connections, and consistent hashing. Explain failover detection, health checks, automated versus manual failover, convergence and recovery timing, and the orchestration of failover and traffic rerouting. Cover backup, snapshot, and restore strategies; replication and consistency trade-offs for stateful components; leader election and split-brain mitigation; runbooks and recovery playbooks; disaster recovery testing and drills; and cost and operational trade-offs. Include capacity planning, autoscaling, network redundancy, and security and infrastructure hardening considerations, so that identity, key management, and logging services themselves remain available and recoverable. Emphasize monitoring, observability, and alerting on availability signals, and validation through chaos engineering and regular failover exercises.
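As a rough illustration of turning an availability percentage into something an architecture can be checked against, the sketch below converts a target into a yearly downtime budget and combines component availabilities. The function names are hypothetical, and the parallel formula assumes replica failures are independent, which correlated outages (shared power, shared deploys) often violate.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(availability_percent: float) -> float:
    """Yearly downtime, in minutes, permitted by an availability target."""
    if not 0 < availability_percent <= 100:
        raise ValueError("availability_percent must be in (0, 100]")
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


def serial_availability(*components: float) -> float:
    """Availability of a chain in which every component must be up."""
    result = 1.0
    for availability in components:
        result *= availability
    return result


def parallel_availability(single: float, replicas: int) -> float:
    """Availability of redundant replicas where any one can serve traffic.

    Assumption: replicas fail independently of each other.
    """
    return 1 - (1 - single) ** replicas


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        budget = downtime_budget_minutes(target)
        print(f"{target}% allows about {budget:.1f} minutes of downtime per year")
```

Under these assumptions, 99.9 percent leaves roughly 525.6 minutes of downtime per year and 99.999 percent leaves roughly 5.3 minutes, which is why the higher targets tend to rule out manual failover.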

Easy | Technical
Define redundancy patterns N, N+1, and N+2. Give concrete examples (e.g., web servers, load balancers, storage controllers) where each pattern is appropriate. Explain how you would size spare capacity for an expected peak load, and which metrics or alerts you would use to detect when spares are becoming insufficient.
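One possible way to reason about the spare-capacity part of this question is sketched below. It sizes N from peak load at a target utilization, adds k spares, and checks whether the fleet still covers peak after a number of failures. The function names, the 75 percent utilization target, and the request rates are illustrative assumptions, not values taken from the question.

```python
import math


def instances_required(
    peak_load: float,
    per_instance_capacity: float,
    spares: int = 1,
    target_utilization: float = 0.75,
) -> int:
    """Return N + spares, where N serves peak load at the target utilization."""
    if per_instance_capacity <= 0 or not 0 < target_utilization <= 1:
        raise ValueError("capacity must be positive and utilization in (0, 1]")
    usable_capacity = per_instance_capacity * target_utilization
    n = math.ceil(peak_load / usable_capacity)
    return n + spares


def survives_failures(
    total_instances: int,
    peak_load: float,
    per_instance_capacity: float,
    failures: int,
) -> bool:
    """True if the remaining instances can still carry peak load at full capacity."""
    remaining = total_instances - failures
    return remaining * per_instance_capacity >= peak_load


if __name__ == "__main__":
    # Hypothetical numbers: 9,000 requests/s at peak, 1,000 requests/s per instance.
    print(instances_required(9000, 1000, spares=1))  # N+1
    print(instances_required(9000, 1000, spares=2))  # N+2
```

A check like `survives_failures` can also back an alert: if it returns False for the number of failures the pattern is meant to tolerate, the spares are no longer sufficient for the current peak.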
Medium | Behavioral
Tell me about a time you recommended a more expensive disaster-recovery approach to a customer or internal stakeholder. How did you present the technical trade-offs, quantify business risk (e.g., revenue impact, customer churn), and influence the decision? If you don't have a direct example, describe how you would approach such a conversation.
Easy | Technical
List and explain the essential components of a failover runbook (playbook) for a critical service: preconditions, step-by-step actions, verification/validation steps, communication templates for stakeholders/customers, and rollback procedures. Provide an example sequence of the first five steps to fail over traffic to a DR region.
Medium | Technical
Design an autoscaling and capacity-planning approach for a microservices platform to maintain availability during sudden 5x traffic spikes. Discuss headroom sizing, warm pools, scaling cooldowns, stateful service scaling, circuit breakers, and strategies to avoid cascading failures when downstream systems are stressed.
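For the circuit-breaker part of this question, a minimal sketch may help frame the discussion. The class below is a hypothetical, single-threaded illustration rather than any particular library's API. It opens after a configurable number of consecutive failures, rejects calls while open, and allows a trial call once a reset timeout has elapsed. The injectable `clock` parameter is an assumption added to make the behavior testable.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"  # one trial call is allowed through
        return "open"

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            # Fail fast instead of adding load to a stressed downstream system.
            raise CircuitOpenError("downstream call skipped: circuit is open")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call re-opens the circuit immediately;
            # a closed circuit opens once the threshold is reached.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Failing fast in the open state is what limits cascading failures: callers stop queueing work behind a dependency that is already overloaded. A production design would also need thread safety and a cap on concurrent trial calls, which this sketch omits.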
Hard | Technical
Design a low-latency cache invalidation scheme across multiple regions to prevent stale reads during failover transitions. Discuss propagation protocols (pub/sub bridging, asynchronous invalidation), ordering guarantees, use of versioned keys or tombstones, and testing strategies to validate correctness under network partitions and race conditions.
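The versioned-keys and tombstone idea mentioned in this question can be sketched as follows. This is a simplified, in-memory illustration with hypothetical names. It assumes every key has a monotonically increasing version assigned by a single source of truth, so each regional cache can safely discard messages that arrive late or out of order.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Entry:
    version: int
    value: Optional[str]  # None marks a tombstone (the key was invalidated)


class RegionalCache:
    """Per-region cache that only accepts updates newer than what it holds."""

    def __init__(self) -> None:
        self._entries: Dict[str, Entry] = {}

    def apply(self, key: str, version: int, value: Optional[str]) -> bool:
        """Apply a replicated write; return False if it is stale or a duplicate."""
        current = self._entries.get(key)
        if current is not None and current.version >= version:
            return False
        self._entries[key] = Entry(version, value)
        return True

    def invalidate(self, key: str, version: int) -> bool:
        """Record a tombstone so older writes arriving later cannot resurrect the key."""
        return self.apply(key, version, None)

    def get(self, key: str) -> Optional[str]:
        entry = self._entries.get(key)
        return None if entry is None else entry.value


if __name__ == "__main__":
    cache = RegionalCache()
    cache.apply("user:1", 1, "alice")
    cache.invalidate("user:1", 3)             # invalidation arrives first
    print(cache.apply("user:1", 2, "stale"))  # late write is rejected
    print(cache.get("user:1"))                # still a miss
```

Keeping the tombstone rather than deleting the entry is the design choice that matters here: without it, the delayed version 2 write would be accepted and served as a stale read. Tombstones then need their own expiry policy, which this sketch leaves out.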
