InterviewStack.io

High Availability and Disaster Recovery Questions

This topic covers designing systems that remain available and recoverable in the face of infrastructure failures, outages, and disasters. Candidates should be able to define and reason about Recovery Time Objective (RTO) and Recovery Point Objective (RPO) targets, and to translate service level agreement (SLA) goals ranging from 99.9 percent to 99.999 percent availability into architecture choices.

Core topics include redundancy strategies such as N+1 and N+2, active-active and active-passive deployment patterns, multi-availability-zone and multi-region topologies, and the trade-offs between same-region high availability and cross-region disaster recovery. Be ready to discuss load balancing and traffic shaping, redundant load balancer design, and algorithms such as round robin, least connections, and consistent hashing. Explain failover detection, health checks, automated versus manual failover, convergence and recovery timing, and the orchestration of failover and rerouting. Cover backup, snapshot, and restore strategies; replication and consistency trade-offs for stateful components; leader election and split-brain mitigation; runbooks and recovery playbooks; disaster recovery testing and drills; and cost and operational trade-offs.

Also include capacity planning, autoscaling, network redundancy, and security and infrastructure hardening, so that identity, key management, and logging themselves remain available and recoverable. Emphasize monitoring, observability, and alerting on availability signals, with validation through chaos engineering and regular failover exercises.
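To make the SLA figures above concrete, the following minimal sketch converts an availability target into a yearly downtime budget and estimates the combined availability of redundant replicas. The function names are illustrative, and the replica formula assumes failures are fully independent, which real deployments rarely achieve (shared power, network, or software faults correlate failures).

```python
# Sketch: availability arithmetic under simplifying assumptions.
MINUTES_PER_YEAR = 365.25 * 24 * 60  # average year, including leap years


def downtime_minutes_per_year(availability: float) -> float:
    """Allowed downtime per year for an availability given as a fraction."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def parallel_availability(single: float, replicas: int) -> float:
    """Availability of `replicas` copies where any one copy is enough.

    Assumes independent failures; correlated failures make the real
    figure lower than this estimate.
    """
    return 1.0 - (1.0 - single) ** replicas


if __name__ == "__main__":
    for target in (0.999, 0.9999, 0.99999):
        budget = downtime_minutes_per_year(target)
        print(f"{target:.3%} allows about {budget:.1f} minutes of downtime per year")
    print(f"Two independent 99% replicas: {parallel_availability(0.99, 2):.4%}")
```

Under these assumptions, 99.9 percent leaves roughly 8.8 hours of downtime per year while 99.999 percent leaves a little over 5 minutes, which is why the higher targets generally rule out manual failover.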

Medium · Technical
Design a redundant load balancer pattern for high availability. Discuss active-passive floating IPs, active-active with anycast, and managed cloud LBs. Explain differences between L4 and L7 approaches and how each affects failover time, session affinity, and operational complexity.
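One possible angle on the session affinity part of this question: if every load balancer instance maps a session key to a backend with the same consistent hash ring, any instance can take over without shared session state, and removing a backend only remaps the keys that backend owned. The sketch below is illustrative only; the `HashRing` class, the backend names, and the virtual node count are assumptions, not part of the question.

```python
import bisect
import hashlib


class HashRing:
    """Minimal consistent hash ring with virtual nodes (illustrative)."""

    def __init__(self, nodes, vnodes=100):
        # Each node is placed on the ring `vnodes` times to even out load.
        ring = []
        for node in nodes:
            for i in range(vnodes):
                ring.append((self._hash(f"{node}#{i}"), node))
        ring.sort()
        self._ring = ring
        self._points = [point for point, _ in ring]

    @staticmethod
    def _hash(value: str) -> int:
        # First 8 bytes of SHA-256 give a stable 64-bit ring position.
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def node_for(self, key: str) -> str:
        """Return the first node clockwise from the key's ring position."""
        if not self._ring:
            raise ValueError("ring has no nodes")
        idx = bisect.bisect_right(self._points, self._hash(key))
        return self._ring[idx % len(self._ring)][1]


if __name__ == "__main__":
    full = HashRing(["backend-a", "backend-b", "backend-c"])
    degraded = HashRing(["backend-a", "backend-b"])  # backend-c has failed
    keys = [f"session-{i}" for i in range(1000)]
    moved = sum(full.node_for(k) != degraded.node_for(k) for k in keys)
    print(f"{moved} of {len(keys)} sessions changed backend after the failure")
```

Only sessions that were pinned to the failed backend move; sessions on the surviving backends keep their assignment, which is the property that distinguishes this approach from a plain modulo hash.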
Medium · Technical
Plan a disaster recovery drill for cross-region failover. Define objectives, scope (partial vs full), success criteria (RTO, RPO, customer impact), stakeholders to involve, data integrity and functional checks, rollback plan, and a post-mortem checklist.
Easy · Technical
Compare backup, snapshot, and replication as recovery strategies. For a stateful service where up to 1 hour of data loss is acceptable, recommend a hybrid approach and describe the restore process and validation checks you'd perform after restore.
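When reasoning about the one-hour data loss budget in this question, it can help to compute the data loss a given set of recovery points would actually produce for a failure at a given time. The sketch below is a hypothetical helper, not a prescribed answer; the function names and the sample timestamps are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def achieved_data_loss(recovery_points, failure_time):
    """Time between the newest usable recovery point and the failure.

    Returns None when no recovery point precedes the failure.
    """
    usable = [point for point in recovery_points if point <= failure_time]
    if not usable:
        return None
    return failure_time - max(usable)


def meets_rpo(recovery_points, failure_time, rpo=timedelta(hours=1)):
    """True if restoring from the newest usable point stays within the RPO."""
    loss = achieved_data_loss(recovery_points, failure_time)
    return loss is not None and loss <= rpo


if __name__ == "__main__":
    day = datetime(2024, 1, 1)
    nightly_backup = [day]  # one full backup at midnight
    half_hour_snapshots = [day + timedelta(minutes=30 * i) for i in range(48)]
    failure = day + timedelta(hours=13, minutes=20)

    print("nightly only:", meets_rpo(nightly_backup, failure))
    print("with 30-minute snapshots:", meets_rpo(half_hour_snapshots, failure))
```

In this example a nightly backup alone misses the target by many hours, while snapshots every 30 minutes keep the worst-case loss under the one-hour budget. The same check can be reused as a post-restore validation step by comparing the newest restored record's timestamp with the failure time.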
Hard · System Design
Design an orchestration system to perform coordinated failover of hundreds of microservices from a primary region to a DR region. Include dependency ordering, canary traffic shaping, database role changes, configuration synchronization, monitoring validation gates, and rollback procedures in case of partial failures.
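For the dependency ordering part of this question, one common starting point is a topological sort that groups services into waves, where every service in a wave can fail over in parallel once all earlier waves have passed their validation gates. The sketch below uses Python's standard `graphlib` module; the `failover_waves` helper and the service names are hypothetical.

```python
from graphlib import TopologicalSorter


def failover_waves(depends_on):
    """Group services into ordered waves based on their dependencies.

    `depends_on` maps each service to the set of services that must be
    healthy in the DR region before it can start.
    """
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted for deterministic output
        waves.append(ready)
        sorter.done(*ready)  # a real orchestrator would gate this on health checks
    return waves


if __name__ == "__main__":
    # Hypothetical service graph for illustration.
    services = {
        "database": set(),
        "cache": set(),
        "auth": {"database"},
        "api": {"auth", "database", "cache"},
        "web": {"api"},
    }
    for number, wave in enumerate(failover_waves(services), start=1):
        print(f"wave {number}: {', '.join(wave)}")
```

Walking the waves in reverse gives a natural rollback order, and the call to `done` is the point where monitoring validation gates would decide whether to proceed or roll back.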
Hard · System Design
Design network redundancy for cross-region traffic using multiple transit providers, BGP failover, and private links. Discuss detection and failover mechanisms, route convergence characteristics, how to test provider-level failover, and automation to orchestrate provider switchovers.
