InterviewStack.io

High Availability and Disaster Recovery Questions

Designing systems that remain available and recoverable in the face of infrastructure failures, outages, and disasters. Candidates should be able to define and reason about Recovery Time Objective (RTO) and Recovery Point Objective (RPO) targets and translate service level agreement goals, such as 99.9 percent to 99.999 percent availability, into architecture choices. Core topics include redundancy strategies such as N+1 and N+2, active-active and active-passive deployment patterns, multi-availability-zone and multi-region topologies, and the trade-offs between same-region high availability and cross-region disaster recovery. Discuss load balancing and traffic shaping, redundant load balancer design, and algorithms such as round robin, least connections, and consistent hashing. Explain failover detection, health checks, automated versus manual failover, convergence and recovery timing, and orchestration of failover and rerouting. Cover backup, snapshot, and restore strategies, replication and consistency trade-offs for stateful components, leader election and split-brain mitigation, runbooks and recovery playbooks, disaster recovery testing and drills, and cost and operational trade-offs. Include capacity planning, autoscaling, network redundancy, and security and infrastructure hardening considerations, so that identity, key management, and logging remain available and recoverable. Emphasize monitoring, observability, alerting on availability signals, and validation through chaos engineering and regular failover exercises.
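Of the load balancing algorithms named above, consistent hashing is the one candidates most often struggle to explain concretely. The following is a minimal sketch, not a production implementation: the `HashRing` class, the `vnodes` parameter, and the node names are all illustrative assumptions. It shows the key property for availability: when a backend is removed, only the keys that backend owned are remapped.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    # Take 64 bits of SHA-256 as a position on the ring.
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    """Illustrative consistent-hash ring with virtual nodes (hypothetical API)."""

    def __init__(self, nodes, vnodes=100):
        self._vnodes = vnodes
        self._points = []  # sorted ring positions
        self._owner = {}   # ring position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node):
        # Each node is placed at several positions to even out the load.
        # Position collisions between nodes are ignored in this sketch.
        for i in range(self._vnodes):
            point = _hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove(self, node):
        # Drop only this node's positions; all others stay where they were.
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def lookup(self, key):
        if not self._points:
            raise LookupError("ring is empty")
        # First position clockwise from the key's hash, wrapping around.
        i = bisect.bisect_right(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[i]]


ring = HashRing(["backend-a", "backend-b", "backend-c"])
keys = [f"session-{i}" for i in range(200)]
before = {k: ring.lookup(k) for k in keys}
ring.remove("backend-b")  # simulate a failed backend
after = {k: ring.lookup(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

In this sketch, every key in `moved` was previously owned by `backend-b`; keys served by the surviving backends keep their assignment, which is what limits cache and session churn during a failure.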

Hard · Technical
122 practiced
A startup cannot afford full multi-region active-active deployment but must meet RTO of 1 hour and RPO of 15 minutes for customer data. Design a cost-optimized DR plan including architecture choices (warm standby, cross-region replication), failover automation steps, testing cadence, and a high-level monthly cost trade-off explanation.
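One way to sanity-check a candidate plan against these targets is to add up its worst-case numbers. The sketch below assumes a warm standby fed by periodic replication; the `DrPlan` class, the step names, and every duration are hypothetical placeholders, not measurements or recommendations.

```python
from dataclasses import dataclass, field


@dataclass
class DrPlan:
    """Hypothetical model of a warm-standby DR plan (all values in minutes)."""

    replication_interval_min: float  # how often a copy is started
    replication_duration_min: float  # how long one copy takes to land
    failover_steps_min: dict = field(default_factory=dict)

    def worst_case_rpo(self) -> float:
        # A failure just before a copy completes loses everything written
        # since the previous copy started: one interval plus one duration.
        return self.replication_interval_min + self.replication_duration_min

    def estimated_rto(self) -> float:
        # Assumes the failover steps run sequentially.
        return sum(self.failover_steps_min.values())


plan = DrPlan(
    replication_interval_min=5,
    replication_duration_min=3,
    failover_steps_min={
        "detect_and_declare": 5,
        "promote_replica": 10,
        "scale_up_standby": 20,
        "switch_traffic": 5,
    },
)

meets_rpo = plan.worst_case_rpo() <= 15  # RPO target from the question
meets_rto = plan.estimated_rto() <= 60   # RTO target from the question
```

With these assumed numbers the worst-case data loss is 8 minutes and the estimated recovery time is 40 minutes, so both targets hold with some margin. A real answer would replace the placeholders with values measured during drills.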
Medium · Technical
67 practiced
Design a disaster recovery testing program that satisfies regulatory compliance while minimizing customer impact. Include frequency and types of drills (tabletop, partial failover, full failover), success criteria, rollback plans, stakeholder participation, and automation that can be used to scale tests safely.
Easy · Technical
69 practiced
Compare active-active and active-passive deployment patterns for a service. For each pattern, list pros and cons in terms of availability, consistency, operational complexity, and cost. Give two real-world scenarios where you would pick active-active and two where you would pick active-passive.
Easy · Technical
110 practiced
Define Recovery Time Objective (RTO) and Recovery Point Objective (RPO) in the context of a customer-facing web service. For a service with a 99.95% availability target, give the allowed downtime (monthly and yearly) and concrete example values for acceptable RTO and RPO, explain how you would measure them in production, and describe the architectural implications of those targets.
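The downtime allowed by an availability target is simple arithmetic, and working it out helps bound a realistic RTO. The helper below is an illustrative sketch; the function name and the 30-day month and 365-day year are assumptions.

```python
def downtime_budget_minutes(availability_pct: float, period_days: float) -> float:
    """Minutes of downtime allowed over a period at a given availability."""
    unavailable_fraction = 1 - availability_pct / 100
    return unavailable_fraction * period_days * 24 * 60


monthly = downtime_budget_minutes(99.95, 30)   # about 21.6 minutes
yearly = downtime_budget_minutes(99.95, 365)   # about 262.8 minutes (~4.4 hours)
```

Because the monthly budget is only about 21.6 minutes, an RTO of one hour would exhaust it in a single incident, which is why a target like this usually pushes the design toward automated failover.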
Easy · Technical
64 practiced
Describe DNS failover strategies and how TTL, provider health checks, and DNS caching affect failover speed and risk of split traffic. Include trade-offs between low TTLs and DNS query load, and describe how DNS CNAME chains or global load balancers change the approach.
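A rough way to reason about failover speed is to add the detection time to the time cached records take to expire. The function below is a simplified sketch with assumed parameter names; it ignores resolvers and clients that cache beyond the TTL, which in practice can extend the tail.

```python
def worst_case_dns_failover_seconds(
    check_interval_s: float,
    failure_threshold: int,
    ttl_s: float,
    propagation_s: float = 0,
) -> float:
    """Upper-bound estimate of the time until clients resolve the new target."""
    # The health checker needs `failure_threshold` consecutive failed probes.
    detection_s = check_interval_s * failure_threshold
    # After the record changes, a resolver may serve its cached answer
    # for up to one full TTL.
    return detection_s + propagation_s + ttl_s


# Example with assumed values: 30 s probes, 3 failures to trip, 60 s TTL.
estimate = worst_case_dns_failover_seconds(30, 3, 60)
```

With these assumed values the estimate is 150 seconds. Lowering the TTL shrinks the last term but increases query volume against the authoritative servers, which is the trade-off the question asks about.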
