High Availability and Disaster Recovery Questions

This topic covers designing systems that remain available and recoverable in the face of infrastructure failures, outages, and disasters. Candidates should be able to define and reason about Recovery Time Objective (RTO) and Recovery Point Objective (RPO) targets and translate service level agreement goals, from 99.9 percent to 99.999 percent availability, into architecture choices. Core topics include redundancy strategies such as N+1 and N+2, active-active and active-passive deployment patterns, multi-availability-zone and multi-region topologies, and the trade-offs between same-region high availability and cross-region disaster recovery. Discuss load balancing and traffic shaping, redundant load balancer design, and algorithms such as round robin, least connections, and consistent hashing. Explain failover detection, health checks, automated versus manual failover, convergence and recovery timing, and the orchestration of failover and rerouting. Cover backup, snapshot, and restore strategies; replication and consistency trade-offs for stateful components; leader election and split-brain mitigation; runbooks and recovery playbooks; disaster recovery testing and drills; and cost and operational trade-offs. Include capacity planning, autoscaling, network redundancy, and security and infrastructure hardening, so that identity, key management, and logging remain available and recoverable. Emphasize monitoring, observability, alerting on availability signals, and validation through chaos engineering and regular failover exercises.
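To make the step from an availability percentage to an architecture choice concrete, the arithmetic can be sketched as follows. This is a minimal illustration, not a sizing tool: the function names are hypothetical, a year is taken as 365 days, and the parallel formula assumes replicas fail independently, which real deployments only approximate.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; leap years ignored for simplicity


def downtime_budget_minutes(availability_percent: float) -> float:
    """Return the yearly downtime allowed by an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


def serial_availability(*components: float) -> float:
    """Availability of a chain where every component must be up (fractions, e.g. 0.999)."""
    return math.prod(components)


def parallel_availability(*replicas: float) -> float:
    """Availability of redundant replicas where any one suffices.

    Assumes independent failures, so correlated outages (shared power,
    shared region, bad deploy) make the real figure lower.
    """
    return 1 - math.prod(1 - a for a in replicas)


for target in (99.9, 99.99, 99.999):
    print(f"{target}% -> {downtime_budget_minutes(target):.1f} minutes of downtime per year")
```

Three nines leaves roughly 8.8 hours of downtime per year, while five nines leaves a little over 5 minutes, which is why the higher targets generally rule out manual failover. Under the independence assumption, two replicas at 99 percent each combine to about 99.99 percent, whereas two 99.9 percent components in series drop to about 99.8 percent.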

Medium · Technical

65 practiced

As a Systems Administrator, design a disaster recovery testing program for a global mid-size company. Define types of tests (tabletop, partial failover, full failover), frequency for each, success criteria and KPIs, rollback plans, and post-drill reporting and remediation processes.

Medium · Technical

78 practiced

Design network redundancy for an on-prem data center that connects to two ISPs and a cloud provider. Cover BGP basics for multi-homing, failover detection mechanisms, traffic engineering to prefer one path over another, and DDoS mitigation considerations during ISP failover.

Easy · Technical

74 practiced

Explain the trade-offs between implementing high availability within the same region (multi-AZ) versus cross-region disaster recovery. Consider latency, cost, compliance, data sovereignty, and operational complexity, and give one scenario where cross-region DR is mandatory.

Easy · Technical

74 practiced

Describe common health-check types (TCP connect, HTTP status probe, script/command) and design a practical health-check strategy for a critical internal API that must remain available across two availability zones. Include check frequency, retry thresholds, and the action to take when checks fail.
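One way to reason about the retry thresholds in this question is a small state tracker that flips a target between healthy and unhealthy only after a run of consecutive contrary results. The sketch below is an assumption-laden illustration: `HealthTracker`, its default thresholds, and the two probe helpers are hypothetical names, and the probes use only Python standard library calls.

```python
import http.client
import socket
import urllib.request
from dataclasses import dataclass


def tcp_check(host: str, port: int, timeout: float = 2.0) -> bool:
    """TCP connect probe: passes if the port accepts a connection."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def http_check(url: str, timeout: float = 2.0) -> bool:
    """HTTP status probe: passes only on a 2xx response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (OSError, ValueError, http.client.HTTPException):
        # URLError and HTTPError (raised for 4xx/5xx) are OSError subclasses.
        return False


@dataclass
class HealthTracker:
    """Flip state only after N consecutive results that contradict it."""
    unhealthy_threshold: int = 3  # failures in a row before removal (assumed value)
    healthy_threshold: int = 2    # successes in a row before re-adding (assumed value)
    healthy: bool = True
    _streak: int = 0

    def record(self, ok: bool) -> bool:
        """Record one probe result and return the current health state."""
        if ok == self.healthy:
            self._streak = 0  # result agrees with current state: reset the streak
            return self.healthy
        self._streak += 1
        limit = self.healthy_threshold if ok else self.unhealthy_threshold
        if self._streak >= limit:
            self.healthy = ok
            self._streak = 0
        return self.healthy
```

In use, a scheduler would call a probe at a fixed interval (say every few seconds), feed the result to `record`, and remove the instance from the load balancer pool when the returned state turns unhealthy. Requiring several consecutive failures trades slower detection for fewer false removals on a single dropped probe.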

Hard · System Design

81 practiced

Design disaster recovery for a central identity provider and authentication service used across many applications. Ensure tokens remain verifiable after failover, user sessions are preserved or gracefully invalidated, key material is protected and recoverable, and logging/audit trails persist. Include offline key escrow and compliance considerations.
