High Availability and Disaster Recovery Questions

This topic covers designing systems that remain available and recoverable in the face of infrastructure failures, outages, and disasters. Candidates should be able to define and reason about Recovery Time Objective (RTO) and Recovery Point Objective (RPO) targets and translate service level agreement goals, from 99.9 percent up to 99.999 percent, into architecture choices. Core topics include redundancy strategies such as N+1 and N+2, active-active and active-passive deployment patterns, multi-availability-zone and multi-region topologies, and the trade-offs between same-region high availability and cross-region disaster recovery. Discuss load balancing and traffic shaping, redundant load balancer design, and algorithms such as round robin, least connections, and consistent hashing. Explain failover detection, health checks, automated versus manual failover, convergence and recovery timing, and the orchestration of failover and rerouting. Cover backup, snapshot, and restore strategies, replication and consistency trade-offs for stateful components, leader election and split-brain mitigation, runbooks and recovery playbooks, disaster recovery testing and drills, and cost and operational trade-offs. Include capacity planning, autoscaling, network redundancy, and security and infrastructure hardening, so that identity, key management, and logging themselves remain available and recoverable. Emphasize monitoring, observability, alerting on availability signals, and validation through chaos engineering and regular failover exercises.
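As a starting point for translating availability targets into architecture choices, it can help to convert each target into a yearly downtime budget. The sketch below is a minimal illustration; the function name is hypothetical, and it assumes a 365-day year and counts all downtime (planned and unplanned) against the budget.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; assumes a non-leap year


def downtime_budget_minutes(availability_percent: float) -> float:
    """Return the allowed downtime per year, in minutes, for an availability target."""
    if not 0 < availability_percent <= 100:
        raise ValueError("availability must be in (0, 100]")
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.9, 99.99, 99.999):
    print(f"{target}% -> {downtime_budget_minutes(target):.2f} min/year")
```

Under these assumptions, 99.9 percent allows about 525.6 minutes (roughly 8.8 hours) per year, while 99.999 percent allows about 5.26 minutes, which generally rules out any recovery path that depends on a human responding to a page.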

Medium | Technical

68 practiced

Describe leader election techniques used by distributed systems (consensus algorithms, leases, ZooKeeper/etcd). Explain split-brain and present at least three mitigation strategies (quorum-based, fencing, lease expirations). When is fencing necessary versus quorum-only approaches?
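One way to illustrate fencing in an answer is with fencing tokens: the lock service hands out a strictly increasing number with every lease, and the protected resource refuses any write that carries an older number. The sketch below is a toy, in-memory model; `LockService` and `FencedStore` are hypothetical names, not the API of ZooKeeper, etcd, or any other real system.

```python
class LockService:
    """Hypothetical lock service: issues a strictly increasing token per lease."""

    def __init__(self) -> None:
        self._next_token = 0

    def acquire(self) -> int:
        self._next_token += 1
        return self._next_token


class FencedStore:
    """Toy storage resource that rejects writes carrying a stale fencing token."""

    def __init__(self) -> None:
        self.highest_token = 0
        self.data = {}

    def write(self, token: int, key: str, value: str) -> bool:
        # A token lower than one already seen means the writer's lease
        # has been superseded, so the write is refused.
        if token < self.highest_token:
            return False
        self.highest_token = token
        self.data[key] = value
        return True


locks = LockService()
store = FencedStore()

old_leader = locks.acquire()  # first lease
new_leader = locks.acquire()  # issued after the first lease expired

print(store.write(new_leader, "config", "v2"))  # accepted
print(store.write(old_leader, "config", "v1"))  # refused: stale token
```

The point of the sketch is that a paused or partitioned old leader may still believe it holds the lease. Quorum alone decides who the leader is, but only a check at the resource itself stops the old leader's late writes from landing.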

Medium | Technical

69 practiced

Create a disaster recovery testing plan that includes tabletop exercises, partial failover drills, and full-region failover tests. For each type, specify frequency, stakeholders, success criteria, rollback processes, and how results feed into continuous improvement of runbooks and automation.

Easy | Technical

74 practiced

Explain DNS-based failover and how DNS TTL affects failover convergence time. For an authoritative TTL of 300 seconds, estimate how long clients may still route to a failed endpoint and discuss strategies (health checks, global load balancers, Anycast, CDN) to reduce client-visible downtime.
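A rough way to frame the estimate: a client can keep using a cached record for up to one full TTL after the record changes, and the record only changes once the failure has been detected. The sketch below adds those pieces together. The health-check interval and failure threshold are illustrative assumptions, not values from any particular DNS provider, and the model ignores resolvers or clients that cache longer than the TTL.

```python
def worst_case_stale_seconds(
    ttl_s: int,
    check_interval_s: int,
    failures_to_trip: int,
    propagation_s: int = 0,
) -> int:
    """Estimate the worst-case time clients may still reach a failed endpoint.

    detection   = consecutive failed health checks needed to mark the endpoint down
    propagation = time for the updated record to reach authoritative servers
    ttl         = a resolver may have cached the old record just before the change
    """
    detection_s = check_interval_s * failures_to_trip
    return detection_s + propagation_s + ttl_s


# Assumed example: 300 s TTL, health check every 30 s, 3 failures to trip.
print(worst_case_stale_seconds(ttl_s=300, check_interval_s=30, failures_to_trip=3))
```

With these assumed values the estimate is 390 seconds, so the TTL dominates. That is why lowering the TTL, or moving failover behind an anycast address or global load balancer so no record change is needed, tends to matter more than faster health checks.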

Hard | Technical

71 practiced

Provide a high-level 'runbook-as-code' blueprint (YAML or pseudocode) that automates a critical failover: 1) promote the DB replica, 2) update load-balancer target groups, 3) update DNS, 4) run smoke tests, 5) roll back if the smoke tests fail. Focus on idempotency, validations at each step, and safe rollbacks; include sample commands or tool choices (Terraform, Ansible, AWS CLI) you would use.
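The control flow of such a runbook can be sketched independently of the tooling. In the minimal Python skeleton below, each step carries an idempotency check, an action, and an undo; the `Step` and `run_failover` names are hypothetical, and the demo steps only mutate an in-memory dictionary rather than calling Terraform, Ansible, or the AWS CLI.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    is_done: Callable[[], bool]  # idempotency check and post-step validation
    apply: Callable[[], None]
    undo: Callable[[], None]     # must be safe to call after a partial apply


def run_failover(steps: List[Step], smoke_test: Callable[[], bool]) -> bool:
    """Run steps in order; undo the attempted steps in reverse on any failure."""
    attempted: List[Step] = []
    try:
        for step in steps:
            if step.is_done():
                continue  # already in the target state, so skip (idempotent re-run)
            attempted.append(step)
            step.apply()
            if not step.is_done():
                raise RuntimeError(f"validation failed after: {step.name}")
        if not smoke_test():
            raise RuntimeError("smoke test failed")
        return True
    except Exception:
        for step in reversed(attempted):
            step.undo()
        return False


# Demo with stand-in state; "a" is the failed site, "b" the standby.
state = {"db_primary": "a", "lb_target": "a", "dns": "a"}


def make_step(key: str) -> Step:
    return Step(
        name=f"switch {key}",
        is_done=lambda: state[key] == "b",
        apply=lambda: state.__setitem__(key, "b"),
        undo=lambda: state.__setitem__(key, "a"),
    )


steps = [make_step(k) for k in ("db_primary", "lb_target", "dns")]
print(run_failover(steps, smoke_test=lambda: False), state)  # rolled back
print(run_failover(steps, smoke_test=lambda: True), state)   # failed over
```

One caveat worth raising in an answer: promoting a database replica is often not cleanly reversible, so a real `undo` for that step may mean rebuilding the old primary as a replica rather than flipping a value back.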

Hard | System Design

74 practiced

You run a service composed of a Redis cache, a PostgreSQL primary, and S3-compatible object storage. Design a disaster recovery plan that minimizes RTO while controlling costs. Discuss replication, cache warming, failover ordering (which systems to restore first), warm vs. hot standby for the DB and cache, and verification steps to ensure cross-store consistency.
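Failover ordering can be reasoned about as a dependency graph and resolved with a topological sort. The sketch below uses Python's standard `graphlib` module (Python 3.9+). The step names and the dependencies between them are assumptions made for illustration; a real plan would derive them from how the service actually reads and writes each store.

```python
from graphlib import TopologicalSorter

# Each key maps to the set of steps that must finish before it can start.
# These dependencies are illustrative assumptions, not a prescribed order.
DEPENDS_ON = {
    "object_storage": set(),
    "postgres_promote": set(),
    "redis_start": set(),
    "cache_warm": {"postgres_promote", "redis_start"},
    "app_traffic": {"postgres_promote", "object_storage", "cache_warm"},
}


def restore_order(depends_on: dict) -> list:
    """Return one valid restore order that respects every dependency."""
    return list(TopologicalSorter(depends_on).static_order())


print(restore_order(DEPENDS_ON))
```

Steps with no dependency between them (here the three stores) can run in parallel, which is where most of the RTO savings come from. Whether `app_traffic` should wait for `cache_warm` is itself a cost and RTO trade-off: serving with a cold cache restores traffic sooner but shifts load onto the freshly promoted database.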
