High Availability and Disaster Recovery Questions

This topic covers designing systems that remain available and recoverable in the face of infrastructure failures, outages, and disasters. Candidates should be able to define and reason about Recovery Time Objective (RTO) and Recovery Point Objective (RPO) targets and translate service level agreement (SLA) goals, from 99.9 percent up to 99.999 percent, into architecture choices. Core topics include redundancy strategies such as N+1 and N+2, active-active and active-passive deployment patterns, multi-availability-zone and multi-region topologies, and the trade-offs between same-region high availability and cross-region disaster recovery. Discuss load balancing and traffic shaping, redundant load balancer design, and algorithms such as round robin, least connections, and consistent hashing. Explain failover detection, health checks, automated versus manual failover, convergence and recovery timing, and the orchestration of failover and rerouting. Cover backup, snapshot, and restore strategies, replication and consistency trade-offs for stateful components, leader election and split-brain mitigation, runbooks and recovery playbooks, disaster recovery testing and drills, and cost and operational trade-offs. Include capacity planning, autoscaling, network redundancy, and security and infrastructure hardening considerations, so that identity, key management, and logging remain available and recoverable. Emphasize monitoring, observability, alerting on availability signals, and validation through chaos engineering and regular failover exercises.
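To make the step from an availability percentage to an architecture choice concrete, it helps to convert the target into a downtime budget. The sketch below is illustrative only: the function name `downtime_budget_minutes` and the 365-day year are assumptions, not part of any standard or SLA definition.

```python
def downtime_budget_minutes(availability_percent: float,
                            period_days: float = 365.0) -> float:
    """Return the maximum downtime, in minutes, allowed over the period.

    Assumption: the period is a plain 365-day year unless overridden;
    real SLAs often measure per month and exclude planned maintenance.
    """
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1.0 - availability_percent / 100.0)


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        budget = downtime_budget_minutes(target)
        print(f"{target}% -> {budget:.2f} minutes of downtime per year")
```

At 99.9 percent the budget is roughly 8.76 hours per year, while at 99.999 percent it shrinks to about 5.26 minutes. That gap is usually why manual failover can be acceptable for the lower target but not for the higher one.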

Hard | Technical

A region suffers both compute and network failure, and your automated failover does not execute. As incident commander, provide your playbook steps for immediate containment, customer communications, cross-team triage, temporary mitigations to restore partial service, and post-incident actions including runbook updates.

Easy | Technical

Compare active-active and active-passive deployment patterns. For each pattern list typical failure modes, how traffic and reads/writes are routed, and give a short example for a web application with stateless frontends and a stateful backing store.

Medium | Technical

Compare synchronous versus asynchronous replication for stateful services. For each approach discuss effects on RPO, RTO, write latency, throughput, and failure scenarios that could cause data loss or unavailability.

Hard | Technical

Discuss trade-offs between synchronous cross-region replication and asynchronous replication with near-synchronous techniques (for example semi-synchronous or quorum-based writes). Focus on latency impact, throughput, RPO/RTO implications, operational complexity, and costs.
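As a starting point for the quorum-based option, the usual overlap rule can be sketched as follows. This is a minimal illustration, assuming N replicas, a write quorum of W acknowledgements, and a read quorum of R replies; the function names are hypothetical and do not come from any particular database.

```python
def quorums_overlap(n: int, w: int, r: int) -> bool:
    """True if every read quorum shares at least one replica with every write quorum.

    Assumption: the classic rule w + r > n for n replicas.
    """
    if n < 1 or not (1 <= w <= n) or not (1 <= r <= n):
        raise ValueError("require n >= 1 and 1 <= w, r <= n")
    return w + r > n


def write_failures_tolerated(n: int, w: int) -> int:
    """Number of replicas that can be down while writes still reach a quorum."""
    if n < 1 or not (1 <= w <= n):
        raise ValueError("require n >= 1 and 1 <= w <= n")
    return n - w


if __name__ == "__main__":
    # Five replicas with majority writes and majority reads.
    print(quorums_overlap(5, 3, 3), write_failures_tolerated(5, 3))
    # Three replicas with single-replica writes and reads: no overlap guarantee.
    print(quorums_overlap(3, 1, 1), write_failures_tolerated(3, 1))
```

A larger W narrows the window for data loss (better RPO), but each write then waits for the slowest of W acknowledgements, which across regions typically means at least one inter-region round trip.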

Hard | System Design

Architect a globally distributed stateful service that must achieve 99.999% availability and an RPO under 1 second. Describe region placement, replication topology (sync/async/quorum), consistency model, leader election approach, split-brain prevention, and estimate expected cost and operational complexity.
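A rough first estimate for region placement is how many redundant regions are needed to reach the target. The sketch below assumes that regions fail independently and that any single healthy region can serve all traffic. Both assumptions are optimistic in practice (correlated failures, shared control planes, failover time), so treat the result as an upper bound rather than a prediction. The function names are illustrative.

```python
def combined_availability(region_availability: float, regions: int) -> float:
    """Availability of a service that is up when at least one region is up.

    Assumption: region failures are statistically independent.
    """
    if not 0.0 < region_availability < 1.0:
        raise ValueError("region_availability must be strictly between 0 and 1")
    if regions < 1:
        raise ValueError("regions must be at least 1")
    return 1.0 - (1.0 - region_availability) ** regions


def regions_needed(region_availability: float, target: float) -> int:
    """Smallest region count whose combined availability meets the target."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    regions = 1
    while combined_availability(region_availability, regions) < target:
        regions += 1
    return regions


if __name__ == "__main__":
    # Hypothetical inputs: each region is 99.9% available, target is 99.999%.
    print(regions_needed(0.999, 0.99999))
    print(combined_availability(0.999, 2))
```

Under these assumptions, two regions at 99.9 percent each already exceed 99.999 percent on paper. The remaining design effort then goes into the parts this model ignores: failover detection time, replication lag against the 1-second RPO, and split-brain prevention.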
