High Availability and Disaster Recovery Questions

This topic covers designing systems that remain available and recoverable in the face of infrastructure failures, outages, and disasters. Candidates should be able to define and reason about Recovery Time Objective (RTO) and Recovery Point Objective (RPO) targets and translate service level agreement goals, from 99.9 percent up to 99.999 percent availability, into architecture choices. Core topics include redundancy strategies such as N+1 and N+2, active-active and active-passive deployment patterns, multi-availability-zone and multi-region topologies, and the trade-offs between same-region high availability and cross-region disaster recovery. Discuss load balancing and traffic shaping, redundant load balancer design, and algorithms such as round robin, least connections, and consistent hashing. Explain failover detection, health checks, automated versus manual failover, convergence and recovery timing, and orchestration of failover and rerouting. Cover backup, snapshot, and restore strategies, replication and consistency trade-offs for stateful components, leader election and split-brain mitigation, runbooks and recovery playbooks, disaster recovery testing and drills, and cost and operational trade-offs. Include capacity planning, autoscaling, network redundancy, and security and infrastructure hardening, so that identity, key management, and logging remain available and recoverable. Emphasize monitoring, observability, alerting on availability signals, and validation through chaos engineering and regular failover exercises.
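
To make the step from an SLA percentage to an architecture choice concrete, the arithmetic can be sketched as below. This is a minimal illustration, not a sizing tool: the function names are hypothetical, a 365-day year is assumed, and the redundancy formula assumes failures are independent, which rarely holds exactly in practice.

```python
# Downtime budgets and simple availability composition.
# Assumptions: 365-day year, independent component failures.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Minutes of downtime per year allowed by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


def parallel_availability(single: float, copies: int) -> float:
    """Availability of redundant copies when any one copy is sufficient."""
    return 1 - (1 - single) ** copies


def serial_availability(*components: float) -> float:
    """Availability of a chain in which every component must be up."""
    result = 1.0
    for availability in components:
        result *= availability
    return result


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        minutes = downtime_minutes_per_year(target)
        print(f"{target}% allows about {minutes:.1f} minutes of downtime per year")
```

Under these assumptions, 99.9 percent allows roughly 8.8 hours of downtime per year, 99.99 percent roughly 53 minutes, and 99.999 percent roughly 5 minutes, which is why the higher targets generally rule out manual failover.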

Medium · System Design

Describe how to build an automated failover orchestration system for network components and services. Cover failure detection, decision logic, runbook automation, circuit breakers, safe rollback, human-in-the-loop escalation, and measures to prevent cascading failures. Explain how to make the orchestration idempotent and auditable.
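
One possible shape for an answer is sketched below. It is a simplified, in-memory illustration under stated assumptions: `Step`, `Orchestrator`, and the limit of two failovers per hour are hypothetical names and values, and a real system would persist the audit log and page an on-call engineer instead of returning a string.

```python
import json
import time
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Step:
    name: str
    is_done: Callable[[], bool]    # idempotency check: is the desired state already reached?
    apply: Callable[[], None]      # action that moves the system to the desired state
    rollback: Callable[[], None]   # action that undoes apply


@dataclass
class Orchestrator:
    max_failovers_per_hour: int = 2                 # circuit breaker threshold (assumed value)
    audit_log: List[str] = field(default_factory=list)
    _recent: List[float] = field(default_factory=list)

    def _audit(self, event: str, **details) -> None:
        # Append-only structured record of every decision, for later review.
        record = {"ts": time.time(), "event": event, **details}
        self.audit_log.append(json.dumps(record, sort_keys=True))

    def run(self, runbook_id: str, steps: List[Step], now: float = None) -> str:
        now = time.time() if now is None else now
        # Circuit breaker: repeated failovers suggest flapping, so stop and escalate.
        self._recent = [t for t in self._recent if now - t < 3600]
        if len(self._recent) >= self.max_failovers_per_hour:
            self._audit("escalate_to_human", runbook=runbook_id, reason="circuit breaker open")
            return "escalated"
        self._recent.append(now)

        completed: List[Step] = []
        for step in steps:
            if step.is_done():
                # Re-running the runbook is safe: finished steps are skipped.
                self._audit("skip", runbook=runbook_id, step=step.name)
                continue
            try:
                step.apply()
                completed.append(step)
                self._audit("applied", runbook=runbook_id, step=step.name)
            except Exception as exc:
                self._audit("failed", runbook=runbook_id, step=step.name, error=str(exc))
                # Safe rollback: undo this run's steps in reverse order.
                for done in reversed(completed):
                    done.rollback()
                    self._audit("rolled_back", runbook=runbook_id, step=done.name)
                self._audit("escalate_to_human", runbook=runbook_id, reason="step failed")
                return "rolled_back"
        return "completed"
```

Checking `is_done` before each action is what makes a retried runbook idempotent, and the rate limit acts as a crude circuit breaker that hands control to a human when failovers repeat too quickly.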

Medium · Technical

Design a campus network redundancy plan using link aggregation (LACP), diverse physical paths, and routing protocols (like OSPF) to minimize downtime when a switch or link fails. Explain how MLAG or stacked designs affect redundancy, where single points of failure remain, and how to test failover safely in production.
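
To reason about where single points of failure remain, one option is to model the topology as a graph and remove each device in turn. The sketch below assumes a small hypothetical campus (two cores, two distribution switches, one dual-homed and one single-homed access switch); the device names and helper functions are illustrative only, and link failures could be checked the same way by removing edges.

```python
from collections import deque


def is_connected(adjacency, removed=frozenset()):
    """Return True if all devices not in `removed` can still reach each other."""
    nodes = [n for n in adjacency if n not in removed]
    if not nodes:
        return True
    seen = {nodes[0]}
    queue = deque([nodes[0]])
    while queue:
        current = queue.popleft()
        for neighbor in adjacency[current]:
            if neighbor not in removed and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return len(seen) == len(nodes)


def single_points_of_failure(adjacency):
    """Devices whose loss partitions the remaining network."""
    return sorted(n for n in adjacency if not is_connected(adjacency, frozenset({n})))


# Hypothetical topology: access2 is single-homed to dist1.
campus = {
    "core1": ["core2", "dist1", "dist2"],
    "core2": ["core1", "dist1", "dist2"],
    "dist1": ["core1", "core2", "access1", "access2"],
    "dist2": ["core1", "core2", "access1"],
    "access1": ["dist1", "dist2"],
    "access2": ["dist1"],
}

if __name__ == "__main__":
    print(single_points_of_failure(campus))
```

In this example only `dist1` is flagged, because the single-homed access switch depends on it. The check covers graph connectivity only; shared power, shared conduits, and a shared control plane in a stack would need to be modeled separately.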

Easy · Technical

Explain what leader election is and why it is important for distributed network controllers (for example SDN controllers, HA load-balancer controllers). Describe two leader election approaches (for example: Raft-style consensus and simple heartbeat+priority with fencing) and the high-level trade-offs in terms of complexity, convergence time, and safety.
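
The simpler of the two approaches, heartbeat plus priority with fencing, can be sketched as follows. This is a toy model under stated assumptions: `Node`, `elect_leader`, `FencedResource`, and the 3-second timeout are hypothetical, and the fencing token is assumed to be an epoch number issued by some trusted source each time leadership changes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    priority: int
    last_heartbeat: float  # time the last heartbeat was received, in seconds


def elect_leader(nodes: List[Node], now: float, timeout: float = 3.0) -> Optional[Node]:
    """Pick the highest-priority node whose heartbeat is still fresh."""
    alive = [n for n in nodes if now - n.last_heartbeat <= timeout]
    if not alive:
        return None
    # Ties on priority are broken by name so the result is deterministic.
    return max(alive, key=lambda n: (n.priority, n.name))


class FencedResource:
    """Shared resource that rejects writes carrying a stale fencing token."""

    def __init__(self) -> None:
        self.highest_token = 0
        self.value = None

    def write(self, token: int, value: str) -> bool:
        if token < self.highest_token:
            return False  # an old leader that missed its own demotion is fenced off
        self.highest_token = token
        self.value = value
        return True
```

Heartbeat election alone cannot tell a dead leader from a partitioned one, so the fencing check on the resource side is what provides safety here. A Raft-style protocol builds that guarantee into the election itself through terms and majority votes, at the cost of more complexity.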

Hard · Technical

Case study: Your primary cloud region suffers a power and network outage that lasts six hours. You are the on-call network engineer. Describe your incident response in detail: initial triage steps, invoking the DR runbook, failing over traffic, ensuring critical services remain secure, validating the integrity of replicated state, planning the rollback for when the primary returns, stakeholder communication, and post-incident RCA and remediation items. Specify the metrics you would report during and after the incident.
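
Two of the metrics worth reporting are the achieved RTO and RPO, compared against their targets. A minimal sketch of that calculation follows; the function names and example timestamps are hypothetical and not taken from any real incident.

```python
from datetime import datetime, timedelta


def achieved_rto(outage_start: datetime, service_restored: datetime) -> timedelta:
    """Time from the start of the outage until service was restored in the DR region."""
    return service_restored - outage_start


def achieved_rpo(last_replicated_write: datetime, outage_start: datetime) -> timedelta:
    """Window of writes that never reached the DR region (zero if replication was current)."""
    return max(outage_start - last_replicated_write, timedelta(0))


def meets_target(actual: timedelta, target: timedelta) -> bool:
    return actual <= target


if __name__ == "__main__":
    # Hypothetical incident timeline.
    outage_start = datetime(2024, 1, 1, 12, 0)
    last_replicated = outage_start - timedelta(seconds=40)
    restored = outage_start + timedelta(minutes=45)

    rto = achieved_rto(outage_start, restored)
    rpo = achieved_rpo(last_replicated, outage_start)
    print("RTO", rto, "within target:", meets_target(rto, timedelta(hours=1)))
    print("RPO", rpo, "within target:", meets_target(rpo, timedelta(minutes=5)))
```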

Hard · Technical

Given a consistent hashing ring configured with virtual nodes and a replication factor of 3 across 200 cache nodes, analyze the impact of losing 5% of the nodes on key movement, cache miss amplification, and rebalance traffic. Provide formulas or approximations for the expected fraction of keys that move and the expected increase in miss rate, and describe mitigation strategies that reduce client impact during the rebalance.
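
A back-of-the-envelope model for this question is sketched below. It assumes that enough virtual nodes are used that each key's replica set behaves like a uniformly random set of distinct physical nodes, that node failures are independent of key placement, and that a key whose serving replica is lost is a cold miss until refilled. The function names are hypothetical.

```python
from math import comb


def replica_impact(total_nodes: int, failed_nodes: int, replication_factor: int) -> dict:
    """Approximate fractions of keys affected when `failed_nodes` are lost."""
    n, f, r = total_nodes, failed_nodes, replication_factor
    return {
        # Keys whose primary owner failed and must move to the next node on the ring.
        "primary_moved": f / n,
        # Keys that lost at least one of their r replicas.
        "any_replica_lost": 1 - comb(n - f, r) / comb(n, r),
        # Keys that lost every replica at once.
        "all_replicas_lost": comb(f, r) / comb(n, r),
        # Share of all stored copies that must be rebuilt during rebalance.
        "copies_to_rebuild": f / n,
    }


def miss_rate_after(miss_before: float, cold_fraction: float) -> float:
    """Miss rate once a fraction of previously cached keys has gone cold."""
    return miss_before + (1 - miss_before) * cold_fraction


if __name__ == "__main__":
    impact = replica_impact(total_nodes=200, failed_nodes=10, replication_factor=3)
    for name, value in impact.items():
        print(f"{name}: {value:.5f}")

    # Reads served only from the primary, 95% hit rate before the failure.
    before = 0.05
    after = miss_rate_after(before, impact["primary_moved"])
    print(f"backend load multiplier: {after / before:.2f}")
```

Under these assumptions, losing 10 of 200 nodes moves the primary for about 5% of keys, touches at least one replica for roughly 14% of keys, and removes all three replicas for only about 0.009% of keys. If reads can fall back to a surviving replica that already holds the key, the cold fraction is closer to the last figure than the first, which is the main argument for replica-aware reads as a mitigation.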
