InterviewStack.io

High Availability and Disaster Recovery Questions

This topic covers designing systems that remain available and recoverable in the face of infrastructure failures, outages, and disasters. Candidates should be able to define and reason about Recovery Time Objective (RTO) and Recovery Point Objective (RPO) targets and translate service level agreement goals, from 99.9 percent to 99.999 percent availability, into architecture choices. Core topics include redundancy strategies such as N+1 and N+2, active-active and active-passive deployment patterns, multi-availability-zone and multi-region topologies, and the trade-offs between same-region high availability and cross-region disaster recovery. Discuss load balancing and traffic shaping, redundant load balancer design, and algorithms such as round robin, least connections, and consistent hashing. Explain failover detection, health checks, automated versus manual failover, convergence and recovery timing, and orchestration of failover and rerouting. Cover backup, snapshot, and restore strategies, replication and consistency trade-offs for stateful components, leader election and split-brain mitigation, runbooks and recovery playbooks, disaster recovery testing and drills, and cost and operational trade-offs. Include capacity planning, autoscaling, network redundancy, and security and infrastructure hardening considerations so that identity, key management, and logging remain available and recoverable. Emphasize monitoring, observability, alerting on availability signals, and validation through chaos engineering and regular failover exercises.

Hard · Technical
93 practiced
Case study: Your primary cloud region suffers a power and network outage lasting six hours. You are the on-call network engineer. Describe your incident response in detail: initial triage steps, invoking the DR runbook, failing over traffic, ensuring critical services remain secure, validating the integrity of replicated state, planning the failback when the primary region returns, stakeholder communication, and post-incident RCA and remediation items. Specify the metrics you would report during and after the incident.
Medium · System Design
92 practiced
Design a multi-region disaster recovery plan for a customer-facing web application with primary in us-east-1 and warm standby in eu-west-1. Requirement: RTO 1 hour, RPO 5 minutes. Describe replication approach, networking and routing choices (DNS failover, BGP anycast, or traffic manager), certificate/key distribution, health checks, runbook steps during failover, and how session continuity or graceful degradation would be handled.
Medium · Technical
93 practiced
Design BGP multihoming for a data center to provide high-availability Internet connectivity. Discuss ASN selection, prefix announcement strategy, traffic steering with longer AS paths (prepending) or local-pref, use of communities, BFD and keepalive timers for rapid failure detection, multipath considerations, and coordination and negotiation points with upstream ISPs. Also explain how to avoid route flapping and minimize customer impact during provider failures.
Easy · Technical
70 practiced
Given availability targets 99.9%, 99.99%, and 99.999%, calculate the allowed annual and monthly downtime for each. Then describe specific network and infrastructure architecture choices and redundancy strategies you would employ to meet each target (examples: single-region multi-AZ N+1, active-active multi-region with Anycast or BGP), and outline the main cost and operational trade-offs for each availability tier.
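The downtime arithmetic in this question can be checked with a short script. This is a minimal sketch, not a reference answer: the function name `allowed_downtime_minutes` is hypothetical, a year is assumed to be 365 days, and a month is assumed to be 30 days. Contractual SLAs may define the measurement period differently.

```python
from decimal import Decimal

# Assumed period lengths (not from the source): 365-day year, 30-day month.
MINUTES_PER_YEAR = 365 * 24 * 60    # 525,600 minutes
MINUTES_PER_MONTH = 30 * 24 * 60    # 43,200 minutes


def allowed_downtime_minutes(availability_percent: str, period_minutes: int) -> Decimal:
    """Return the downtime budget in minutes for an availability target.

    The target is passed as a string (for example "99.99") so that Decimal
    keeps it exact and avoids binary floating-point rounding noise.
    """
    unavailable_fraction = (Decimal(100) - Decimal(availability_percent)) / Decimal(100)
    return unavailable_fraction * period_minutes


if __name__ == "__main__":
    for target in ("99.9", "99.99", "99.999"):
        yearly = allowed_downtime_minutes(target, MINUTES_PER_YEAR)
        monthly = allowed_downtime_minutes(target, MINUTES_PER_MONTH)
        print(f"{target}%: {yearly:.3f} min/year, {monthly:.3f} min/month")
```

Under these assumptions, 99.9% allows 525.6 minutes per year (about 8.76 hours) and 43.2 minutes per month. Each additional nine divides the budget by ten, so 99.999% leaves roughly 5.26 minutes per year, which is why that tier generally rules out manual failover.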
Hard · System Design
88 practiced
Choose and tune a leader election mechanism for network controllers deployed across continents with high inter-region latencies. Compare Raft and Paxos (or an existing etcd-based approach) in terms of leader stability, election time, write latency, and complexity. Propose configuration tuning (timeouts, election backoff), and architecture patterns (regional leaders plus global coordination) to balance availability and consistency.
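One way to reason about the timeout tuning in this question is to derive starting values from measured round-trip times between controller sites. The sketch below is illustrative only: the helper `suggest_raft_timeouts` and its default multipliers are assumptions, loosely modeled on the common rule of thumb of a heartbeat interval near the worst inter-member RTT and an election timeout about ten times that RTT. Real values should be validated against the chosen implementation's documentation and under load.

```python
from typing import Sequence, Tuple


def suggest_raft_timeouts(
    rtts_ms: Sequence[float],
    heartbeat_factor: float = 1.0,   # assumed multiplier, not a standard
    election_factor: float = 10.0,   # assumed multiplier, not a standard
) -> Tuple[int, int]:
    """Suggest (heartbeat_ms, election_timeout_ms) from inter-member RTTs.

    Uses the worst observed RTT so that the slowest cross-continent link
    does not trigger spurious elections.
    """
    if not rtts_ms:
        raise ValueError("need at least one RTT measurement")
    worst_rtt = max(rtts_ms)
    heartbeat_ms = max(1, round(worst_rtt * heartbeat_factor))
    # The election timeout must exceed the heartbeat interval, or followers
    # would time out before a healthy leader's next heartbeat arrives.
    election_ms = max(heartbeat_ms + 1, round(worst_rtt * election_factor))
    return heartbeat_ms, election_ms


if __name__ == "__main__":
    # Hypothetical RTTs in milliseconds between three continental sites.
    heartbeat, election = suggest_raft_timeouts([80.0, 150.0, 220.0])
    print(f"heartbeat={heartbeat} ms, election timeout={election} ms")
```

With a 220 ms worst-case RTT, these defaults give an election timeout of more than two seconds. That cost in failover time is one argument for the regional-leaders pattern the question mentions, where each consensus group stays within a low-latency region.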
