InterviewStack.io

High Availability and Disaster Recovery Questions

Designing systems to remain available and recoverable in the face of infrastructure failures, outages, and disasters. Candidates should be able to define and reason about Recovery Time Objective (RTO) and Recovery Point Objective (RPO) targets, and translate service-level agreement goals ranging from 99.9 percent to 99.999 percent availability into architecture choices. Core topics include redundancy strategies such as N+1 and N+2, active-active and active-passive deployment patterns, multi-availability-zone and multi-region topologies, and the trade-offs between same-region high availability and cross-region disaster recovery. Discuss load balancing and traffic shaping, redundant load balancer design, and algorithms such as round robin, least connections, and consistent hashing. Explain failover detection, health checks, automated versus manual failover, convergence and recovery timing, and orchestration of failover and rerouting. Cover backup, snapshot, and restore strategies; replication and consistency trade-offs for stateful components; leader election and split-brain mitigation; runbooks and recovery playbooks; disaster recovery testing and drills; and cost and operational trade-offs. Include capacity planning, autoscaling, network redundancy, and security and infrastructure hardening considerations, so that identity, key management, and logging remain available and recoverable. Emphasize monitoring, observability, alerting on availability signals, and validation through chaos engineering and regular failover exercises.
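To make the step from an availability target to an architecture choice concrete, the arithmetic can be sketched as below. This is a minimal illustration, not a sizing method: the function names are hypothetical, a 365-day year is assumed, and the redundancy formula assumes replica failures are fully independent, which real deployments rarely achieve.

```python
# Assumption: a 365-day year, so leap years are ignored.
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_budget_minutes(availability_percent: float) -> float:
    """Allowed downtime per year, in minutes, for a given availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


def parallel_availability(single: float, replicas: int) -> float:
    """Availability of `replicas` redundant copies when any one copy suffices.

    Assumes independent failures; correlated failures (shared power,
    shared deploy pipeline, shared region) make the real figure lower.
    """
    return 1 - (1 - single) ** replicas


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        budget = downtime_budget_minutes(target)
        print(f"{target}% availability allows about {budget:.1f} minutes of downtime per year")
```

Under these assumptions, 99.9 percent allows roughly 526 minutes of downtime per year and 99.999 percent roughly 5 minutes, which is why the higher targets generally rule out manual failover. Similarly, two independent replicas that are each 99 percent available combine to about 99.99 percent.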

Medium · Technical
68 practiced
Compare synchronous versus asynchronous replication for stateful services. For each approach discuss effects on RPO, RTO, write latency, throughput, and failure scenarios that could cause data loss or unavailability.
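One way to reason about the RPO side of this comparison is a toy model of acknowledged writes. The sketch below is an in-memory illustration only: `ReplicatedLog` and its methods are hypothetical names, there is no real network or storage, and latency and throughput effects are not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class ReplicatedLog:
    """Toy primary/replica pair; records are acknowledged as soon as write() returns."""
    synchronous: bool
    primary: list = field(default_factory=list)
    replica: list = field(default_factory=list)
    pending: list = field(default_factory=list)  # acknowledged but not yet on the replica

    def write(self, record) -> None:
        self.primary.append(record)
        if self.synchronous:
            # Synchronous: the replica has the record before the client is acknowledged.
            self.replica.append(record)
        else:
            # Asynchronous: acknowledge now, ship to the replica later.
            self.pending.append(record)

    def ship(self, count: int) -> None:
        """Deliver up to `count` pending records to the replica."""
        batch = self.pending[:count]
        self.replica.extend(batch)
        del self.pending[:len(batch)]

    def fail_over(self) -> list:
        """Simulate losing the primary; return acknowledged records that are lost."""
        lost = list(self.pending)
        self.pending.clear()
        self.primary = list(self.replica)  # the replica is promoted
        return lost


if __name__ == "__main__":
    for mode in (True, False):
        log = ReplicatedLog(synchronous=mode)
        for record in ("w1", "w2", "w3"):
            log.write(record)
        log.ship(1)
        print("synchronous" if mode else "asynchronous", "lost on failover:", log.fail_over())
```

In this model the synchronous log loses nothing on failover, while the asynchronous log loses whatever was still pending, so its RPO is bounded by replication lag. The cost that the model leaves out is that a synchronous write cannot complete while the replica is slow or unreachable.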
Hard · System Design
122 practiced
Design network redundancy for cross-region traffic using multiple transit providers, BGP failover, and private links. Discuss detection and failover mechanisms, route convergence characteristics, how to test provider-level failover, and automation to orchestrate provider switchovers.
Hard · Technical
80 practiced
Explain split-brain detection and mitigation techniques including quorum-based decisions, fencing (for example STONITH), compare-and-swap leader leases, and external arbitration. For a disk-backed leader service, recommend a preferred approach and justify how it prevents stale-writer scenarios.
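As one possible illustration of how a lease combined with fencing can block a stale writer, the sketch below pairs a lease store that issues monotonically increasing tokens with a disk that rejects any token older than the newest it has seen. It is not a recommended answer or a production design: `LeaseStore` and `FencedDisk` are hypothetical in-process classes, time is passed in explicitly, and a real system would rely on an atomic compare-and-swap in a consensus-backed store.

```python
class LeaseStore:
    """Hypothetical lease service; each successful acquire returns a larger token."""

    def __init__(self):
        self.holder = None
        self.expires_at = 0.0
        self.token = 0

    def try_acquire(self, node: str, now: float, ttl: float):
        """Grant the lease if it is free, expired, or already held by `node`."""
        held_by_other = self.holder is not None and self.holder != node
        if held_by_other and now < self.expires_at:
            return None  # another node holds a live lease
        self.holder = node
        self.expires_at = now + ttl
        self.token += 1  # fencing token: strictly increasing across grants
        return self.token


class FencedDisk:
    """Hypothetical storage layer that enforces fencing tokens on every write."""

    def __init__(self):
        self.highest_token = 0
        self.data = {}

    def write(self, token: int, key: str, value: str) -> bool:
        if token < self.highest_token:
            return False  # stale leader: a newer token has already written
        self.highest_token = token
        self.data[key] = value
        return True


if __name__ == "__main__":
    leases, disk = LeaseStore(), FencedDisk()
    token_a = leases.try_acquire("node-a", now=0.0, ttl=10.0)
    # node-a stalls (for example a long pause) and its lease expires.
    token_b = leases.try_acquire("node-b", now=11.0, ttl=10.0)
    print("node-b write accepted:", disk.write(token_b, "k", "from-b"))
    print("node-a write accepted:", disk.write(token_a, "k", "from-a"))
```

The point of the design is that the check happens at the storage layer: node-a may still believe it is the leader after its lease expires, but its older token is refused once node-b has written with a newer one.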
Easy · Technical
80 practiced
Define chaos engineering and describe a simple, low-risk experiment you could run weekly in production for a stateless service behind a load balancer to validate high availability. Explain preconditions, metrics to monitor, and rollback criteria.
Hard · Technical
72 practiced
You are the author of a post-incident report for a major outage caused by a DNS misconfiguration during deployment that routed traffic to an empty pool. Outline the incident timeline, the analysis steps you would take to find the root cause (including logs and diffs), immediate mitigations, and long-term actions such as automation and guardrails.
