InterviewStack.io

High Availability and Disaster Recovery Questions

This topic covers designing systems that remain available and recoverable through infrastructure failures, outages, and disasters. Candidates should be able to define and reason about Recovery Time Objective (RTO) and Recovery Point Objective (RPO) targets and translate service level agreement (SLA) goals, from 99.9 percent to 99.999 percent availability, into architecture choices. Core topics include redundancy strategies such as N+1 and N+2, active-active and active-passive deployment patterns, multi-availability-zone and multi-region topologies, and the trade-offs between same-region high availability and cross-region disaster recovery. Be ready to discuss load balancing and traffic shaping, redundant load balancer design, and algorithms such as round robin, least connections, and consistent hashing. Explain failover detection, health checks, automated versus manual failover, convergence and recovery timing, and the orchestration of failover and rerouting. Cover backup, snapshot, and restore strategies; replication and consistency trade-offs for stateful components; leader election and split-brain mitigation; runbooks and recovery playbooks; disaster recovery testing and drills; and cost and operational trade-offs. Include capacity planning, autoscaling, network redundancy, and security and infrastructure hardening so that identity, key management, and logging remain available and recoverable. Emphasize monitoring, observability, alerting on availability signals, and validation through chaos engineering and regular failover exercises.
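To make the availability targets above concrete, it can help to convert a percentage into a yearly downtime budget. The following is a minimal sketch of that arithmetic; the function name `downtime_budget_minutes` and the 365-day year are assumptions made for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a 365-day year


def downtime_budget_minutes(availability_percent: float) -> float:
    """Return the downtime allowed per year, in minutes, for an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        print(f"{target}% -> {downtime_budget_minutes(target):.2f} min/year")
```

Under these assumptions, 99.9 percent allows roughly 8.8 hours of downtime per year, while 99.999 percent allows a little over 5 minutes, which is why the higher targets usually rule out manual failover.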

Hard | Technical
71 practiced
Provide a high-level 'runbook-as-code' blueprint (YAML or pseudocode) that automates a critical failover: 1) promote the DB replica, 2) update load-balancer target groups, 3) update DNS, 4) run smoke tests, 5) roll back if the smoke tests fail. Focus on idempotency, validations at each step, and safe rollbacks; include sample commands or tool choices (Terraform, Ansible, AWS CLI) you would use.
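One possible shape for such a runbook is sketched below in Python rather than YAML, so that the control flow is explicit. It is only an illustration under stated assumptions: the `Step` type, `run_runbook`, `build_failover`, and all resource names (`db-b`, `pool-b`, `b.example.internal`) are hypothetical, and the steps mutate an in-memory dictionary instead of calling real infrastructure tools.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Step:
    name: str
    is_done: Callable[[], bool]   # validation, also used as the idempotency check
    apply: Callable[[], None]
    rollback: Callable[[], None]


def run_runbook(steps: List[Step], log: List[str]) -> bool:
    """Run steps in order; on any failure, undo this run's changes in reverse."""
    completed: List[Step] = []
    for step in steps:
        try:
            if step.is_done():
                # Already in the desired state: skip, and do not roll it back later.
                log.append(f"skip {step.name}")
                continue
            step.apply()
            if not step.is_done():
                raise RuntimeError("post-step validation failed")
            completed.append(step)
            log.append(f"done {step.name}")
        except Exception as exc:
            log.append(f"fail {step.name}: {exc}")
            for done in reversed(completed):
                done.rollback()
                log.append(f"rollback {done.name}")
            return False
    return True


def build_failover(state: Dict[str, str], smoke_ok: bool) -> List[Step]:
    """Hypothetical failover steps over an in-memory state dict."""
    def change(name: str, key: str, new: str) -> Step:
        old = state[key]  # remember the pre-failover value for rollback

        def set_new() -> None:
            state[key] = new

        def set_old() -> None:
            state[key] = old

        return Step(name, lambda: state[key] == new, set_new, set_old)

    def smoke() -> None:
        # Stand-in for real smoke tests; smoke_ok simulates their outcome.
        if not smoke_ok:
            raise RuntimeError("smoke tests failed")
        state["smoke"] = "passed"

    return [
        change("promote_db_replica", "db_primary", "db-b"),
        change("update_lb_targets", "lb_target", "pool-b"),
        change("update_dns", "dns", "b.example.internal"),
        Step("smoke_tests", lambda: state.get("smoke") == "passed", smoke, lambda: None),
    ]
```

In a real runbook each callable would wrap one of the tools named in the question (Terraform, Ansible, or the AWS CLI). Note that undoing a database promotion may not be a simple inverse operation in practice, so the rollback for that step deserves its own design and testing.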
Medium | System Design
92 practiced
You run a web app on AWS using RDS. Compare a same-region multi-AZ setup versus a multi-region active-passive DR strategy. For each approach, describe failover steps for database and application, expected RTO/RPO, operational complexity, and costs. Recommend one approach and explain trade-offs.
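When comparing the two approaches, a rough model of RTO and RPO can keep the discussion quantitative. The sketch below treats RTO as the sum of sequential recovery phases and RPO as bounded by replication lag when replication is asynchronous. This is a simplification, and the function names and every input value in the usage example are hypothetical placeholders, not measured figures for RDS.

```python
def estimate_rto_seconds(detection_s: float, promotion_s: float,
                         dns_ttl_s: float, app_warmup_s: float = 0.0) -> float:
    """Approximate RTO as the sum of sequential phases of a failover."""
    return detection_s + promotion_s + dns_ttl_s + app_warmup_s


def estimate_rpo_seconds(replication_lag_s: float, synchronous: bool) -> float:
    """Approximate RPO: near zero for synchronous replication, else the lag."""
    return 0.0 if synchronous else replication_lag_s


if __name__ == "__main__":
    # Hypothetical inputs, chosen only to show how the terms combine.
    same_region = estimate_rto_seconds(detection_s=30, promotion_s=60, dns_ttl_s=0)
    cross_region = estimate_rto_seconds(detection_s=60, promotion_s=300,
                                        dns_ttl_s=60, app_warmup_s=120)
    print(f"same-region RTO ~ {same_region:.0f}s, cross-region RTO ~ {cross_region:.0f}s")
    print(f"async replica RPO ~ {estimate_rpo_seconds(5, synchronous=False):.0f}s")
```

The model is useful mainly for seeing which phase dominates: in a cross-region design, promotion time and DNS TTL often matter more than detection time.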
Medium | System Design
72 practiced
Design a redundant load-balancer architecture for a public-facing application deployed in three availability zones. Include control-plane redundancy (config sync), health checks, TLS termination options, session affinity, and techniques to avoid single points of failure for the load balancer itself.
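Health checks in such a design usually need hysteresis so that one dropped probe does not eject a backend. The following is a minimal sketch of that idea, assuming consecutive-result thresholds; the class name `HealthTracker` and the default thresholds are hypothetical and not taken from any specific load balancer.

```python
class HealthTracker:
    """Track one backend's health using consecutive-result thresholds."""

    def __init__(self, unhealthy_threshold: int = 3, healthy_threshold: int = 2):
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy_threshold = healthy_threshold
        self.healthy = True
        self._streak = 0  # consecutive probes that disagree with the current state

    def record(self, check_passed: bool) -> bool:
        """Record one probe result and return the (possibly updated) health state."""
        if check_passed == self.healthy:
            # The probe agrees with the current state, so reset the streak.
            self._streak = 0
            return self.healthy
        self._streak += 1
        threshold = self.healthy_threshold if check_passed else self.unhealthy_threshold
        if self._streak >= threshold:
            self.healthy = check_passed
            self._streak = 0
        return self.healthy
```

With the defaults assumed here, a backend is marked unhealthy only after three consecutive failed probes and is restored after two consecutive passes, which trades slower detection for fewer spurious ejections.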
Easy | Technical
85 practiced
Explain active-active vs active-passive deployment patterns. For each pattern, provide a concrete example of traffic routing, state handling, and a common failure scenario. Describe which pattern you'd pick for a latency-sensitive user-facing API and why.
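The routing difference between the two patterns can be shown in a few lines. This is a simplified sketch that ignores state handling and weighting; the function names and region identifiers are hypothetical.

```python
from typing import Dict, List


def active_passive_targets(primary: str, standby: str,
                           healthy: Dict[str, bool]) -> List[str]:
    """Send all traffic to the primary; use the standby only if the primary is down."""
    if healthy.get(primary, False):
        return [primary]
    if healthy.get(standby, False):
        return [standby]
    return []


def active_active_targets(regions: List[str], healthy: Dict[str, bool]) -> List[str]:
    """Spread traffic across every region that is currently healthy."""
    return [region for region in regions if healthy.get(region, False)]


if __name__ == "__main__":
    health = {"region-a": False, "region-b": True}
    print(active_passive_targets("region-a", "region-b", health))
    print(active_active_targets(["region-a", "region-b"], health))
```

In the active-passive case the standby takes no traffic until failover, so its capacity and warm-up behavior are only exercised during an incident or a drill. In the active-active case the surviving regions must have enough headroom to absorb the failed region's share.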
Hard | Technical
67 practiced
Design a chaos engineering experiment plan to validate failover safety for a global payments system with strict compliance needs. Specify experiment scope, failure injection types (network partition, instance termination, latency injection), safety guardrails, metrics to observe, rollback triggers, and coordination with compliance and fraud teams.
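Safety guardrails and rollback triggers in such a plan are often expressed as metric thresholds that abort the experiment automatically. A minimal sketch of that check follows; the `Guardrails` fields, the function names, and the threshold values in the usage example are hypothetical and would need to be set with the compliance and fraud teams.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class Guardrails:
    max_error_rate: float       # fraction of failed requests, e.g. 0.01 for 1%
    max_p99_latency_ms: float


def should_abort(error_rate: float, p99_latency_ms: float, guardrails: Guardrails) -> bool:
    """Return True when any observed metric breaches its guardrail."""
    return (error_rate > guardrails.max_error_rate
            or p99_latency_ms > guardrails.max_p99_latency_ms)


def run_experiment(samples: Iterable[Tuple[float, float]], guardrails: Guardrails) -> str:
    """Scan metric samples taken during fault injection; stop at the first breach."""
    for index, (error_rate, p99_latency_ms) in enumerate(samples):
        if should_abort(error_rate, p99_latency_ms, guardrails):
            return f"aborted at sample {index}"
    return "completed"


if __name__ == "__main__":
    limits = Guardrails(max_error_rate=0.01, max_p99_latency_ms=500)
    print(run_experiment([(0.001, 120), (0.02, 130)], limits))
```

In a real experiment the abort path would also stop the fault injection and restore normal routing; this sketch covers only the decision logic.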
