InterviewStack.io

High Availability and Disaster Recovery Questions

This topic covers designing systems that remain available and recoverable in the face of infrastructure failures, outages, and disasters. Candidates should be able to define and reason about Recovery Time Objective (RTO) and Recovery Point Objective (RPO) targets and translate service level agreement goals, from 99.9 percent to 99.999 percent, into architecture choices.

Core topics include redundancy strategies such as N+1 and N+2, active-active and active-passive deployment patterns, multi-availability-zone and multi-region topologies, and the trade-offs between same-region high availability and cross-region disaster recovery. Discuss load balancing and traffic shaping, redundant load balancer design, and algorithms such as round robin, least connections, and consistent hashing. Explain failover detection, health checks, automated versus manual failover, convergence and recovery timing, and orchestration of failover and rerouting.

Cover backup, snapshot, and restore strategies, replication and consistency trade-offs for stateful components, leader election and split-brain mitigation, runbooks and recovery playbooks, disaster recovery testing and drills, and cost and operational trade-offs. Include capacity planning, autoscaling, network redundancy, and security and infrastructure hardening considerations so that identity, key management, and logging remain available and recoverable. Emphasize monitoring, observability, alerting on availability signals, and validation through chaos engineering and regular failover exercises.
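To make the availability targets above concrete, it can help to convert each percentage into a yearly downtime budget. The following is a minimal sketch, assuming a 365-day year; the function name `downtime_budget_seconds` is illustrative and not taken from any library.

```python
# Convert an availability target into the downtime it allows per year.
# Assumption: a 365-day year; the helper name is hypothetical.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def downtime_budget_seconds(availability_percent: float) -> float:
    """Return the maximum yearly downtime, in seconds, for a target."""
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        minutes = downtime_budget_seconds(target) / 60
        print(f"{target}% allows about {minutes:.1f} minutes of downtime per year")
```

At 99.9 percent the budget is roughly 8.8 hours per year, which leaves room for manual failover. At 99.999 percent it shrinks to a little over 5 minutes, which generally rules out any recovery step that waits on a human.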

Medium | Technical
77 practiced
Design a monitoring and observability strategy to detect availability degradation early for a distributed service. Cover SLOs and SLIs, latency percentiles, synthetic checks, distributed tracing, log aggregation for incidents, alerting thresholds and routing, and an escalation policy to minimize mean-time-to-detection and mean-time-to-repair.
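One common way to connect SLOs to alerting thresholds is the error-budget burn rate: the observed error rate divided by the error rate the SLO permits. The sketch below is an illustration under stated assumptions, not a prescribed answer. The function names are hypothetical, and the default threshold of 14.4 is a commonly cited paging value (it corresponds to spending 2 percent of a 30-day budget in one hour) that each team would tune.

```python
def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    slo_target is a fraction, e.g. 0.999 for a 99.9 percent SLO.
    A value of 1.0 means the budget is being spent exactly on schedule.
    """
    if total == 0:
        return 0.0  # no traffic in the window, so nothing was burned
    error_budget = 1.0 - slo_target
    return (failed / total) / error_budget


def should_page(short_window: float, long_window: float,
                threshold: float = 14.4) -> bool:
    """Page only when both windows burn fast.

    Requiring a short and a long window (for example 5 minutes and
    1 hour, an assumption here) filters out brief spikes while still
    detecting sustained degradation early.
    """
    return short_window >= threshold and long_window >= threshold
```

For example, 5 failures in 1,000 requests against a 99.9 percent SLO is a burn rate of about 5: the service is consuming its budget five times faster than the SLO allows, which would typically warrant a ticket rather than a page.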
Hard | System Design
70 practiced
Architect a global active-active system for a financial trading platform that must meet 99.999% availability, RPO=0s, and RTO < 5s across three regions. Discuss consensus and ordering guarantees, multi-region replication strategy (sync vs async), leader-election/quorum placement, cross-region latency trade-offs, and operational controls (monitoring, runbooks, drills) to prove the design.
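When reasoning about quorum placement for a question like this, the basic majority arithmetic is a useful starting point. The sketch below assumes a majority-quorum protocol (as used by Raft- or Paxos-style systems) with one or more voters per region; the function names are illustrative.

```python
def majority_quorum(voters: int) -> int:
    """Smallest number of voters that forms a strict majority."""
    if voters < 1:
        raise ValueError("need at least one voter")
    return voters // 2 + 1


def tolerated_failures(voters: int) -> int:
    """How many voters can fail while a majority remains reachable."""
    return voters - majority_quorum(voters)


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(f"{n} voters: quorum {majority_quorum(n)}, "
              f"tolerates {tolerated_failures(n)} failure(s)")
```

With one voter in each of three regions, the quorum is 2, so the group survives the loss of one region. The cost is that every committed write needs an acknowledgement from at least one remote region, which places a cross-region round trip on the write path. Note also that four voters tolerate no more failures than three.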
Medium | System Design
111 practiced
You're designing the checkout service for an e-commerce platform that must achieve RPO ≤ 1s and RTO ≤ 2 minutes and expects a global peak load of 10,000 requests per second. Propose a high-level multi-region architecture. Include the write strategy, the data replication approach, traffic routing and DNS considerations, the failover orchestration approach, and the consistency trade-offs you are willing to make and why.
Medium | Technical
84 practiced
Compare disaster-recovery strategies: cold-standby, pilot-light, warm-standby, and active-active. For each approach, describe typical RTO/RPO ranges, cost implications, and operational complexity, and say when you would recommend it to a medium-sized SaaS customer with a limited budget.
Hard | Technical
123 practiced
Design a chaos engineering program focused on validating disaster recovery capabilities. Describe an experiment catalog (network partitions, AZ and region failovers, disk failures), blast-radius controls, automated rollback/safety gates, metrics and success criteria to observe, and how to integrate experiments into CI/CD and release calendars without endangering SLAs.
