High Availability and Disaster Recovery Questions

This topic covers designing systems that remain available and recoverable in the face of infrastructure failures, outages, and disasters. Candidates should be able to define and reason about Recovery Time Objective (RTO) and Recovery Point Objective (RPO) targets and translate service level agreement (SLA) goals, from 99.9 percent to 99.999 percent availability, into architecture choices. Core topics include redundancy strategies such as N+1 and N+2, active-active and active-passive deployment patterns, multi-availability-zone and multi-region topologies, and the trade-offs between same-region high availability and cross-region disaster recovery. Be prepared to discuss load balancing and traffic shaping, redundant load balancer design, and algorithms such as round robin, least connections, and consistent hashing. Be able to explain failure detection, health checks, automated versus manual failover, convergence and recovery timing, and the orchestration of failover and rerouting. Also covered are backup, snapshot, and restore strategies; replication and consistency trade-offs for stateful components; leader election and split-brain mitigation; runbooks and recovery playbooks; disaster recovery testing and drills; and cost and operational trade-offs. Related areas include capacity planning, autoscaling, network redundancy, and security and infrastructure hardening, so that identity, key management, and logging remain available and recoverable. Throughout, emphasize monitoring, observability, alerting on availability signals, and validation through chaos engineering and regular failover exercises.
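As a rough illustration of how an availability target translates into a design constraint, the sketch below converts a percentage into a yearly downtime budget and estimates the combined availability of redundant replicas. The function names are hypothetical, the year is taken as 365 days, and the redundancy formula assumes replicas fail independently, which real deployments only approximate.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a 365-day year


def downtime_budget_minutes(availability_percent: float,
                            period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Maximum downtime allowed in the period for a given availability target."""
    return (1 - availability_percent / 100) * period_minutes


def parallel_availability(single_availability: float, replicas: int) -> float:
    """Availability of `replicas` redundant copies, assuming independent failures.

    The service is down only when every replica is down at the same time.
    """
    return 1 - (1 - single_availability) ** replicas


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        budget = downtime_budget_minutes(target)
        print(f"{target}% allows about {budget:.1f} minutes of downtime per year")
    print(f"Two 99% replicas: {parallel_availability(0.99, 2):.4%}")
```

The gap between roughly 526 minutes per year at 99.9 percent and roughly 5 minutes at 99.999 percent is what usually rules out manual failover at the higher targets.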

Hard · Technical

Define 'convergence time' during failover. Propose concrete metrics and instrumentation to measure detection time, decision-making time, and traffic-propagation time across the whole system. Suggest architectural and operational techniques to reduce convergence time (for example, health-check tuning, pre-warming, Anycast, or DNS TTL strategies) and explain the trade-offs of each.
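One possible starting point for the measurement part of this question is to record a timestamp at each phase boundary and derive the phase durations from them. The sketch below assumes four hypothetical event timestamps (in seconds) collected from probes and orchestration logs. The worst-case detection estimate is a back-of-the-envelope approximation that assumes fixed-interval health checks and a failure occurring just after a successful check.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class FailoverTimeline:
    """Hypothetical event timestamps, in seconds, for one failover."""
    failure_at: float            # when the fault actually began
    detected_at: float           # when health checks marked the target unhealthy
    decided_at: float            # when the failover decision was committed
    traffic_restored_at: float   # when clients were served by the new target

    def phases(self) -> Dict[str, float]:
        """Break total convergence time into its three phases."""
        return {
            "detection_s": self.detected_at - self.failure_at,
            "decision_s": self.decided_at - self.detected_at,
            "propagation_s": self.traffic_restored_at - self.decided_at,
            "convergence_s": self.traffic_restored_at - self.failure_at,
        }


def worst_case_detection_s(interval_s: float, unhealthy_threshold: int,
                           timeout_s: float) -> float:
    """Approximate upper bound on detection time for periodic health checks.

    Assumes `unhealthy_threshold` consecutive failed checks are required and
    that the last failing check takes the full timeout to report.
    """
    return interval_s * unhealthy_threshold + timeout_s


if __name__ == "__main__":
    timeline = FailoverTimeline(0.0, 12.0, 15.0, 75.0)
    for name, seconds in timeline.phases().items():
        print(f"{name}: {seconds:.1f}")
    print("worst-case detection:", worst_case_detection_s(5, 3, 2))
```

Splitting the total this way shows which phase dominates. In this made-up timeline propagation accounts for 60 of 75 seconds, so shortening DNS TTLs or using Anycast would help more than tightening health checks.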

Easy · Technical

Compare active-active and active-passive deployment patterns for services. For each pattern describe: a) typical failover behavior and detection needs, b) realistic RTO/RPO ranges you can achieve, and c) when you would prefer one over the other for a global SaaS product with geographically dispersed users.

Medium · System Design

Design a multi-region architecture for a read-heavy global service. Compare the trade-offs between using read-replicas (single-master with replicas) and multi-master replication in terms of read latency, consistency, failover behavior, and DR complexity. State which approach you would choose for low-latency global reads and why.

Hard · Technical

Design a highly-available and disaster-recoverable identity and key-management architecture for a SaaS product. Address secure replication of key material or wrapped keys across regions, key-rotation policies, availability during region failovers, secure emergency key-recovery workflows, and least-privilege access for recovery operations.

Hard · System Design

Design an orchestration component that automates failover and traffic reroute across regions: it must coordinate load balancers, DNS changes, database promotion, certificate/key distribution, and execute rollbacks on failure. Explain the orchestration state model, how you make steps idempotent, safety checks you implement, and how you avoid cascading failures.
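To make the state model and idempotency requirements concrete, here is a minimal sketch of one way an answer could structure the orchestrator: an ordered list of steps, each paired with a compensating action, plus a record of completed steps so that a retried run skips work already done. All names (`Step`, `FailoverRun`, the state strings) are hypothetical. A real orchestrator would persist the completed list durably and run safety checks before each step, both of which are omitted here.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Step:
    """One failover action paired with the action that reverses it."""
    name: str
    apply: Callable[[], None]
    undo: Callable[[], None]


@dataclass
class FailoverRun:
    """Runs steps in order; on failure, undoes completed steps in reverse."""
    steps: List[Step]
    completed: List[str] = field(default_factory=list)  # assumption: in memory only
    state: str = "PENDING"

    def execute(self) -> str:
        self.state = "RUNNING"
        for step in self.steps:
            if step.name in self.completed:
                continue  # already applied in an earlier attempt; skip on retry
            try:
                step.apply()
            except Exception:
                self._rollback()
                return self.state
            self.completed.append(step.name)
        self.state = "SUCCEEDED"
        return self.state

    def _rollback(self) -> None:
        self.state = "ROLLING_BACK"
        by_name = {step.name: step for step in self.steps}
        for name in reversed(self.completed):
            by_name[name].undo()
        self.completed.clear()
        self.state = "ROLLED_BACK"


if __name__ == "__main__":
    def fail_dns() -> None:
        raise RuntimeError("DNS provider unavailable")  # simulated failure

    run = FailoverRun([
        Step("promote_db", lambda: print("promote replica"),
             lambda: print("demote replica")),
        Step("update_dns", fail_dns, lambda: print("restore DNS")),
    ])
    print(run.execute())
```

Recording completion by step name is what lets `execute` be called again safely after a crash or timeout. Undoing in reverse order keeps dependent changes, such as DNS pointing at a database that has been promoted, from being left half applied.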
