InterviewStack.io

High Availability and Disaster Recovery Questions

This topic covers designing systems that remain available and recoverable in the face of infrastructure failures, outages, and disasters. Candidates should be able to define and reason about Recovery Time Objective (RTO) and Recovery Point Objective (RPO) targets and translate service level agreement goals, from 99.9 percent to 99.999 percent, into architecture choices. Core topics include redundancy strategies such as N+1 and N+2, active-active and active-passive deployment patterns, multi-availability-zone and multi-region topologies, and the trade-offs between same-region high availability and cross-region disaster recovery. Discuss load balancing and traffic shaping, redundant load balancer design, and algorithms such as round robin, least connections, and consistent hashing. Explain failover detection, health checks, automated versus manual failover, convergence and recovery timing, and the orchestration of failover and rerouting. Cover backup, snapshot, and restore strategies, replication and consistency trade-offs for stateful components, leader election and split-brain mitigation, runbooks and recovery playbooks, disaster recovery testing and drills, and cost and operational trade-offs. Include capacity planning, autoscaling, network redundancy, and security and infrastructure hardening, so that identity, key management, and logging remain available and recoverable. Emphasize monitoring, observability, alerting on availability signals, and validation through chaos engineering and regular failover exercises.
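Of the load balancing algorithms named above, consistent hashing is the least obvious, so a minimal sketch may help. It assumes a hash ring with virtual nodes; the class name `HashRing`, the `vnodes` parameter, and the choice of SHA-256 are illustrative assumptions, not a reference implementation.

```python
import bisect
import hashlib


class HashRing:
    """Illustrative consistent-hash ring with virtual nodes (hypothetical names)."""

    def __init__(self, nodes, vnodes=100):
        # Each physical node is placed on the ring at `vnodes` positions.
        ring = []
        for node in nodes:
            for i in range(vnodes):
                ring.append((self._hash(f"{node}#{i}"), node))
        ring.sort()
        self._ring = ring
        self._positions = [position for position, _ in ring]

    @staticmethod
    def _hash(value):
        # Use the first 8 bytes of SHA-256 as a 64-bit ring position.
        return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")

    def node_for(self, key):
        # Walk clockwise to the first virtual node at or after the key's position,
        # wrapping around to the start of the ring if necessary.
        idx = bisect.bisect_right(self._positions, self._hash(key))
        return self._ring[idx % len(self._ring)][1]


before = HashRing(["a", "b", "c"])
after = HashRing(["a", "b"])  # node "c" has failed and been removed
keys = [f"session-{i}" for i in range(200)]
moved = sum(1 for k in keys if before.node_for(k) != after.node_for(k))
print(f"{moved} of {len(keys)} keys moved after removing one node")
```

The property relevant to availability is that removing a node only remaps the keys that node owned; keys on the surviving nodes stay where they were, which limits cache and session churn during a failure.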

Hard · Technical
81 practiced
You are the on-call cloud engineer when the primary region suffers a major outage, during which a suspected security breach is detected. Walk through your incident response: initial triage, deciding whether to fail over or isolate the region, coordinating with SRE and security teams, preserving forensic evidence, communicating with stakeholders, and the steps for a secure failover if you choose one. Be specific about priorities and safety checks.
Easy · Technical
65 practiced
Explain the purpose of chaos engineering in the context of high availability and disaster recovery. Give two example experiments you would run against a production-like environment to validate DR readiness and what success criteria you would measure.
Medium · Technical
77 practiced
Given SLAs of 99.9 percent, 99.95 percent, 99.99 percent, and 99.999 percent, calculate the allowed downtime per month and per year for each, and recommend the minimum architecture you would consider to meet it (for example single-region multi-AZ, multi-region warm standby, or active-active multi-region). Explain your reasoning and the trade-offs.
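The downtime arithmetic can be sketched as follows. The 30-day month and 365-day year are simplifying assumptions, and the function name is illustrative; real SLAs define their own measurement windows and exclusions.

```python
# Allowed downtime for an availability target.
# Assumes a 30-day month and a 365-day year (simplifications).
MINUTES_PER_MONTH = 30 * 24 * 60
MINUTES_PER_YEAR = 365 * 24 * 60


def allowed_downtime_minutes(sla_percent: float, period_minutes: int) -> float:
    """Minutes of downtime permitted in a period at the given availability."""
    return period_minutes * (1 - sla_percent / 100)


for sla in (99.9, 99.95, 99.99, 99.999):
    per_month = allowed_downtime_minutes(sla, MINUTES_PER_MONTH)
    per_year = allowed_downtime_minutes(sla, MINUTES_PER_YEAR)
    print(f"{sla}%: {per_month:.2f} min/month, {per_year:.2f} min/year")
```

Under these assumptions 99.9 percent leaves about 43.2 minutes per month, while 99.999 percent leaves roughly 26 seconds per month, which is shorter than most manual failover procedures.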
Easy · Technical
80 practiced
Describe the N+1 and N+2 redundancy strategies. For a cloud deployment that runs application servers across multiple availability zones, give concrete examples of when you would choose N+1 versus N+2, and explain the operational trade-offs in terms of cost, capacity, and failure tolerance.
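One way to reason about the capacity cost is sketched below. It assumes the availability zone is the unit of failure, that instances are spread evenly across zones, and that `required` instances are needed to serve peak load; the function names are hypothetical.

```python
import math


def instances_per_zone(required: int, zones: int, zone_failures: int) -> int:
    """Instances each zone must run so that peak load is still served
    after losing `zone_failures` whole zones (1 for N+1, 2 for N+2)."""
    if zone_failures >= zones:
        raise ValueError("cannot tolerate losing every zone")
    return math.ceil(required / (zones - zone_failures))


def total_instances(required: int, zones: int, zone_failures: int) -> int:
    """Total fleet size across all zones."""
    return instances_per_zone(required, zones, zone_failures) * zones


required, zones = 12, 3
for spare in (1, 2):
    total = total_instances(required, zones, spare)
    overhead = (total - required) / required
    print(f"N+{spare}: {total} instances, {overhead:.0%} overhead")
```

With 12 required instances in three zones, tolerating one zone failure needs 18 instances and tolerating two needs 36, so under these assumptions the second level of redundancy costs far more than the first unless more zones are added.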
Hard · Technical
77 practiced
Design a comprehensive chaos engineering plan specifically aimed at validating disaster recovery readiness for a payment processing service. Include which experiments you would run (region failover, database promotion, network partition), how you would measure convergence and recovery times, safety controls, and how you would incorporate findings into runbook improvements.
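Measuring recovery time during such an experiment can be as simple as post-processing health probe results. The sketch below assumes time-ordered `(timestamp_seconds, healthy)` samples from an external prober; the function name and sample data are hypothetical.

```python
from typing import Iterable, Optional, Tuple


def recovery_time_seconds(probes: Iterable[Tuple[float, bool]]) -> Optional[float]:
    """Seconds from the first failed probe to the next healthy probe.

    Returns None if no probe failed or the service never recovered.
    """
    outage_start = None
    for timestamp, healthy in probes:
        if outage_start is None:
            if not healthy:
                outage_start = timestamp
        elif healthy:
            return timestamp - outage_start
    return None


# Hypothetical samples taken every 5 seconds during a failover drill.
samples = [(0, True), (5, True), (10, False), (15, False), (20, False), (25, True)]
rto_target_seconds = 60
observed = recovery_time_seconds(samples)
print(observed, observed is not None and observed <= rto_target_seconds)
```

The measurement is only as precise as the probe interval, so the interval should be small relative to the RTO target being validated.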
