
High Availability and Disaster Recovery Questions

This topic covers designing systems that remain available and recoverable in the face of infrastructure failures, outages, and disasters. Candidates should be able to define and reason about Recovery Time Objective (RTO) and Recovery Point Objective (RPO) targets and translate service level agreement (SLA) availability goals, from 99.9 percent to 99.999 percent, into architecture choices. Core topics include redundancy strategies such as N+1 and N+2, active-active and active-passive deployment patterns, multi-availability-zone and multi-region topologies, and the trade-offs between same-region high availability and cross-region disaster recovery. Discuss load balancing and traffic shaping, redundant load balancer design, and algorithms such as round robin, least connections, and consistent hashing. Explain failover detection, health checks, automated versus manual failover, convergence and recovery timing, and the orchestration of failover and rerouting. Cover backup, snapshot, and restore strategies; replication and consistency trade-offs for stateful components; leader election and split-brain mitigation; runbooks and recovery playbooks; disaster recovery testing and drills; and cost and operational trade-offs. Include capacity planning, autoscaling, network redundancy, and security and infrastructure hardening, so that identity, key management, and logging remain available and recoverable. Emphasize monitoring, observability, alerting on availability signals, and validation through chaos engineering and regular failover exercises.
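To make the availability targets above concrete, the following minimal sketch converts a percentage into a yearly downtime budget. The function name `downtime_budget_minutes` is hypothetical, and the calculation assumes a 365-day year and counts all downtime against the target (real SLAs often exclude planned maintenance).

```python
# Minutes in a non-leap year: 365 days * 24 hours * 60 minutes = 525,600.
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_budget_minutes(availability_percent: float) -> float:
    """Return the allowed downtime per year, in minutes, for an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        print(f"{target}% -> {downtime_budget_minutes(target):.1f} minutes/year")
```

Each additional nine cuts the budget by a factor of ten: roughly 526 minutes per year at 99.9 percent versus about 5 minutes at 99.999 percent, which is why the higher targets generally rule out manual failover.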

Hard · Technical
Design a zero-downtime failover approach for a replicated message queue system similar to Kafka. Consider partition leadership, consumer groups, in-flight messages, exactly-once semantics, and how to avoid message duplication or loss during leader reassignment and cross-region failover.
Hard · System Design
Design a leader-follower replication strategy across regions for a write-heavy OLTP workload to minimize RPO and RTO. Discuss synchronous, semi-synchronous, and asynchronous replication modes, witness nodes, quorum requirements, commit latency, and mechanisms to prevent split-brain during network partitions.
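As a starting point for the quorum and witness part of this question, here is a small sketch of majority-quorum arithmetic. It assumes a simple "strict majority of voters" rule; the function names are hypothetical and not taken from any particular database.

```python
def majority(cluster_size: int) -> int:
    """Smallest number of voters that forms a strict majority."""
    return cluster_size // 2 + 1


def can_elect_leader(partition_size: int, cluster_size: int) -> bool:
    """A partition may elect a leader only if it holds a strict majority."""
    return partition_size >= majority(cluster_size)


if __name__ == "__main__":
    # Two data nodes, no witness: a 1|1 network split leaves neither side
    # with a majority, so no leader can be elected (safe but unavailable).
    print(can_elect_leader(1, 2), can_elect_leader(1, 2))

    # Two data nodes plus one witness (3 voters): in a 2|1 split only the
    # side holding two votes can elect a leader, so split-brain is avoided.
    print(can_elect_leader(2, 3), can_elect_leader(1, 3))
```

Because two disjoint partitions cannot both hold a strict majority, at most one side can accept writes. The witness adds a vote without storing a full data copy, which is what lets a two-region deployment keep one side writable during a partition.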
Easy · Technical
Explain eventual consistency and give two concrete examples: one application feature that tolerates eventual consistency and one that requires stronger consistency guarantees. Describe practical measures to reduce user surprises under eventual consistency.
Medium · Technical
Implement a consistent hashing ring in Python that supports add_node(node_id), remove_node(node_id), and get_node(key). Your design should use virtual nodes to reduce load imbalance and avoid remapping most keys when nodes change. State the time complexity of each operation and briefly explain your design choices.
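One possible shape for an answer is sketched below. It is not a reference solution: the class name `ConsistentHashRing`, the default of 100 virtual nodes, and the choice of SHA-256 truncated to 64 bits are all assumptions made for illustration.

```python
import bisect
import hashlib


class ConsistentHashRing:
    """Consistent hashing ring with virtual nodes (illustrative sketch)."""

    def __init__(self, vnodes: int = 100):
        self.vnodes = vnodes   # virtual nodes per physical node (assumed default)
        self._ring = []        # sorted list of (hash, node_id) points
        self._nodes = set()    # physical nodes currently on the ring

    @staticmethod
    def _hash(value: str) -> int:
        # First 8 bytes of SHA-256 as a 64-bit integer: stable across runs,
        # unlike the built-in hash(), which is randomized per process.
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add_node(self, node_id: str) -> None:
        if node_id in self._nodes:
            return
        self._nodes.add(node_id)
        for i in range(self.vnodes):
            # Each virtual node gets its own point on the ring.
            bisect.insort(self._ring, (self._hash(f"{node_id}#{i}"), node_id))

    def remove_node(self, node_id: str) -> None:
        if node_id not in self._nodes:
            return
        self._nodes.discard(node_id)
        # Filtering keeps the remaining points in sorted order.
        self._ring = [point for point in self._ring if point[1] != node_id]

    def get_node(self, key: str):
        """Return the node owning `key`, or None if the ring is empty."""
        if not self._ring:
            return None
        # First point clockwise whose hash is >= the key's hash, wrapping around.
        idx = bisect.bisect_right(self._ring, (self._hash(key), ""))
        return self._ring[idx % len(self._ring)][1]


if __name__ == "__main__":
    ring = ConsistentHashRing()
    for name in ("node-a", "node-b", "node-c"):
        ring.add_node(name)
    print(ring.get_node("user:42"))
```

With N physical nodes and V virtual nodes each, `get_node` is O(log(NV)) via binary search. `add_node` is O(V·NV) in the worst case because each sorted-list insertion shifts elements, and `remove_node` is O(NV). A balanced tree would speed up membership changes, but those are rare compared with lookups. When a node joins, only keys whose nearest clockwise point now belongs to that node are remapped; all other keys keep their owner.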
Medium · Technical
You maintain a microservice that is the critical write path backed by an external datastore. Explain how you would perform capacity planning and autoscaling so that in the event of a region failure the remaining regions can absorb the traffic. Show calculations for required capacity if a region fails and traffic shifts (e.g., traffic doubles in remaining regions).
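The capacity calculation this question asks for can be sketched as follows. All figures (12,000 requests per second at peak, 500 requests per second per instance, 25 percent headroom) and the function names are hypothetical, and the sketch assumes traffic is spread evenly across the surviving regions.

```python
import math


def per_region_capacity(peak_rps: float, regions: int,
                        tolerated_failures: int = 1,
                        headroom: float = 0.25) -> float:
    """Capacity each region needs so the survivors can absorb the global peak."""
    survivors = regions - tolerated_failures
    if survivors < 1:
        raise ValueError("at least one region must survive")
    return peak_rps / survivors * (1 + headroom)


def instances_needed(region_rps: float, rps_per_instance: float) -> int:
    """Round up, since a fractional instance cannot be provisioned."""
    return math.ceil(region_rps / rps_per_instance)


if __name__ == "__main__":
    peak_rps = 12_000          # hypothetical global peak
    rps_per_instance = 500     # hypothetical, e.g. measured by load testing
    for regions in (2, 3):
        capacity = per_region_capacity(peak_rps, regions)
        normal_load = peak_rps / regions
        print(f"{regions} regions: provision {capacity:.0f} rps per region "
              f"({instances_needed(capacity, rps_per_instance)} instances), "
              f"normal utilization {normal_load / capacity:.0%}")
```

With two regions, each region's traffic doubles after a failure, so each must be provisioned for the full global peak plus headroom. With three regions the surviving pair sees 1.5 times its normal load, which means less idle capacity during normal operation. If autoscaling is expected to close the gap instead, its scale-up time has to fit within the RTO.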
