InterviewStack.io

Disaster Recovery and Business Continuity Questions

Designing and maintaining plans, architectures, and processes to ensure service continuity and recoverability after major incidents or disasters. Topics include defining Recovery Time Objective (RTO) and Recovery Point Objective (RPO), conducting business impact analysis and tiering services by criticality, dependency mapping and recovery ordering, selecting replication and backup strategies including synchronous and asynchronous replication, active-active and active-passive topologies, snapshots and transaction-log-based point-in-time recovery, and planning cold, warm, and hot recovery sites. Also covers failover and failback procedures, orchestration and automation of recovery workflows, runbook creation, stakeholder roles and communications, regular disaster recovery testing and exercises (tabletop, simulated failover, full recovery drills, and chaos engineering), metrics tracking such as mean time to recovery and actual Recovery Time Objective achieved, off-site and geographic redundancy considerations, cloud versus on-premises trade-offs, regulatory and data residency requirements, and post-exercise reviews to close recovery gaps.

Easy · System Design
30 practiced
Describe a DR runbook template for a single mission-critical service. Provide key fields that should be present such as triggers, preconditions, exact recovery steps (including IaC commands if applicable), verification checks, rollback criteria, stakeholders to notify, and estimated timelines for each major step.
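One way to reason about this question is to treat the runbook as structured data, so that missing fields can be detected automatically. The following is a minimal sketch under that assumption; the class names, field names, and helper methods are hypothetical and only mirror the fields listed in the question.

```python
from dataclasses import dataclass


@dataclass
class RunbookStep:
    """One recovery step with its check and time estimate (hypothetical shape)."""
    description: str
    command: str            # exact command to run, e.g. an IaC invocation
    verification: str       # how to confirm the step succeeded
    estimated_minutes: int


@dataclass
class Runbook:
    """Template fields taken from the question; names are illustrative."""
    service: str
    triggers: list[str]
    preconditions: list[str]
    steps: list[RunbookStep]
    rollback_criteria: list[str]
    stakeholders: list[str]

    def missing_fields(self) -> list[str]:
        """Return the names of required list fields that are still empty."""
        required = ("triggers", "preconditions", "steps",
                    "rollback_criteria", "stakeholders")
        return [name for name in required if not getattr(self, name)]

    def estimated_total_minutes(self) -> int:
        """Sum the per-step estimates to get an end-to-end timeline."""
        return sum(step.estimated_minutes for step in self.steps)
```

A review process could refuse to approve any runbook for which `missing_fields()` is non-empty, and compare `estimated_total_minutes()` against the service's RTO.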
Easy · Technical
26 practiced
Describe the key metrics and indicators you would use to evaluate DR readiness and performance, including Mean Time To Recovery (MTTR), RTO achievement percentage, replication lag, data loss measured in bytes or time, test pass rates, and number of critical gaps open. Explain how you would instrument these metrics and present them to different stakeholders.
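Two of the metrics named above, MTTR and RTO achievement percentage, can be computed directly from recorded recovery events. The sketch below assumes each event stores a start time, a restore time, and the RTO target that applied; the `RecoveryEvent` type and function names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RecoveryEvent:
    """One incident or drill (hypothetical record shape)."""
    started: datetime
    restored: datetime
    rto_target: timedelta

    @property
    def duration(self) -> timedelta:
        return self.restored - self.started


def mttr(events: list[RecoveryEvent]) -> timedelta:
    """Mean time to recovery: average of the recovery durations."""
    total = sum((e.duration for e in events), timedelta())
    return total / len(events)


def rto_achievement_pct(events: list[RecoveryEvent]) -> float:
    """Percentage of events restored within their RTO target."""
    met = sum(1 for e in events if e.duration <= e.rto_target)
    return 100.0 * met / len(events)
```

Both functions assume a non-empty list of events; a dashboard would typically compute them per service tier rather than globally.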
Hard · System Design
27 practiced
Design an architecture to achieve RPO = 0 (zero data loss) for a small set of critical datasets across two metro-separated data centers. Consider synchronous replication technologies, commit acknowledgement semantics, the performance impact on application latency, failure modes including network partitions, and fallback strategies when synchronous replication is temporarily unavailable.
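The commit acknowledgement semantics and the fallback choice mentioned in the question can be illustrated with a small decision function. This is a sketch under simplifying assumptions (one local and one remote copy, two fallback policies); the policy names and the function are hypothetical, not a specific product's behavior.

```python
from enum import Enum


class FallbackPolicy(Enum):
    BLOCK_WRITES = "block"        # preserve RPO = 0, sacrifice availability
    DEGRADE_TO_ASYNC = "async"    # keep accepting writes, accept RPO > 0


def acknowledge_commit(local_durable: bool,
                       remote_durable: bool,
                       policy: FallbackPolicy) -> tuple[bool, bool]:
    """Return (acknowledged, rpo_zero_preserved) for one write."""
    if not local_durable:
        # Nothing was committed, so nothing can be lost.
        return (False, True)
    if remote_durable:
        # Synchronous commit: both sites hold the write before the ack.
        return (True, True)
    if policy is FallbackPolicy.BLOCK_WRITES:
        # Remote site unavailable: refuse the ack rather than risk data loss.
        return (False, True)
    # Degraded mode: ack on local durability only; this write is at risk.
    return (True, False)
```

The second element of the result makes the trade-off explicit: any write acknowledged in degraded mode should be tracked so the window of potential data loss is known.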
Hard · Technical
29 practiced
For a regulated financial application, design mechanisms to provide auditable proof-of-recovery after a DR event. Describe who approves recovery, what signed and time-stamped evidence should be captured (checksums, snapshots, logs), chain-of-custody practices for backup media or snapshots, and automated attestations you would produce for external auditors and regulators.
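As an illustration of signed, time-stamped evidence, the sketch below hashes a recovered artifact and signs the resulting record with an HMAC from the Python standard library. It assumes a shared secret key for brevity; a real audit trail would more likely use asymmetric signatures and a trusted timestamp source. The record fields and function names are hypothetical.

```python
import hashlib
import hmac
import json


def build_evidence(artifact: bytes, approver: str,
                   recorded_at: str, key: bytes) -> dict:
    """Create an evidence record: checksum, approver, timestamp, signature."""
    record = {
        "sha256": hashlib.sha256(artifact).hexdigest(),
        "approver": approver,
        "recorded_at": recorded_at,   # ISO 8601 string supplied by the caller
    }
    payload = json.dumps(record, sort_keys=True).encode()
    record["signature"] = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return record


def verify_evidence(record: dict, key: bytes) -> bool:
    """Recompute the signature over the unsigned fields and compare."""
    unsigned = {k: v for k, v in record.items() if k != "signature"}
    payload = json.dumps(unsigned, sort_keys=True).encode()
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, record.get("signature", ""))
```

Any change to the checksum, approver, or timestamp after signing causes `verify_evidence` to return `False`, which is the property an auditor would rely on.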
Hard · System Design
21 practiced
Architect an automated failover system for primary and secondary databases that prevents split-brain, handles network partitions, and supports safe, auditable failback. Include fencing mechanisms, quorum policies, external arbitration, and reconciliation processes to handle divergent writes if they occur.
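The interaction between quorum and fencing can be sketched as a single promotion check: the secondary is promoted only when a strict majority of voters (including any external arbiter) report the primary unreachable and the old primary is confirmed fenced. The `Vote` type and function name are hypothetical, and the sketch ignores vote freshness and leader leases.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vote:
    """One observer's view of the primary (hypothetical shape)."""
    voter: str
    primary_reachable: bool


def should_promote(votes: list[Vote], cluster_size: int,
                   primary_fenced: bool) -> bool:
    """Decide whether it is safe to promote the secondary."""
    unreachable = sum(1 for v in votes if not v.primary_reachable)
    # A strict majority of ALL voters must agree, so a partitioned
    # minority can never promote on its own (split-brain guard).
    majority_agrees = unreachable > cluster_size // 2
    # Without confirmed fencing the old primary may still accept writes.
    return majority_agrees and primary_fenced
```

Counting against `cluster_size` rather than the number of votes received is the design choice that matters here: missing votes count against promotion.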
