InterviewStack.io

Data Consistency and Recovery Questions

Covers the spectrum of data consistency models used in distributed systems and the operational practices for detecting and recovering from inconsistency. Topics include the strong consistency guarantees provided by ACID (atomicity, consistency, isolation, durability) transactions and synchronous replication, and weaker models such as eventual consistency and causal consistency, along with read guarantees such as read-your-writes and monotonic reads. Explain the trade-offs between consistency, availability, and latency, and how those trade-offs influence architecture decisions, user experience, and cost. Discuss replication strategies, including synchronous replication, asynchronous replication, and read replicas, and how each replication mode affects staleness and failure behavior. Include coordination and consensus mechanisms for achieving stronger guarantees, for example leader-based replication and consensus protocols, and distributed transaction approaches such as two-phase commit. Cover operational concerns: how consistency choices change testing, deployment, monitoring, and incident response. Describe detection and recovery techniques for inconsistency, such as validation checks, reconciliation and anti-entropy processes, tombstones and conflict resolution strategies, vector clocks or conflict-free replicated data types (CRDTs) for resolving concurrent updates, point-in-time recovery and backups, and procedures for partial repairs, rollbacks, and replays. At senior levels, also address how consistency decisions shape runbooks, alerting, and post-incident analysis.
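As an illustration of the conflict-free replicated data types mentioned above, here is a minimal sketch of a grow-only counter (G-Counter). The class name `GCounter` and its methods are hypothetical names chosen for this example, not any library's API; the sketch assumes each replica has a unique string ID.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class GCounter:
    """Grow-only counter: one non-decreasing slot per replica (illustrative only)."""

    replica_id: str
    counts: dict[str, int] = field(default_factory=dict)

    def increment(self, amount: int = 1) -> None:
        # A replica only ever increases its own slot.
        if amount < 0:
            raise ValueError("a G-Counter can only grow")
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def value(self) -> int:
        # The counter's value is the sum of all replica slots.
        return sum(self.counts.values())

    def merge(self, other: GCounter) -> None:
        # Element-wise max is commutative, associative, and idempotent,
        # so replicas converge regardless of merge order or repetition.
        for rid, n in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), n)


a, b = GCounter("a"), GCounter("b")
a.increment(2)
b.increment(3)
a.merge(b)
b.merge(a)
print(a.value(), b.value())
```

Because the merge takes a per-replica maximum, concurrent increments on different replicas are never lost and no coordination is needed at write time.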

Hard · Technical
73 practiced
Compare Paxos, Raft, and PBFT (practical Byzantine fault tolerance) from an SRE perspective: explain leader election, log replication, recovery after minority failures, performance under contention, and operational complexity including upgrades and troubleshooting.
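One quantitative difference worth having at hand for this comparison: crash-fault-tolerant protocols such as Paxos and Raft need n ≥ 2f + 1 nodes to tolerate f failures, while PBFT needs n ≥ 3f + 1 to tolerate f Byzantine nodes. The helper names below are hypothetical and the sketch only does the arithmetic.

```python
def tolerated_faults(n: int, byzantine: bool = False) -> int:
    """Maximum faulty nodes a cluster of size n can tolerate.

    Crash-fault protocols (Paxos, Raft): n >= 2f + 1.
    Byzantine protocols (PBFT):          n >= 3f + 1.
    """
    if n < 1:
        raise ValueError("cluster size must be positive")
    return (n - 1) // 3 if byzantine else (n - 1) // 2


def majority(n: int) -> int:
    """Smallest majority quorum for a crash-fault cluster of size n."""
    return n // 2 + 1


for size in (3, 4, 5, 7):
    print(size, tolerated_faults(size), tolerated_faults(size, byzantine=True))
```

The larger cluster requirement is one reason PBFT tends to carry higher operational cost than Raft or Paxos for the same fault budget.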
Hard · Technical
86 practiced
Simulate a two-phase commit system in Python: implement a coordinator and participant interfaces that demonstrate prepare, vote, commit, and abort flows. Include failure injection (participant crashes, coordinator crash) in your simulation and show how your system recovers or blocks. Explain limitations of your simulation relative to production systems.
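A possible starting point for such a simulation is sketched below. It is a single-process toy under stated assumptions: all class and flag names (`Participant`, `Coordinator`, `crash_on_prepare`, `crash_after_prepare`) are hypothetical, an in-memory list stands in for a durable log, and there are no timeouts or real network calls.

```python
from enum import Enum


class Vote(Enum):
    YES = "yes"
    NO = "no"


class Participant:
    def __init__(self, name, vote=Vote.YES, crash_on_prepare=False):
        self.name = name
        self.vote = vote
        self.crash_on_prepare = crash_on_prepare  # failure injection
        self.state = "init"  # init -> prepared -> committed | aborted

    def prepare(self):
        if self.crash_on_prepare:
            raise ConnectionError(f"{self.name} unreachable")
        self.state = "prepared" if self.vote is Vote.YES else "aborted"
        return self.vote

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


class Coordinator:
    def __init__(self, participants, crash_after_prepare=False):
        self.participants = participants
        self.crash_after_prepare = crash_after_prepare  # failure injection
        self.log = []  # stand-in for a durable decision log

    def run(self):
        # Phase 1: collect votes; an unreachable participant counts as NO.
        votes = []
        for p in self.participants:
            try:
                votes.append(p.prepare())
            except ConnectionError:
                votes.append(Vote.NO)
        if self.crash_after_prepare:
            # No decision is logged or sent: prepared participants stay
            # blocked, which is the classic weakness of two-phase commit.
            return "blocked"
        # Phase 2: log the decision, then broadcast it.
        decision = "commit" if all(v is Vote.YES for v in votes) else "abort"
        self.log.append(decision)
        for p in self.participants:
            # Assumption: a crashed participant applies the outcome on recovery.
            p.commit() if decision == "commit" else p.abort()
        return decision


ok = Coordinator([Participant("a"), Participant("b")])
print(ok.run())
stuck = Coordinator([Participant("a"), Participant("b")], crash_after_prepare=True)
print(stuck.run(), [p.state for p in stuck.participants])
```

A fuller answer would add persistent logs on both sides, timeouts, and a recovery path in which a restarted coordinator replays its log to finish or abort in-flight transactions.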
Medium · Technical
97 practiced
Design a read routing policy for clients that balances low latency and data freshness. The system has a primary and multiple read replicas with variable lag. Discuss sticky sessions, quorum reads, read timestamps, or staleness-bound reads, and include how an SRE would implement and measure the policy.
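One of the options named in the question, staleness-bound reads, can be sketched as follows. This is a minimal illustration, assuming each replica reports a measured replication lag and a client-observed latency; the `Replica` type, `choose_read_target` function, and the `"primary"` label are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    name: str
    lag_seconds: float   # measured replication lag behind the primary
    latency_ms: float    # latency observed from this client


def choose_read_target(replicas, max_staleness_seconds, primary="primary"):
    """Return the lowest-latency replica within the staleness bound.

    Falls back to the primary when no replica is fresh enough.
    """
    fresh = [r for r in replicas if r.lag_seconds <= max_staleness_seconds]
    if not fresh:
        return primary
    return min(fresh, key=lambda r: r.latency_ms).name


replicas = [
    Replica("replica-near", lag_seconds=8.0, latency_ms=3.0),
    Replica("replica-far", lag_seconds=0.5, latency_ms=40.0),
]
print(choose_read_target(replicas, max_staleness_seconds=10.0))
print(choose_read_target(replicas, max_staleness_seconds=1.0))
print(choose_read_target(replicas, max_staleness_seconds=0.1))
```

To measure such a policy, an SRE might track the fraction of reads that fall back to the primary and the distribution of lag at read time, since both show how often the bound is actually constraining routing.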
Hard · System Design
69 practiced
Architect a globally distributed database that must support low-latency reads worldwide and allow writes from multiple regions. Discuss consistency model choices, whether to adopt primary-region writes, conflict-resolution strategies, metadata choices (vector clocks, timestamps), and how to ensure recoverability and acceptable availability across regional failures.
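For the metadata choice, a vector clock lets a multi-region system distinguish causally ordered writes from truly concurrent ones, which wall-clock timestamps alone cannot do. A minimal sketch, assuming clocks are plain dictionaries from a region ID to a counter (the function names `tick` and `compare` are hypothetical):

```python
def tick(clock: dict, region: str) -> dict:
    """Return a copy of the clock with this region's counter advanced by one."""
    updated = dict(clock)
    updated[region] = updated.get(region, 0) + 1
    return updated


def compare(a: dict, b: dict) -> str:
    """Classify a relative to b: 'equal', 'before', 'after', or 'concurrent'."""
    regions = a.keys() | b.keys()
    a_le_b = all(a.get(r, 0) <= b.get(r, 0) for r in regions)
    b_le_a = all(b.get(r, 0) <= a.get(r, 0) for r in regions)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    # Neither dominates: the writes were concurrent and need conflict resolution.
    return "concurrent"


base = tick({}, "us-east")
from_us = tick(base, "us-east")
from_eu = tick(base, "eu-west")
print(compare(base, from_us))
print(compare(from_us, from_eu))
```

Only the "concurrent" case requires a resolution strategy such as last-writer-wins, application-level merge, or a CRDT; the ordered cases can simply keep the later version.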
Hard · Technical
81 practiced
You wake up to alerts that a network partition caused a majority of replicas to diverge for several minutes, and the divergence is now large. As lead SRE, draft an operational runbook to contain the incident: immediate actions, how to pick a canonical data set, reconciliation approach (manual vs. automated), communication steps, and how to update SLOs and assign postmortem ownership.
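A runbook step for sizing the divergence could compare per-key digests between two replicas before any repair is attempted. The sketch below is illustrative only: it assumes each replica's data can be exported as a key-to-string mapping, and the function names `digest_map` and `divergent_keys` are hypothetical.

```python
import hashlib


def digest_map(rows: dict) -> dict:
    """Map each key to a SHA-256 digest of its value, so replicas can
    exchange compact digests instead of full rows."""
    return {
        key: hashlib.sha256(value.encode("utf-8")).hexdigest()
        for key, value in rows.items()
    }


def divergent_keys(digests_a: dict, digests_b: dict) -> list:
    """Return the sorted keys whose digests differ or exist on one side only."""
    keys = digests_a.keys() | digests_b.keys()
    return sorted(k for k in keys if digests_a.get(k) != digests_b.get(k))


replica_a = {"user:1": "alice", "user:2": "bob", "user:3": "carol"}
replica_b = {"user:1": "alice", "user:2": "bobby", "user:4": "dave"}
print(divergent_keys(digest_map(replica_a), digest_map(replica_b)))
```

The resulting key list gives a concrete measure of blast radius and can feed either a manual review queue or an automated reconciliation job.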
