InterviewStack.io

Data Consistency and Recovery Questions

Covers the spectrum of data consistency models used in distributed systems and the operational practices for detecting and recovering from inconsistency. Topics include the strong guarantees provided by ACID (atomicity, consistency, isolation, durability) transactions and synchronous replication, and weaker models such as eventual consistency and causal consistency, along with their read guarantees like read-your-writes and monotonic reads. Explain the trade-offs between consistency, availability, and latency, and how those trade-offs influence architecture decisions, user experience, and cost. Discuss replication strategies, including synchronous replication, asynchronous replication, and read replicas, and how replication modes affect staleness and failure behavior. Include coordination and consensus mechanisms for achieving stronger guarantees, for example leader-based replication and consensus protocols, and distributed transaction approaches such as two-phase commit. Cover operational concerns: how consistency choices change testing, deployment, monitoring, and incident response. Describe detection and recovery techniques for inconsistency, such as validation checks, reconciliation and anti-entropy processes, tombstones and conflict resolution strategies, vector clocks or conflict-free replicated data types (CRDTs) to detect and resolve concurrent updates, point-in-time recovery and backups, and procedures for partial repairs, rollbacks, and replays. At senior levels, also address how consistency decisions shape runbooks, alerting, and post-incident analysis.
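As a minimal sketch of how vector clocks separate ordered updates from concurrent ones, the Python below compares two clocks and merges them. The function names `compare` and `merge` and the dict-of-counters representation are assumptions made for illustration, not the API of any particular database.

```python
from typing import Dict

# A vector clock maps a node id to the number of updates seen from that node.
VectorClock = Dict[str, int]


def compare(a: VectorClock, b: VectorClock) -> str:
    """Classify the relationship between two vector clocks.

    Returns "equal", "before" (a happened before b), "after", or "concurrent".
    Missing entries are treated as zero.
    """
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    # Neither clock dominates the other: the updates conflict.
    return "concurrent"


def merge(a: VectorClock, b: VectorClock) -> VectorClock:
    """Element-wise maximum, used after a conflict has been resolved."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}


if __name__ == "__main__":
    x = {"node_a": 2, "node_b": 0}
    y = {"node_a": 1, "node_b": 1}
    print(compare(x, y))
    print(merge(x, y))
```

A "concurrent" result only detects the conflict; picking a winner (last-writer-wins, application merge, or a CRDT) is a separate policy decision.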

Hard | Technical
82 practiced
Explain the CAP theorem and its practical implications for SRE decisions. Give three concrete architecture choices where you would sacrifice consistency for availability or vice versa, and quantify the latency and cost impacts of each choice.
Hard | System Design
76 practiced
Design alerting thresholds and anomaly detection for data divergence signals to minimize false positives. Describe statistical baselining, adaptive thresholds, scoring rules for multi-metric signals (e.g., checksum mismatch rate, replication lag, stale-read rate), and escalation policies for SRE teams.
Medium | System Design
94 practiced
Design a replication strategy for a write-heavy service serving 500k writes/second in a single region with strong consistency required for most operations. Explain whether you would use synchronous replication, leader sharding, partitioning, or a leaderless design; describe trade-offs for latency, availability, hardware costs, and operational complexity.
Hard | Technical
91 practiced
Case study: a global payments platform requires strict correctness for money transfers and legal audit trails. Propose an architecture to guarantee consistency (no double-spend, durable audit logs) across regions, describe how you would implement distributed transactions and failure handling, and list operational tests you would run before production rollout.
Easy | Technical
73 practiced
You operate a service with a primary DB and read replicas. One replica reports the following replication-lag histogram: 90th percentile lag = 2s, 99th percentile = 12s, max = 90s. What conclusions can you draw about staleness risk for user reads, and what immediate SRE actions or alerts would you configure to prevent user-visible anomalies?
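One way to reason about the replica-lag question above is to turn the reported percentiles into rough bounds on how often lag exceeds a staleness budget. The sketch below is a hedged illustration: the function name `stale_fraction_bounds` and the 5-second budget are hypothetical, and it assumes lag samples are taken uniformly over time, so the result bounds the fraction of time the replica is stale rather than the exact fraction of user reads affected.

```python
from typing import Dict, Tuple


def stale_fraction_bounds(
    percentiles: Dict[float, float], budget_s: float
) -> Tuple[float, float]:
    """Bound the fraction of lag samples that exceed budget_s.

    percentiles maps a percentile (0-100) to the lag in seconds at that
    percentile. Returns approximate (lower, upper) bounds.
    """
    lower, upper = 0.0, 1.0
    for p, lag_s in sorted(percentiles.items()):
        tail = 1.0 - p / 100.0  # share of samples above this percentile
        if lag_s <= budget_s:
            # At most `tail` of the samples can be above a value within budget.
            upper = min(upper, tail)
        else:
            # Roughly `tail` of the samples sit at or above a value over budget.
            lower = max(lower, tail)
    return lower, upper


if __name__ == "__main__":
    # Values from the question: p90 = 2s, p99 = 12s, max = 90s.
    reported = {90: 2.0, 99: 12.0, 100: 90.0}
    low, high = stale_fraction_bounds(reported, budget_s=5.0)
    print(f"lag > 5s for between {low:.0%} and {high:.0%} of samples")
```

With a 5-second budget the histogram places the stale share somewhere between about 1% and 10% of samples, which suggests alerting on the p99 and max rather than the p90, and routing read-after-write traffic to the primary while lag is above budget.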
