Data Consistency and Recovery Questions

Covers the spectrum of data consistency models used in distributed systems and the operational practices for detecting and recovering from inconsistency. Topics include the strong guarantees provided by ACID (atomicity, consistency, isolation, durability) transactions and synchronous replication, and weaker models such as eventual consistency and causal consistency, along with session guarantees such as read-your-writes and monotonic reads. Explain the trade-offs between consistency, availability, and latency, and how those trade-offs influence architecture decisions, user experience, and cost. Discuss replication strategies, including synchronous replication, asynchronous replication, and read replicas, and how each replication mode affects staleness and failure behavior. Include coordination and consensus mechanisms for achieving stronger guarantees, for example leader-based replication and consensus protocols, and distributed transaction approaches such as two-phase commit. Cover operational concerns: how consistency choices change testing, deployment, monitoring, and incident response. Describe detection and recovery techniques for inconsistency, such as validation checks, reconciliation and anti-entropy processes, tombstones and conflict resolution strategies, vector clocks or conflict-free replicated data types (CRDTs) for resolving concurrent updates, point-in-time recovery and backups, and procedures for partial repairs, rollbacks, and replays. At senior levels, also address how consistency decisions shape runbooks, alerting, and post-incident analysis.

Medium | System Design

Design a monitoring and alerting strategy to detect data divergence between a primary database and its replicas. List key metrics, sampling strategies, thresholds, and how to prioritize alerts to avoid noise. Include how you would measure divergence for both key-value stores and complex relational data.
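One way to approach the key-value part of this question is to compare bucketed checksums instead of individual rows. The sketch below is only an illustration under stated assumptions: each store is modeled as an in-memory dict, and the function names `bucket_digests` and `divergent_buckets` are hypothetical, not part of any database API. The fraction of mismatching buckets could serve as a coarse divergence metric to alert on.

```python
import hashlib


def bucket_digests(rows, buckets=16):
    """Hash every (key, value) pair into one of `buckets` running digests.

    `rows` is a dict standing in for a table or keyspace snapshot (assumption).
    """
    hashers = [hashlib.sha256() for _ in range(buckets)]
    for key in sorted(rows):  # sorted so both sides feed pairs in the same order
        # sha256 gives a stable bucket choice; built-in hash() is salted per process
        idx = int(hashlib.sha256(key.encode()).hexdigest(), 16) % buckets
        hashers[idx].update(repr((key, rows[key])).encode())
    return [h.hexdigest() for h in hashers]


def divergent_buckets(primary, replica, buckets=16):
    """Return the indexes of buckets whose digests differ between the two sides."""
    p = bucket_digests(primary, buckets)
    r = bucket_digests(replica, buckets)
    return [i for i, (x, y) in enumerate(zip(p, r)) if x != y]


primary = {"user:1": "alice", "user:2": "bob", "user:3": "carol"}
replica = dict(primary)
replica["user:2"] = "stale-bob"  # simulate one divergent row

bad = divergent_buckets(primary, replica)
print(f"divergence ratio: {len(bad)}/16 buckets")
```

Only the mismatching buckets need a row-by-row comparison afterwards, which keeps the routine check cheap. For relational data the same idea would apply per table and primary-key range, on the assumption that both sides can be read at a comparable snapshot.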

Easy | Technical

Define the read guarantees 'read-your-writes' and 'monotonic reads' in distributed storage systems. Provide a short example for each showing client actions and server responses, and explain how an SRE might instrument or enforce these guarantees at the client or middleware layer.
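A minimal sketch of client-side enforcement, assuming every write returns a monotonically increasing version and every replica exposes the last version it has applied. The `Replica` and `Session` classes are hypothetical stand-ins for a real storage client, not an existing library.

```python
class Replica:
    """Hypothetical replica that tracks the last version it has applied."""

    def __init__(self):
        self.applied_version = 0
        self.data = {}

    def apply(self, version, key, value):
        self.data[key] = value
        self.applied_version = version


class Session:
    """Client-side token: the highest version this client has written or read."""

    def __init__(self):
        self.min_version = 0

    def note_write(self, version):
        # Read-your-writes: later reads must come from a replica at least this fresh.
        self.min_version = max(self.min_version, version)

    def read(self, replicas, key):
        for replica in replicas:
            if replica.applied_version >= self.min_version:
                # Monotonic reads: advance the token so later reads never go backwards.
                self.min_version = replica.applied_version
                return replica.data.get(key)
        raise RuntimeError("no replica is fresh enough; fall back to the primary")


fresh, stale = Replica(), Replica()
fresh.apply(1, "name", "alice")  # the stale replica has not received version 1 yet

session = Session()
session.note_write(1)  # the client just wrote version 1
result = session.read([stale, fresh], "name")  # skips the stale replica
print(result)
```

Counting how often a read skips a replica or falls back to the primary is one natural place to add instrumentation for these guarantees.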

Medium | Technical

Implement a grow-only counter (G-counter) CRDT in Python. The API should support the operations increment(node_id, amount), merge(other_counter), and value(). Describe the expected convergence properties and show sample usage: two nodes increment independently, then merge, and both report the correct combined count.
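One possible shape for an answer, as a sketch and not a reference implementation: each node owns one slot, increments touch only that slot, and merge takes the element-wise maximum. The class name `GCounter` and the node names are illustrative.

```python
from __future__ import annotations


class GCounter:
    """Grow-only counter: one non-negative slot per node."""

    def __init__(self) -> None:
        self.counts: dict[str, int] = {}

    def increment(self, node_id: str, amount: int = 1) -> None:
        if amount < 0:
            raise ValueError("a G-counter can only grow")
        self.counts[node_id] = self.counts.get(node_id, 0) + amount

    def merge(self, other: GCounter) -> None:
        # Element-wise max is commutative, associative, and idempotent,
        # so replicas converge regardless of merge order or repetition.
        for node_id, count in other.counts.items():
            self.counts[node_id] = max(self.counts.get(node_id, 0), count)

    def value(self) -> int:
        return sum(self.counts.values())


# Two nodes increment independently, then exchange state.
a, b = GCounter(), GCounter()
a.increment("node-a", 3)
b.increment("node-b", 4)
a.merge(b)
b.merge(a)
print(a.value(), b.value())  # 7 7
```

Convergence relies on each node incrementing only its own slot. Merging the same state twice changes nothing, which is what makes retried or duplicated gossip messages safe.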

Medium | Technical

After an incident, several writes were lost on one shard but are still present on others. Outline a practical partial repair and reconciliation playbook an SRE would follow: detection, containment, selecting a source of truth, replaying or patching data, testing the repair, and communicating with stakeholders.

Hard | System Design

Design a rollback and replay mechanism using a write-ahead log (WAL) that allows partial repair across replicas after a corruption event. Explain how to ensure idempotency, handle gaps in the WAL, determine canonical ordering, and coordinate replay without further corrupting other replicas.
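The idempotency and gap-handling parts of this question can be illustrated with a small sketch. It assumes each WAL record carries a dense, strictly increasing log sequence number (LSN) that defines the canonical order, and that each replica remembers the last LSN it applied. The names `WalRecord`, `WalGapError`, and `replay` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WalRecord:
    lsn: int    # log sequence number, assumed dense and strictly increasing
    key: str
    value: str


class WalGapError(Exception):
    """Raised when the next expected LSN is missing from the log."""


def replay(state, applied_lsn, records):
    """Apply records newer than applied_lsn in LSN order; return the new applied LSN."""
    for record in sorted(records, key=lambda r: r.lsn):
        if record.lsn <= applied_lsn:
            continue  # already applied: skipping is what makes replay idempotent
        if record.lsn != applied_lsn + 1:
            # Stop before the gap instead of applying later records out of order.
            raise WalGapError(f"missing LSN {applied_lsn + 1}")
        state[record.key] = record.value
        applied_lsn = record.lsn
    return applied_lsn


log = [WalRecord(1, "a", "x"), WalRecord(2, "a", "y"), WalRecord(3, "b", "z")]
state = {}
lsn = replay(state, 0, log)
lsn_again = replay(state, lsn, log)  # second pass is a no-op
print(lsn, lsn_again, state)
```

In this sketch a gap halts the replay with everything before the gap already applied, so an operator can fetch the missing segment from another replica or a backup and resume from the returned LSN.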
