InterviewStack.io

Reliability, High Availability, and Tradeoffs Questions

Design patterns and decision making for ensuring availability, correctness, and graceful behavior under failure while balancing technical trade-offs. Topics include redundancy and failover strategies, including active-passive and active-active deployments; fault isolation using bulkhead and circuit breaker patterns; graceful degradation and feature-gating strategies; defining service level objectives and service level agreements and mapping them to recovery point and recovery time objectives; multi-region and multi-availability-zone deployment considerations; testing for reliability, including chaos engineering and fault injection; and reasoning about consistency versus availability trade-offs and the operational cost of stronger guarantees. Candidates should be able to choose reliability patterns that meet business objectives and to explain their implications for cost, performance, and maintainability.
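As an illustration of the circuit breaker pattern named above, here is a minimal sketch in Python. The class name, thresholds, and injectable clock are assumptions made for this example and do not come from any particular library.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Hypothetical minimal breaker: closed -> open -> half-open -> closed."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self.clock = clock                          # injectable so tests can control time
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast instead of sending load to a struggling dependency.
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, so let this one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each request to the downstream dependency in `breaker.call(...)` and treat `CircuitOpenError` as a signal to serve a degraded response. A production breaker would usually also need thread safety and a limit on concurrent trial calls, which this sketch omits.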

Easy · Technical
20 practiced
Consider this simplified architecture: Ingest API -> Load Balancer -> 2 replicas of ingest-service -> Kafka cluster (single ZooKeeper and single broker) -> Consumer group -> Data warehouse. Identify single points of failure, explain the impact of each, and propose practical fixes (no code required) to remove or mitigate those SPOFs while weighing operational cost.
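Although the question asks for no code, a short calculation can help when weighing the cost of removing a SPOF. The sketch below assumes independent failures and uses hypothetical availability figures; it also assumes any one replica is enough to serve traffic, which is optimistic for a quorum-based system such as Kafka.

```python
from math import prod


def series_availability(*components):
    """Availability of a chain where every component must be up."""
    return prod(components)


def redundant_availability(single, replicas):
    """Availability of N independent replicas where any one suffices."""
    return 1 - (1 - single) ** replicas


# Hypothetical figures, chosen only to show the shape of the calculation.
single_broker = 0.99
chain_before = series_availability(0.999, single_broker, 0.999)
chain_after = series_availability(
    0.999, redundant_availability(single_broker, 3), 0.999
)
print(f"before: {chain_before:.5f}, after: {chain_after:.5f}")
```

The point of the comparison is that the weakest serial component dominates the chain, so adding replicas elsewhere buys little until the single broker and single ZooKeeper node are addressed.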
Medium · System Design
36 practiced
Outline a fault-injection test plan for nightly ETL jobs that write to a data lake: include how you'd simulate corrupted messages, partial writes during uploads, upstream API latency spikes, and storage throttling. Describe automation strategies, expected outcomes, and the rollback criteria you'd require to stop a test early.
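One way to approach the corrupted-message part of such a plan is sketched below, assuming messages are JSON objects serialized as strings. The function names, the truncation-based corruption, and the dead-letter list are illustrative assumptions, not a prescribed design.

```python
import json
import random


def inject_corruption(messages, rate, seed=0):
    """Yield messages, truncating a seeded-random fraction of them.

    Truncating a serialized JSON object drops its closing brace, so the
    result no longer parses. The seed keeps test runs reproducible.
    """
    rng = random.Random(seed)
    for msg in messages:
        if rng.random() < rate:
            yield msg[: len(msg) // 2]
        else:
            yield msg


def load_batch(messages):
    """Parse messages; route unparseable ones to a dead-letter list."""
    good, dead_letter = [], []
    for msg in messages:
        try:
            good.append(json.loads(msg))
        except json.JSONDecodeError:
            dead_letter.append(msg)
    return good, dead_letter


if __name__ == "__main__":
    batch = [json.dumps({"id": i}) for i in range(100)]
    good, dead = load_batch(inject_corruption(batch, rate=0.2, seed=1))
    # Expected outcome of the test: nothing is silently dropped.
    assert len(good) + len(dead) == len(batch)
```

An automated nightly run could assert that every injected fault ends up in the dead-letter path and abort early if the count of unaccounted-for records is ever non-zero.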
Medium · System Design
21 practiced
You're launching an A/B experiment that introduces a computationally expensive enrichment stage in the pipeline. Design a feature-gating mechanism to roll it out safely: describe how to route test vs control events, measure resource and latency impact, keep availability under load, and implement automated rollback if SLOs degrade.
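A possible shape for such a gate is sketched below: deterministic hash bucketing decides test versus control, and a reported latency above the SLO flips a kill switch. The class, its parameters, and the single p99 latency signal are assumptions for illustration; a real system would likely read the signal from its monitoring stack and persist the gate state.

```python
import hashlib


class FeatureGate:
    """Hypothetical percentage rollout gate with an SLO-based kill switch."""

    def __init__(self, name, rollout_percent, p99_latency_slo_ms):
        self.name = name
        self.rollout_percent = rollout_percent        # 0..100
        self.p99_latency_slo_ms = p99_latency_slo_ms
        self.enabled = True

    def in_test_group(self, event_key):
        """Return True if this event should take the expensive enrichment path."""
        if not self.enabled:
            return False  # rolled back: every event takes the control path
        # Hash the gate name with the key so buckets are stable per event
        # and uncorrelated across different experiments.
        digest = hashlib.sha256(f"{self.name}:{event_key}".encode()).digest()
        bucket = int.from_bytes(digest[:4], "big") % 100
        return bucket < self.rollout_percent

    def report_p99_latency(self, observed_ms):
        """Automated rollback: disable the feature when the SLO is breached."""
        if observed_ms > self.p99_latency_slo_ms:
            self.enabled = False


if __name__ == "__main__":
    gate = FeatureGate("enrichment-v2", rollout_percent=10, p99_latency_slo_ms=500)
    routed = sum(gate.in_test_group(f"event-{i}") for i in range(10_000))
    print(f"{routed} of 10000 events routed to the test path")
    gate.report_p99_latency(900)  # simulated SLO breach
    print("enabled after breach:", gate.enabled)
```

Because the kill switch is checked before bucketing, rollback takes effect on the next event without a deploy, which is what keeps availability intact under load.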
Medium · System Design
26 practiced
Design a chaos engineering experiment plan for a streaming pipeline: list specific failures to inject (broker node crash, network partition between producers and brokers, increased processing latency), the target metrics to monitor, how to limit blast radius, and validation criteria to declare the experiment successful.
Medium · Technical
20 practiced
Compare synchronous and asynchronous replication for a service that stores critical metadata for pipelines. Discuss implications for availability, write latency, data loss risk, and operational cost. Give examples when higher write latency is acceptable and when asynchronous replication would be preferable.
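The data loss difference can be made concrete with a toy in-memory model, sketched below. The class and method names are hypothetical, and the model ignores network latency and partial failures; it only shows which acknowledged writes survive a failover.

```python
class ReplicatedStore:
    """Toy primary/replica pair, used only to contrast the two modes."""

    def __init__(self, synchronous):
        self.synchronous = synchronous
        self.primary = []
        self.replica = []
        self.pending = []  # acknowledged on primary, not yet on replica

    def write(self, record):
        self.primary.append(record)
        if self.synchronous:
            # The ack waits for the replica, which costs write latency
            # but means every acknowledged write exists in two places.
            self.replica.append(record)
        else:
            # The ack returns immediately; replication happens later.
            self.pending.append(record)
        return "ack"

    def replicate_pending(self):
        """Background replication step (asynchronous mode only)."""
        self.replica.extend(self.pending)
        self.pending.clear()

    def fail_over(self):
        """Lose the primary and promote the replica.

        Returns the acknowledged writes that were lost.
        """
        lost = list(self.pending)
        self.primary = list(self.replica)
        self.pending.clear()
        return lost


if __name__ == "__main__":
    sync_store = ReplicatedStore(synchronous=True)
    sync_store.write("job-1")
    print("sync lost:", sync_store.fail_over())

    async_store = ReplicatedStore(synchronous=False)
    async_store.write("job-1")
    async_store.replicate_pending()
    async_store.write("job-2")  # acknowledged but never replicated
    print("async lost:", async_store.fail_over())
```

In this model the size of `pending` at failover is the recovery point exposure, which is the quantity to weigh against the added write latency of the synchronous mode.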
