InterviewStack.io

Fault Tolerance and Failure Scenarios Questions

Designing systems resilient to component failures: timeouts, retries with exponential backoff, circuit breakers, bulkheads. Discuss cascading failure prevention and graceful degradation. At Staff level, demonstrate thinking about multi-layer failures (service failures, database failures, network partitions) and how to detect and recover from them.
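As one illustration of the "retries with exponential backoff" item above, here is a minimal Python sketch. The function name `retry_with_backoff`, its parameters, and the choice of "full jitter" (a random delay between zero and the exponential cap) are assumptions made for this example, not something the question bank prescribes.

```python
import random
import time


def retry_with_backoff(op, max_attempts=4, base_delay=0.1, max_delay=2.0,
                       sleep=time.sleep, rng=random.random):
    """Call op() until it succeeds or max_attempts is exhausted.

    The delay before retry n (0-based) is a random value in
    [0, min(max_delay, base_delay * 2**n)], i.e. capped exponential
    backoff with full jitter. `sleep` and `rng` are injectable for tests.
    """
    last_error = None
    for attempt in range(max_attempts):
        try:
            return op()
        except Exception as exc:  # real code should catch only retryable errors
            last_error = exc
            if attempt == max_attempts - 1:
                break  # no point sleeping after the final attempt
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(rng() * cap)
    raise last_error
```

The jitter spreads retries from many clients over time, and the cap bounds the worst-case wait. In practice this would be paired with a per-call timeout and a retry budget so that retries do not amplify an overload.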

Hard | Technical
66 practiced
You observe a cascading failure: a traffic spike on the auth service slows DB queries, the overloaded DB causes cache misses, and the misses increase DB load further. Design a multi-layer mitigation plan that combines circuit breakers, adaptive rate limiting, backpressure, cache priming, and automated runbook actions to stop the cascade within 60 seconds. Include detection logic, the priority order of actions, and rollback conditions.
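One possible building block for the adaptive rate-limiting part of such a plan is an additive-increase, multiplicative-decrease (AIMD) concurrency limit driven by observed latency. The sketch below is only an illustration: the class name `AimdLimit`, the 200 ms latency target, and the halving factor are all assumed values, not figures from the question.

```python
class AimdLimit:
    """Concurrency limit that shrinks fast under stress and grows back slowly."""

    def __init__(self, initial=100, minimum=1, maximum=1000,
                 target_latency=0.2, decrease=0.5):
        self.limit = initial
        self.minimum = minimum
        self.maximum = maximum
        self.target_latency = target_latency  # seconds; assumed target
        self.decrease = decrease              # multiplicative back-off factor

    def update(self, observed_latency):
        """Feed one latency sample (seconds) and return the new limit."""
        if observed_latency > self.target_latency:
            # Downstream looks overloaded: cut the allowed concurrency sharply.
            self.limit = max(self.minimum, int(self.limit * self.decrease))
        else:
            # Healthy sample: probe for more capacity one slot at a time.
            self.limit = min(self.maximum, self.limit + 1)
        return self.limit


limiter = AimdLimit(initial=100)
limiter.update(0.5)   # slow sample, limit drops to 50
limiter.update(0.1)   # fast sample, limit rises to 51
```

Requests beyond the current limit would be rejected or queued at the edge, which is what turns the measurement into backpressure. The slow recovery also gives a natural rollback condition: the limit returns to normal only while latency stays under the target.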
Medium | Technical
85 practiced
Implement a circuit-breaker half-open strategy in which trial windows are scheduled with exponentially increasing sizes (try 1 request after 30s, then 2 requests after 60s, then 4 requests). In your preferred language, implement a thread-safe mechanism that counts trials, resets on success, and schedules the next trial window. Explain how to avoid a thundering herd during half-open transitions.
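A minimal Python sketch of one way to read this question follows. It rests on several assumptions: a failed trial window re-opens the breaker and doubles both the wait and the trial budget for the next window; a window in which every admitted trial succeeds closes the breaker and resets the schedule; and failure detection in the closed state happens elsewhere and calls `trip()`. The class and method names are hypothetical.

```python
import threading
import time


class HalfOpenBreaker:
    CLOSED, OPEN, HALF_OPEN = "closed", "open", "half_open"

    def __init__(self, base_wait=30.0, clock=time.monotonic):
        self._lock = threading.Lock()
        self._clock = clock          # injectable so tests can fake time
        self._base_wait = base_wait
        self.state = self.CLOSED
        self._level = 0              # window k: wait base_wait * 2**k, admit 2**k trials
        self._retry_at = 0.0
        self._admitted = 0
        self._succeeded = 0

    def _open_locked(self, level):
        # Caller must hold the lock. Schedules the next trial window.
        self.state = self.OPEN
        self._level = level
        self._retry_at = self._clock() + self._base_wait * (2 ** level)
        self._admitted = 0
        self._succeeded = 0

    def trip(self):
        """Open the breaker; called by an external failure detector."""
        with self._lock:
            self._open_locked(0)

    def allow(self):
        """Return True if the caller may send a request right now."""
        with self._lock:
            if self.state == self.CLOSED:
                return True
            if self.state == self.OPEN:
                if self._clock() < self._retry_at:
                    return False
                self.state = self.HALF_OPEN
            # Half-open: admit at most 2**level trials, reject everyone else.
            if self._admitted < 2 ** self._level:
                self._admitted += 1
                return True
            return False

    def record(self, success):
        """Report the outcome of a trial request admitted by allow()."""
        with self._lock:
            if self.state != self.HALF_OPEN:
                return
            if not success:
                self._open_locked(self._level + 1)  # bigger window, longer wait
                return
            self._succeeded += 1
            if self._succeeded >= 2 ** self._level:
                self.state = self.CLOSED            # all trials passed: reset
                self._level = 0
```

Because the trial budget is checked and incremented under the same lock, only the budgeted number of callers get through when the window opens, and all others fail fast. That is the main guard against a thundering herd here. Adding random jitter to the wait would further help when many breaker instances opened at the same moment. This sketch does not tag outcomes with a window id, so a very late `record()` from an earlier window could be counted against the current one.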
Hard | System Design
137 practiced
Design a robust health-check and probing system for 1000 microservices that detects multi-layer failures (process crash, downstream DB error, network partition) while minimizing false positives and flapping. Requirements: low probe overhead, dynamic probe definitions, integration with alerts/auto-remediation, and probe backoff to avoid overload. Describe architecture and probe types.
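The "minimizing false positives and flapping" requirement is often approached with hysteresis: a target changes state only after several consecutive probe results disagree with its current state. The sketch below shows that idea in isolation. The class name `HealthState` and the `rise` and `fall` parameters with their default values are illustrative assumptions.

```python
class HealthState:
    """Flip to unhealthy after `fall` consecutive failed probes,
    and back to healthy after `rise` consecutive successful probes."""

    def __init__(self, rise=2, fall=3):
        self.rise = rise
        self.fall = fall
        self.healthy = True
        self._streak = 0  # consecutive results that disagree with current state

    def observe(self, probe_ok):
        """Record one probe result and return the (possibly updated) state."""
        if probe_ok == self.healthy:
            # Result agrees with the current state: any pending flip is cancelled.
            self._streak = 0
        else:
            self._streak += 1
            needed = self.rise if probe_ok else self.fall
            if self._streak >= needed:
                self.healthy = probe_ok
                self._streak = 0
        return self.healthy


state = HealthState(rise=2, fall=3)
history = [state.observe(r) for r in
           [True, False, True, False, False, False, True, True]]
# A single failed probe is absorbed; three in a row flip the state.
```

A full design would keep one such state per probe type (process liveness, dependency checks, network reachability) so that alerts can distinguish which layer failed.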
Hard | Technical
63 practiced
You are paged for a high-severity incident: a media streaming service shows elevated global error rates and timeouts; traces show increased latency to the CDN and origin-region network. As SRE lead, outline your incident response plan: detection, triage, immediate mitigations, stakeholder communication, role assignments, and what you'd capture for post-incident analysis focusing on multi-layer failures.
Medium | Technical
76 practiced
How would you tune circuit-breaker thresholds so that they don't trip during valid traffic spikes (for example, promotional events) while still protecting the system during real upstream failures? Describe techniques such as dynamic thresholds, load-aware windows, baseline adjustments, and anomaly-detection integration.
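One way to make the "dynamic thresholds" and "baseline adjustments" ideas concrete is to trip on error rate rather than error count, require a minimum request volume, and compare against a slowly learned baseline. The following Python sketch assumes per-window request and error counts are available. The class name `AdaptiveErrorThreshold` and all default values are hypothetical.

```python
class AdaptiveErrorThreshold:
    """Decide whether a window's error rate is abnormal relative to a baseline."""

    def __init__(self, alpha=0.1, multiplier=3.0, floor=0.05, min_requests=20):
        self.alpha = alpha                # EWMA smoothing factor for the baseline
        self.multiplier = multiplier      # trip when rate exceeds multiplier * baseline
        self.floor = floor                # never trip below this absolute error rate
        self.min_requests = min_requests  # ignore windows with too little traffic
        self.baseline = 0.0               # learned "normal" error rate

    def should_trip(self, requests, errors):
        """Return True if this window's error rate warrants opening the breaker."""
        if requests < self.min_requests:
            return False
        rate = errors / requests
        threshold = max(self.floor, self.multiplier * self.baseline)
        trip = rate > threshold
        if not trip:
            # Learn only from windows judged normal, so an outage
            # does not drag the baseline upward.
            self.baseline += self.alpha * (rate - self.baseline)
        return trip


t = AdaptiveErrorThreshold()
t.should_trip(1_000, 10)     # normal traffic, 1% errors
t.should_trip(10_000, 100)   # 10x traffic spike, same 1% error rate
t.should_trip(1_000, 300)    # 30% errors, a real failure
```

Because the decision depends on the ratio, a promotional spike that raises absolute error counts in proportion to traffic does not trip the breaker, while a jump in the error rate does.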
