Fault Tolerance and Failure Scenarios Questions

This topic covers designing systems that stay resilient when components fail, using timeouts, retries with exponential backoff, circuit breakers, and bulkheads. Be ready to discuss cascading failure prevention and graceful degradation. At Staff level, demonstrate thinking about multi-layer failures (service failures, database failures, network partitions) and how to detect and recover from them.
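As a warm-up for the retry portion of this topic, here is a minimal sketch of retries with exponential backoff and full jitter. The function name `retry_with_backoff` and all of its parameters are hypothetical, chosen for illustration; the `sleep` and `rng` arguments are injectable only so the behavior can be checked without real delays.

```python
import random
import time


def retry_with_backoff(op, max_attempts=5, base_delay=0.1, max_delay=5.0,
                       sleep=time.sleep, rng=random.random):
    """Call op() until it succeeds or max_attempts is exhausted.

    The delay cap doubles after each failed attempt (exponential backoff).
    The actual delay is a random fraction of that cap (full jitter), so
    many clients retrying at once do not synchronize their retries.
    """
    for attempt in range(max_attempts):
        try:
            return op()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(rng() * cap)
```

In a real service, the retry would typically be limited to errors known to be transient, and each call to `op` would carry its own timeout so a hung dependency cannot stall the loop.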

Hard | Technical

You observe a cascading failure: a traffic spike to the auth service slows DB queries, the DB overload causes cache misses, and the misses increase DB load further. Design a multi-layer mitigation plan combining circuit breakers, adaptive rate limiting, backpressure, cache priming, and automated runbook actions to stop the cascade within 60 seconds. Include detection logic, priority order of actions, and rollback conditions.

Medium | Technical

Implement a circuit-breaker half-open strategy where trial windows are scheduled with exponentially increasing sizes (try 1 request after 30s, then 2 requests after 60s, then 4 requests). In your preferred language, implement a thread-safe mechanism to count trials, reset on success, and schedule the next trial window. Explain how to avoid a thundering herd during half-open transitions.
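One possible shape for an answer, as a minimal sketch rather than a definitive implementation. It assumes one reading of the question: window k admits 2^k trial requests after a wait of 30 * 2^k seconds, any failed trial reopens the breaker and moves to the next (larger, later) window, and a fully successful window closes the breaker and resets the schedule. The class name `HalfOpenBreaker`, its methods, and the `max_level` cap are hypothetical.

```python
import threading
import time


class HalfOpenBreaker:
    """Circuit breaker with exponentially growing half-open trial windows."""

    def __init__(self, base_delay=30.0, max_level=5, clock=time.monotonic):
        self._lock = threading.Lock()  # guards all mutable state below
        self._clock = clock
        self._base_delay = base_delay
        self._max_level = max_level
        self._state = "closed"
        self._level = 0        # index of the next trial window
        self._permits = 0      # trial requests still allowed in this window
        self._pending = 0      # successes still needed to close
        self._next_trial_at = 0.0

    def _open_locked(self):
        # Caller must hold the lock. Schedules the next trial window.
        self._state = "open"
        self._next_trial_at = self._clock() + self._base_delay * (2 ** self._level)

    def trip(self):
        """Open the breaker (called by whatever failure detector is in use)."""
        with self._lock:
            self._open_locked()

    def allow(self):
        """Return True if the caller may send a request right now."""
        with self._lock:
            if self._state == "closed":
                return True
            if self._state == "open" and self._clock() >= self._next_trial_at:
                self._state = "half_open"
                self._permits = self._pending = 2 ** self._level
            if self._state == "half_open" and self._permits > 0:
                self._permits -= 1  # hand out one trial permit
                return True
            return False

    def record(self, success):
        """Report the outcome of a trial request admitted by allow()."""
        with self._lock:
            if self._state != "half_open":
                return
            if not success:
                self._level = min(self._level + 1, self._max_level)
                self._open_locked()
                return
            self._pending -= 1
            if self._pending == 0:  # whole window succeeded: close and reset
                self._state = "closed"
                self._level = 0
```

The permit counter is what limits a thundering herd here: when the window opens, only `2 ** level` callers get through and everyone else is rejected immediately. Adding random jitter to `_next_trial_at` would further spread trial windows across many breaker instances. The sketch does not handle a slow trial result that arrives during a later window; tagging each permit with a window id would be one way to address that.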

Hard | System Design

Design a robust health-check and probing system for 1000 microservices that detects multi-layer failures (process crash, downstream DB error, network partition) while minimizing false positives and flapping. Requirements: low probe overhead, dynamic probe definitions, integration with alerts/auto-remediation, and probe backoff to avoid overload. Describe architecture and probe types.

Hard | Technical

You are paged for a high-severity incident: a media streaming service shows elevated global error rates and timeouts; traces show increased latency to the CDN and origin-region network. As SRE lead, outline your incident response plan: detection, triage, immediate mitigations, stakeholder communication, role assignments, and what you'd capture for post-incident analysis focusing on multi-layer failures.

Medium | Technical

How would you tune circuit-breaker thresholds so they don't trip during valid traffic spikes (for example, promotional events) while still protecting the system during real upstream failures? Describe techniques such as dynamic thresholds, load-aware windows, baseline adjustments, and anomaly-detection integration.
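To make the dynamic-threshold and baseline ideas concrete, here is a minimal sketch under stated assumptions: the breaker is evaluated once per window on an error rate rather than an absolute error count, a minimum request volume guards against noisy small samples, and the baseline is an exponentially weighted moving average updated only by healthy windows. The class name `AdaptiveThreshold` and every default value are hypothetical.

```python
class AdaptiveThreshold:
    """Decide per window whether a breaker should trip, relative to a baseline."""

    def __init__(self, alpha=0.1, margin=0.2, min_requests=50,
                 initial_baseline=0.01):
        self.alpha = alpha                # EWMA smoothing factor
        self.margin = margin              # allowed rise above baseline
        self.min_requests = min_requests  # ignore windows with too little data
        self.baseline = initial_baseline  # learned normal error rate

    def evaluate(self, errors, total):
        """Return True if this window's error rate warrants tripping."""
        if total < self.min_requests:
            return False  # too few samples to judge
        rate = errors / total
        if rate > self.baseline + self.margin:
            return True   # anomalous window: do not fold it into the baseline
        # Healthy window: move the baseline slowly toward the observed rate.
        self.baseline = (1 - self.alpha) * self.baseline + self.alpha * rate
        return False
```

Because the decision uses a rate, a promotional spike that multiplies traffic tenfold while keeping the same error percentage does not trip the breaker, whereas a genuine upstream failure pushes the rate well above the baseline plus margin.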
