InterviewStack.io

Operational Excellence and Resilience Questions

Design and operationalize systems and processes that deliver efficient, cost-effective, and resilient service. Cover approaches to cost optimization and right-sizing, automation and self-healing, monitoring and observability, service level objectives and agreements, incident response and disaster recovery planning, chaos engineering and resilience testing, capacity planning, and continuous improvement practices. Candidates should explain trade-offs between cost and reliability, instrumentation and alerting strategies, and how to measure and improve operational maturity.

Medium | Technical
Instrumentation increases observability but also adds compute and storage overhead. Explain the trade-offs between data fidelity and performance/cost. How would you choose sampling rates for traces and logs, where to apply cardinality limits, and what techniques to use to preserve critical signals while reducing cost?
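
One way to frame an answer to the sampling part of this question is a sampler that always keeps critical signals (errors and slow requests) and keeps only a fixed fraction of everything else. The sketch below is illustrative only: the function name `keep_trace`, the 5% base rate, and the 1000 ms slow threshold are assumptions, not values from any particular tracing system.

```python
import hashlib


def keep_trace(trace_id: str, is_error: bool, duration_ms: float,
               base_rate: float = 0.05, slow_ms: float = 1000.0) -> bool:
    """Decide whether to retain a finished trace (hypothetical policy)."""
    # Critical signals bypass sampling entirely.
    if is_error or duration_ms >= slow_ms:
        return True
    # Hash the trace ID so every service reaches the same decision for
    # the same trace, without coordination.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    # Integer comparison avoids float rounding at the edges (0.0 and 1.0).
    return bucket < int(base_rate * 2**64)
```

Because the decision needs the error flag and the duration, this is a tail-style decision made after the request completes. Deciding at the start of a request would only have the trace ID available and so could not guarantee that errors are kept.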
Medium | Technical
You observe a sudden increase in 99th-percentile latency across multiple microservices shortly after a deploy. Create a prioritized 30-minute triage checklist that narrows down scope and probable causes, including specific metrics, logs, and tracing queries to execute and what rapid mitigations you might apply while investigating.
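
An early triage step is narrowing scope: which services actually regressed, and by how much. A minimal sketch, assuming latency samples per service have already been exported for a window before and after the deploy (the function names, the input shape, and the 1.5x ratio are all hypothetical):

```python
def p99(samples):
    """Nearest-rank 99th percentile of a non-empty collection."""
    ordered = sorted(samples)
    # Integer ceiling of 0.99 * n, avoiding float rounding.
    rank = -(-99 * len(ordered) // 100)
    return ordered[rank - 1]


def regressed_services(before, after, ratio=1.5):
    """Return (service, growth) pairs whose p99 grew by >= ratio, worst first.

    `before` and `after` map service name -> list of latencies (same unit).
    """
    changes = []
    for service, post in after.items():
        pre = before.get(service)
        if not pre or not post:
            continue  # no baseline or no new data: cannot compare
        baseline = p99(pre)
        if baseline == 0:
            continue  # avoid dividing by a zero baseline
        growth = p99(post) / baseline
        if growth >= ratio:
            changes.append((service, round(growth, 2)))
    return sorted(changes, key=lambda item: item[1], reverse=True)
```

If every service regresses by a similar factor, a shared dependency or platform change is a more likely cause than any single service's code.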
Medium | Technical
Write a concise runbook that an on-call engineer can execute to restore service when a read replica becomes read-only or when replication lag exceeds acceptable thresholds. Include detection thresholds, commands to inspect replication state, immediate mitigations (e.g., redirecting reads), steps to promote or reconfigure replicas, and post-recovery verification checks.
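
The detection and immediate-mitigation parts of such a runbook can be expressed as a small decision table. The sketch below assumes lag has already been measured by whatever the database exposes; the thresholds (30 s warning, 120 s critical), action names, and the `ReplicaStatus` type are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass
class ReplicaStatus:
    name: str
    lag_seconds: float
    replicating: bool  # False if the replication stream has stopped


def plan_actions(replicas, warn_s=30.0, critical_s=120.0):
    """Map each replica to an action and say where reads should go."""
    actions = {}
    for r in replicas:
        if not r.replicating:
            actions[r.name] = "remove-from-read-pool-and-page"
        elif r.lag_seconds >= critical_s:
            actions[r.name] = "remove-from-read-pool"
        elif r.lag_seconds >= warn_s:
            actions[r.name] = "alert-only"
        else:
            actions[r.name] = "ok"
    # If no replica can still serve reads, fall back to the primary.
    serving = [n for n, a in actions.items() if a in ("ok", "alert-only")]
    read_target = "replicas" if serving else "primary"
    return actions, read_target
```

Routing all reads to the primary is itself a risk during peak load, so a real runbook would pair that fallback with a check on primary headroom.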
Medium | System Design
Design a backup and restore strategy for a relational database-backed service with RPO = 15 minutes and RTO = 1 hour during peak load. Describe backup types (snapshots, WAL shipping), replication topology, storage choices, how to perform point-in-time recovery, how to validate backups, and the runbook for performing a restore in production.
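
A design answer is easier to defend with a quick budget check against the stated RPO of 15 minutes and RTO of 1 hour. The sketch below is back-of-the-envelope arithmetic only; every input (archive interval, throughput figures, fixed overhead) is a hypothetical number that would have to be measured in a restore drill.

```python
def worst_case_rpo_minutes(wal_archive_interval_min, upload_delay_min=0.0):
    """Worst-case data loss: the newest WAL not yet safely archived."""
    return wal_archive_interval_min + upload_delay_min


def estimated_rto_minutes(db_size_gb, restore_gb_per_min,
                          wal_to_replay_gb, replay_gb_per_min,
                          fixed_overhead_min):
    """Base backup restore + WAL replay + fixed steps (provisioning, checks)."""
    return (db_size_gb / restore_gb_per_min
            + wal_to_replay_gb / replay_gb_per_min
            + fixed_overhead_min)


# Hypothetical figures: WAL archived every 5 min with 1 min upload delay;
# 500 GB database restored at 20 GB/min; 10 GB of WAL replayed at 1 GB/min;
# 10 min of fixed overhead.
rpo = worst_case_rpo_minutes(5, 1)                 # 6 minutes
rto = estimated_rto_minutes(500, 20, 10, 1, 10)    # 45.0 minutes
meets_targets = rpo <= 15 and rto <= 60
```

The WAL replay term grows with the time since the last base backup, so the base backup frequency is what keeps the RTO estimate bounded.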
Medium | Technical
You need to add self-healing capabilities to a microservice without modifying its core business logic. Provide several approaches (infrastructure-level, sidecars, proxies, orchestration) with concrete implementation details, examples, and trade-offs. Explain how you would handle stateful connections and in-flight requests during automated healing actions.
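
For the in-flight request part of this question, one common pattern is to drain before restarting: stop sending new traffic, wait for active requests to finish or for a deadline, then restart. The sketch below models only that decision logic as it might run in a sidecar or supervisor; the `Healer` class, its thresholds, and the action names are hypothetical, and the caller is assumed to supply health results, in-flight counts, and a clock.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Healer:
    failure_threshold: int = 3      # consecutive failed checks before acting
    drain_timeout_s: float = 30.0   # max time to wait for in-flight requests
    failures: int = 0
    draining_since: Optional[float] = None

    def step(self, healthy: bool, in_flight: int, now: float) -> str:
        """Return 'serve', 'drain', or 'restart' for one health-check tick."""
        if self.draining_since is not None:
            timed_out = now - self.draining_since >= self.drain_timeout_s
            if in_flight == 0 or timed_out:
                # Safe (or forced) to restart; reset state for the new instance.
                self.failures = 0
                self.draining_since = None
                return "restart"
            return "drain"  # keep new traffic away, let requests finish
        self.failures = 0 if healthy else self.failures + 1
        if self.failures >= self.failure_threshold:
            self.draining_since = now
            return "drain"
        return "serve"
```

Requiring several consecutive failures trades slower healing for fewer restarts caused by a single transient check failure, and the drain timeout bounds how long a stuck connection can delay recovery.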
