InterviewStack.io

Operational Excellence and Resilience Questions

Design and operationalize systems and processes that deliver efficient, cost-effective, and resilient services. Cover approaches to cost optimization and right-sizing, automation and self-healing, monitoring and observability, service-level objectives and agreements, incident response and disaster recovery planning, chaos engineering and resilience testing, capacity planning, and continuous improvement practices. Candidates should explain trade-offs between cost and reliability, instrumentation and alerting strategies, and how to measure and improve operational maturity.

Easy · Technical
34 practiced
Explain RTO (Recovery Time Objective) and RPO (Recovery Point Objective). Provide two practical scenarios where you would accept a longer RPO to reduce cost and two scenarios where you would insist on a shorter RTO/RPO despite higher cost. For each scenario, briefly justify the decision and describe its impact on the architecture.
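One way to ground an answer is to put numbers on both objectives. The sketch below is illustrative only: the `RecoveryPlan` class, its field names, and the sample durations are assumptions made for this example, not part of the question. It treats worst-case data loss as one full backup interval (compared against the RPO) and recovery time as detection plus restore plus validation (compared against the RTO).

```python
from dataclasses import dataclass


@dataclass
class RecoveryPlan:
    """Hypothetical backup/restore setup; all durations are in minutes."""
    backup_interval_min: float  # time between successive backups or snapshots
    detection_min: float        # time to notice the outage
    restore_min: float          # time to restore data and restart the service
    validation_min: float       # time to verify the service before reopening traffic

    def worst_case_data_loss_min(self) -> float:
        # A failure just before the next backup loses a full interval of writes.
        return self.backup_interval_min

    def estimated_recovery_min(self) -> float:
        # Simplifying assumption: the three phases run one after another.
        return self.detection_min + self.restore_min + self.validation_min

    def meets(self, rpo_min: float, rto_min: float) -> bool:
        # True only if both the data-loss and the downtime targets are satisfied.
        return (self.worst_case_data_loss_min() <= rpo_min
                and self.estimated_recovery_min() <= rto_min)


# Cheap option: nightly backups restored by hand (sample numbers, not measurements).
nightly = RecoveryPlan(backup_interval_min=24 * 60, detection_min=10,
                       restore_min=180, validation_min=30)

# Expensive option: near-continuous replication with automated failover.
replicated = RecoveryPlan(backup_interval_min=1, detection_min=2,
                          restore_min=5, validation_min=3)

print(nightly.meets(rpo_min=24 * 60, rto_min=240))  # relaxed targets
print(nightly.meets(rpo_min=15, rto_min=30))        # strict targets
print(replicated.meets(rpo_min=5, rto_min=15))
```

The comparison makes the cost trade-off concrete: the nightly plan is acceptable only when a day of lost writes is tolerable, while the strict targets force replication and automated failover into the architecture.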
Hard · System Design
31 practiced
Design a GitOps-based runbook automation system that manages blue/green and canary deployments and triggers automated rollback when SLOs degrade. Include repository layout, CI/CD integration, canary analysis strategy, promotion criteria, how a runbook is invoked automatically or manually during an incident, and how audit logs are stored.
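The promotion and rollback criteria in a design like this usually reduce to a small decision function evaluated at each canary step. The following is a minimal sketch under stated assumptions: the function name, the thresholds (a 1% error-rate SLO, a 1.5x allowed increase over baseline, a 500-request minimum sample), and the three-way outcome are all hypothetical choices for illustration, not values taken from any particular tool.

```python
from enum import Enum


class Decision(Enum):
    PROMOTE = "promote"
    HOLD = "hold"
    ROLLBACK = "rollback"


def canary_decision(canary_errors: int, canary_requests: int,
                    baseline_errors: int, baseline_requests: int,
                    slo_error_rate: float = 0.01,
                    max_relative_increase: float = 1.5,
                    min_requests: int = 500) -> Decision:
    """Decide the next step for a canary from raw request and error counts."""
    # Too little traffic to judge: keep the canary at its current weight.
    if canary_requests < min_requests:
        return Decision.HOLD

    canary_rate = canary_errors / canary_requests
    baseline_rate = baseline_errors / baseline_requests if baseline_requests else 0.0

    # Absolute check: the canary itself violates the SLO threshold.
    if canary_rate > slo_error_rate:
        return Decision.ROLLBACK

    # Relative check: the canary is markedly worse than the stable version.
    if baseline_rate > 0 and canary_rate > baseline_rate * max_relative_increase:
        return Decision.ROLLBACK

    return Decision.PROMOTE


print(canary_decision(1, 100, 10, 10_000))    # not enough canary traffic yet
print(canary_decision(20, 1_000, 10, 10_000)) # 2% errors exceeds the 1% SLO
print(canary_decision(2, 1_000, 20, 10_000))  # on par with baseline
```

In a GitOps setup the thresholds would live in the repository next to the deployment manifests, so that changing a promotion criterion is itself a reviewed and audited commit.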
Easy · Technical
40 practiced
Explain the practical difference between monitoring and observability. Provide a realistic example in which alerts generated by monitoring are insufficient to diagnose a cascading failure, and explain how observability (traces plus high-cardinality logs) helps find the root cause.
Hard · Technical
57 practiced
Propose a disaster recovery plan for migrating critical on-prem systems to the cloud over 12 months. Cover phases (assessment, pilot, staged migration), cutover strategies (parallel run, lift-and-shift, replatform), data replication during migration, rollback options if cutover fails, runbooks for each stage, and business continuity measures to maintain operations during migration weekends.
Easy · Technical
40 practiced
Describe horizontal vs. vertical autoscaling and scheduled scaling. For a stateful containerized service (session affinity plus a local cache), which autoscaling strategy would you prefer, and why? Include operational considerations such as scaling cooldowns, state migration, and capacity planning.
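The cooldown consideration can be illustrated with a short sketch of a horizontal scaling rule. The replica calculation follows the proportional formula documented for the Kubernetes Horizontal Pod Autoscaler (current replicas times the ratio of observed to target metric, rounded up, with a tolerance band); the function names, the replica bounds, and the 300-second scale-down cooldown used here are assumptions for this example.

```python
import math


def desired_replicas(current: int, metric: float, target: float,
                     tolerance: float = 0.1,
                     min_replicas: int = 1, max_replicas: int = 20) -> int:
    """Proportional scaling: replicas grow with the ratio of observed to target load."""
    ratio = metric / target
    # Inside the tolerance band, do nothing, which avoids flapping on small noise.
    if abs(ratio - 1.0) <= tolerance:
        return current
    return max(min_replicas, min(max_replicas, math.ceil(current * ratio)))


def apply_cooldown(current: int, desired: int,
                   seconds_since_last_scale_down: float,
                   cooldown_s: float = 300) -> int:
    """Delay scale-downs during the cooldown window; scale-ups pass through."""
    if desired < current and seconds_since_last_scale_down < cooldown_s:
        return current
    return desired


# CPU at 90% against a 60% target: scale out from 4 replicas.
print(desired_replicas(current=4, metric=90, target=60))

# Load halves, but the last scale-down was 60 s ago, so the count is held.
wanted = desired_replicas(current=10, metric=30, target=60)
print(apply_cooldown(current=10, desired=wanted, seconds_since_last_scale_down=60))
```

For the stateful service in the question, the asymmetric cooldown matters more than usual: every scale-down evicts sessions and discards a warm local cache, so holding replicas a little longer is often cheaper than repeatedly rebuilding state.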
