InterviewStack.io

Reliability and Operational Excellence Questions

Covers design and operational practices for building and running reliable software systems and for achieving operational maturity. Topics include defining, measuring, and using Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Service Level Agreements (SLAs); establishing error budget policies and reliability governance; and measuring incident impact and using error budgets to prioritize work. Architectural and operational techniques for improving availability at scale are also covered: redundancy, failover, graceful degradation, disaster recovery, capacity planning, resilience patterns, and technical debt management. Operational practices include observability, monitoring, alerting, runbooks, incident response and post-incident analysis, release gating, and reliability-driven prioritization. Proactive resilience practices such as fault injection and chaos engineering are included, as are the trade-offs between reliability, cost, and development velocity and the scaling of reliability practices across teams and organizations, so that both hands-on and senior-level discussions are represented.

Medium | Technical
Write a Terraform (HCL) snippet that creates an AWS CloudWatch alarm on a custom metric 'api_error_rate' that triggers when the error rate exceeds 0.5% for 5 consecutive minutes, plus an SNS topic 'slo-breach-alerts' to notify subscribers. Include the required resource blocks and mention any IAM assumptions needed for CloudWatch to publish to SNS.
Medium | System Design
Design SLOs for an e-commerce checkout service that depends on payment gateway, inventory, and search systems. Explain which SLOs map to business KPIs (e.g., conversion rate), which are service-level, how to handle downstream failures, and how to surface these SLOs in dashboards for product and engineering stakeholders.
Hard | Technical
Implement a concurrency-safe circuit breaker in Go (Golang) that supports closed, open, and half-open states. Requirements: use a sliding window to compute error rate over the last N requests or T seconds, allow configuration for error threshold and timeout, and provide thread-safe state transitions. You may provide idiomatic Go pseudocode or real code and explain concurrency primitives used.
Medium | Technical
Design a synthetic monitoring strategy for a global SaaS product: decide what checks to run (availability, auth, checkout flow), where to run them (global locations), frequency, script complexity, cost trade-offs, and how to use synthetic results alongside real-user metrics to detect regressions earlier.
Medium | System Design
Create a runbook template for an on-call responder handling a payment gateway outage. The template must include a triage checklist, health checks and commands, mitigation steps (circuit breakers, fallback payment methods), rollback criteria, stakeholder communication templates, and post-incident follow-ups.
