InterviewStack.io

Incident Response and Runbook Design Questions

This topic covers the design and operation of incident response programs and the creation and maintenance of actionable runbooks and playbooks for production systems. Candidates should be able to explain the incident lifecycle from detection and classification through investigation, escalation, remediation, and post-incident analysis. Topics include severity definitions and assessment, escalation procedures, team roles and responsibilities, communication protocols during incidents, on-call rotations, alert triage, and cross-team coordination during outages. The topic also includes designing automated remediation steps where appropriate, integrating runbooks with monitoring and alerting systems, maintaining playbooks for common failure modes such as malware, data exfiltration, denial of service, and account compromise, and conducting blameless post-incident reviews that feed continuous improvement. Candidates should be able to discuss metrics for measuring response effectiveness, such as mean time to detect, mean time to repair, and response success rate, and describe approaches to improving those metrics over time.
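As a rough illustration of the metrics named above, the following sketch computes mean time to detect and mean time to repair from a list of incident records. The `Incident` fields are hypothetical, and the sketch assumes repair time is measured from detection to resolution; some teams measure it from the start of the fault instead.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    """Hypothetical incident record; field names are assumptions."""
    started: datetime   # when the fault actually began
    detected: datetime  # when monitoring or a human noticed it
    resolved: datetime  # when service was restored


def mean_time_to_detect(incidents):
    """Average delay between fault start and detection."""
    seconds = [(i.detected - i.started).total_seconds() for i in incidents]
    return timedelta(seconds=mean(seconds))


def mean_time_to_repair(incidents):
    """Average delay between detection and resolution (assumed definition)."""
    seconds = [(i.resolved - i.detected).total_seconds() for i in incidents]
    return timedelta(seconds=mean(seconds))
```

Tracking both values per severity level, rather than as one global average, usually makes it easier to see whether detection or remediation is the slower step.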

Medium · System Design
67 practiced
Design an integration between Prometheus Alertmanager, PagerDuty, and a runbook execution platform so that specific alerts can automatically present the correct runbook, allow one-click play execution, record execution logs, and prevent unsafe automated actions. Sketch the architecture, data flows, authentication/authorization checks, and failure modes you must mitigate.
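One possible starting point for an answer is the routing step that maps an incoming alert to a runbook and decides whether one-click execution may be offered. The sketch below assumes the payload has already been parsed from an Alertmanager webhook (which carries an `alerts` list with `status` and `labels` per alert). The `RUNBOOKS` registry, the `auto_safe` flag, the `incident-commander` role, and the example URLs are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Runbook:
    url: str
    auto_safe: bool  # True only if reviewed as safe to run without extra approval


# Hypothetical registry keyed by the alert's `alertname` label.
RUNBOOKS = {
    "DiskAlmostFull": Runbook("https://runbooks.example.com/disk-almost-full", True),
    "PrimaryDbDown": Runbook("https://runbooks.example.com/primary-db-down", False),
}


def plan_actions(payload, operator_roles):
    """Return one plan per firing alert in a parsed webhook payload."""
    plans = []
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # resolved alerts need no runbook
        name = alert.get("labels", {}).get("alertname", "")
        runbook = RUNBOOKS.get(name)
        if runbook is None:
            # Unknown alert: never guess, hand it to a human.
            plans.append({"alert": name, "action": "page_human", "runbook": None})
            continue
        # Unsafe runbooks are executable only by a privileged role (assumed policy).
        may_execute = runbook.auto_safe or "incident-commander" in operator_roles
        action = "offer_one_click" if may_execute else "show_runbook_only"
        plans.append({"alert": name, "action": action, "runbook": runbook.url})
    return plans
```

Keeping the decision in a pure function like this makes the policy easy to unit test and to log alongside each execution record.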
Hard · Technical
68 practiced
Write a runbook for the scenario: the primary node of an ACID database cluster becomes unresponsive while secondaries are lagging. The runbook must include checks to measure lag, decision criteria for failover, steps to perform a safe failover, verification steps to avoid split-brain, and rollback procedures. Explain the trade-offs between RTO and potential data loss.
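The decision criteria for failover can be made explicit in a small function, which also exposes the trade-off between RTO and data loss as two tunable thresholds. This is a sketch under stated assumptions: replica lag is expressed in bytes of unreplayed log, and the parameter names and default values are hypothetical rather than taken from any particular database.

```python
def decide_failover(unresponsive_s, replica_lag_bytes,
                    min_unresponsive_s=30, max_loss_bytes=0):
    """Return (decision, candidate) for an unresponsive primary.

    unresponsive_s:     seconds the primary has failed health checks
    replica_lag_bytes:  mapping of replica name -> replication lag in bytes
    min_unresponsive_s: wait this long before acting, to ride out blips
    max_loss_bytes:     lag we accept losing; raising it shortens RTO
                        at the cost of potential data loss
    """
    if unresponsive_s < min_unresponsive_s:
        return ("wait", None)
    if not replica_lag_bytes:
        return ("escalate", None)  # no replica to promote
    # Prefer the replica that has replayed the most data.
    candidate = min(replica_lag_bytes, key=replica_lag_bytes.get)
    if replica_lag_bytes[candidate] <= max_loss_bytes:
        return ("failover", candidate)
    # Promotion would lose more than the policy allows: a human decides.
    return ("escalate", candidate)
```

With `max_loss_bytes=0` the function only promotes a fully caught-up replica; a larger value favors faster recovery and should be paired with fencing of the old primary to avoid split-brain.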
Hard · Technical
71 practiced
Describe how runbooks and security incident response playbooks should interact. For example, if a runbook detects malware behavior, when should it invoke the security playbook? Define clear boundaries, handoff criteria, and how to keep runbooks and security playbooks synchronized so responders know which to follow during mixed incidents.
Medium · System Design
73 practiced
You are asked to design a production runbook template for a payment-processing service (10k RPS) that must include automated pre-checks, human escalation steps, safe rollback, and verification. Provide a detailed runbook structure with sample commands/queries (no need to write executable code) and explain the rationale for each section and any safety guards you add to prevent data loss or double-processing.
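One safety guard against double-processing that an answer might mention is an idempotency key on every runbook step that touches payments. The sketch below is a minimal in-memory illustration; the class and method names are hypothetical, and a real system would presumably need a durable store with an atomic insert so the guard survives restarts and concurrent responders.

```python
class IdempotentExecutor:
    """Runs an action at most once per idempotency key (in-memory sketch)."""

    def __init__(self):
        self._results = {}  # key -> result of the first successful run

    def run(self, key, action):
        """Execute `action` unless `key` was already processed.

        Returns the stored result on a repeat call, so a retried or
        replayed runbook step cannot charge or refund twice.
        """
        if key in self._results:
            return self._results[key]
        result = action()  # if this raises, nothing is recorded and a retry is allowed
        self._results[key] = result
        return result
```

For example, calling `executor.run("refund-1234", issue_refund)` twice would invoke the hypothetical `issue_refund` only once and return the same result both times.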
Easy · Technical
112 practiced
Design short templates for incident communication: (A) an internal engineering standup update at T+15 minutes, and (B) an external status page message to customers at T+30 minutes for a partial outage affecting 20% of users. Each template should include: what to say, what not to say, update cadence, and who approves the external message.
