Incident Response and Runbook Design Questions

Covers the design and operation of incident response programs and the creation and maintenance of actionable runbooks and playbooks for production systems. Candidates should be able to explain the incident lifecycle from detection and classification through investigation, escalation, remediation, and post-incident analysis. Topics include severity definitions and assessment, escalation procedures, team roles and responsibilities, communication protocols during incidents, on-call rotations, alert triage, and cross-team coordination during outages. The topic also includes designing automated remediation steps where appropriate, integrating runbooks with monitoring and alerting systems, maintaining playbooks for common failure modes such as malware, data exfiltration, denial of service, and account compromise, and conducting blameless post-incident reviews that feed continuous improvement. Candidates should be able to discuss metrics for measuring response effectiveness, such as mean time to detect, mean time to repair, and response success rate, and describe approaches to improving those metrics over time.

Medium | Technical

Implement a small Python function that receives an alert as JSON (fields: alert_id, service, severity, labels) and returns the best-matching runbook id from a list of runbook metadata entries (each has runbook_id, service_tags, keywords). Use simple scoring: +2 when the alert's service matches one of the runbook's service tags, and +1 for each runbook keyword that appears in the alert's labels; break ties by severity preference. Provide the code in Python and explain its time complexity.
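One possible shape for an answer is sketched below. The question does not say how "severity preference" is encoded, so this sketch assumes a hypothetical optional `severities` list on each runbook entry and prefers, among equally scored runbooks, one that lists the alert's severity. The function name `match_runbook` and the handling of `labels` as either a dict or a list are also assumptions.

```python
from typing import Any, Dict, List, Optional


def match_runbook(alert: Dict[str, Any], runbooks: List[Dict[str, Any]]) -> Optional[str]:
    """Return the runbook_id with the highest score, or None if nothing scores."""
    service = alert.get("service")
    labels = alert.get("labels", [])
    # Normalize labels into a set of lowercase tokens; a dict contributes
    # both its keys and its values.
    if isinstance(labels, dict):
        tokens = {str(t).lower() for pair in labels.items() for t in pair}
    else:
        tokens = {str(t).lower() for t in labels}
    severity = str(alert.get("severity", "")).lower()

    best_key = None
    best_id = None
    for rb in runbooks:
        score = 0
        if service in rb.get("service_tags", []):
            score += 2  # service tag match
        # +1 for each distinct runbook keyword present in the alert labels
        score += sum(1 for kw in set(rb.get("keywords", [])) if str(kw).lower() in tokens)
        if score == 0:
            continue
        # Tie-break on the hypothetical "severities" field (an assumption,
        # not part of the question's schema).
        sev_match = severity in {str(s).lower() for s in rb.get("severities", [])}
        key = (score, sev_match)
        # Strict comparison keeps the earliest runbook when keys are equal.
        if best_key is None or key > best_key:
            best_key, best_id = key, rb["runbook_id"]
    return best_id
```

With L labels on the alert and R runbooks that each carry up to T service tags and K keywords, this runs in O(L + R·(T + K)) time: the label set is built once, and each keyword lookup against it is constant time on average.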

Medium | Technical

Given an incidents table with schema incidents(incident_id, service, detected_at, acknowledged_at, resolved_at), write a SQL query to compute MTTA and MTTR per service for the last 90 days. State the assumptions you make about rows where resolved_at is NULL and about time zones and daylight saving time.
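A minimal sketch of such a query, run through Python's `sqlite3` module so it can be tried locally. It assumes SQLite's dialect, timestamps stored as UTC text in `YYYY-MM-DD HH:MM:SS` form (which sidesteps daylight saving shifts), MTTA measured from `detected_at` to `acknowledged_at`, and MTTR measured from `detected_at` to `resolved_at`. Unresolved incidents are left out of MTTR because `AVG` ignores NULLs. The helper name `mtta_mttr` and the `:now` parameter are illustrative.

```python
import sqlite3

# julianday() returns fractional days, so multiplying by 1440 gives minutes.
QUERY = """
SELECT
    service,
    AVG((julianday(acknowledged_at) - julianday(detected_at)) * 1440.0) AS mtta_minutes,
    AVG((julianday(resolved_at) - julianday(detected_at)) * 1440.0) AS mttr_minutes
FROM incidents
WHERE julianday(detected_at) >= julianday(:now, '-90 days')
GROUP BY service
ORDER BY service
"""


def mtta_mttr(conn: sqlite3.Connection, now: str) -> list:
    """Return (service, mtta_minutes, mttr_minutes) rows for the 90 days before `now`."""
    return conn.execute(QUERY, {"now": now}).fetchall()
```

Passing `now` explicitly, rather than calling `datetime('now')` inside the query, keeps the window reproducible in tests. Other databases need their own interval arithmetic in place of `julianday`.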

Medium | Technical

Write a runbook for suspected account compromise of a privileged service account used by multiple microservices. Steps should include immediate containment, secret rotation, session invalidation, dependency mitigation, communication, and validation tests to verify the compromise has been resolved.

Medium | Technical

Describe the design considerations and safety mechanisms you would add before enabling an automated remediation action that restarts a backend service cluster. Cover idempotency, rate limiting, circuit breakers, human-approval gates, observability, and rollback strategies.
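Several of these guards can be illustrated with a small wrapper around the restart action. The sketch below is hypothetical throughout: the class name `GuardedRemediation`, its parameters, and the returned status strings are invented for illustration. It covers an idempotency check, a sliding-window rate limit, a circuit breaker on consecutive failures, and an optional approval callback. Observability and rollback are left out and would need to be layered on top.

```python
import time
from typing import Callable, List, Optional


class GuardedRemediation:
    """Wraps a remediation action with basic safety checks (illustrative only)."""

    def __init__(
        self,
        action: Callable[[], None],
        max_runs: int,
        window_seconds: float,
        failure_threshold: int,
        approver: Optional[Callable[[], bool]] = None,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._action = action
        self._max_runs = max_runs
        self._window = window_seconds
        self._failure_threshold = failure_threshold
        self._approver = approver
        self._clock = clock
        self._run_times: List[float] = []
        self._consecutive_failures = 0

    def run(self, is_healthy: Callable[[], bool]) -> str:
        now = self._clock()
        # Idempotency: do nothing if the target is already healthy.
        if is_healthy():
            return "skipped: already healthy"
        # Circuit breaker: stop retrying after repeated failures.
        if self._consecutive_failures >= self._failure_threshold:
            return "blocked: circuit open"
        # Rate limit: cap the number of runs inside the sliding window.
        self._run_times = [t for t in self._run_times if now - t < self._window]
        if len(self._run_times) >= self._max_runs:
            return "blocked: rate limited"
        # Human-approval gate, if one is configured.
        if self._approver is not None and not self._approver():
            return "blocked: approval denied"
        self._run_times.append(now)
        try:
            self._action()
        except Exception:
            self._consecutive_failures += 1
            return "failed: action raised"
        # Verify the outcome before declaring success.
        if is_healthy():
            self._consecutive_failures = 0
            return "remediated"
        self._consecutive_failures += 1
        return "failed: still unhealthy"
```

In this sketch the circuit only closes again after a successful remediation; a production design would more likely pair it with a cool-down timer or a manual reset, and would emit a log line or metric for every returned status.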

Easy | Technical

Define what a runbook is and how it differs from a playbook. Give three concrete examples of when you would use a runbook versus when you would use a playbook in an enterprise production environment.
