InterviewStack.io

Incident Classification and Severity Questions

Focuses on structured approaches to classifying incidents and assigning severity levels that drive the appropriate response, escalation, and communication. Covers defining severity criteria based on customer impact, affected services, scope of impact, and regulatory concerns; mapping severity to response playbooks and on-call rotations; establishing escalation paths and communication cadences; defining service level objectives and response time targets; coordinating cross-functional responders; and creating runbooks and automated tooling to enforce the framework. Also includes governance topics such as reviewing and refining severity definitions based on post-incident analyses, training responders on the framework, and adjusting thresholds to reduce false positives and ensure consistent prioritization.

Easy | Technical
30 practiced
Explain what "incident severity" means in the context of Site Reliability Engineering. Compare and contrast "severity" vs "priority" with concrete examples. Define at least four severity levels (e.g., Sev-0/Sev-1/Sev-2/Sev-3), give objective criteria for each (customer impact, data loss, regulatory exposure, scope), and provide one short real-world example incident for each level to illustrate the difference.
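One way to make "objective criteria" concrete in an answer is to express them as a small decision function. The sketch below is only an illustration: the `Impact` fields, the `classify` function, and every threshold (for example the 25% and 1% customer-impact cutoffs) are assumptions chosen for the example, not an industry standard.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    SEV0 = 0  # most severe
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3  # least severe


@dataclass
class Impact:
    customers_affected_pct: float  # share of customers affected, 0-100
    data_loss: bool
    regulatory_exposure: bool
    core_service_down: bool


def classify(impact: Impact) -> Severity:
    """Map observed impact to a severity level.

    All thresholds here are hypothetical; a real framework would
    tune them to its own services and obligations.
    """
    if impact.data_loss or impact.regulatory_exposure:
        return Severity.SEV0
    if impact.core_service_down or impact.customers_affected_pct >= 25:
        return Severity.SEV1
    if impact.customers_affected_pct >= 1:
        return Severity.SEV2
    return Severity.SEV3
```

Severity in this sketch depends only on impact. Priority, by contrast, would be a separate decision about scheduling the work, which is why a low-severity incident can still be handled first.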
Medium | Technical
51 practiced
You notice a recurring pattern where brief traffic spikes from noisy clients cause Sev-1 alerts. Outline a plan to reduce these false positives while ensuring real incidents are still reliably surfaced. Include immediate mitigations (cooldowns, client throttling), metric smoothing or aggregation, topology-aware alerting (per-customer vs global), and longer-term architectural changes to reduce blast radius.
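The cooldown and smoothing mitigations can be sketched as an evaluator that fires only when a breach is sustained across a sliding window, and then stays quiet for a cooldown period. The class name `SustainedBreachAlert` and all of its parameters are hypothetical and chosen for illustration; real alerting systems expose similar ideas under their own names.

```python
from collections import deque


class SustainedBreachAlert:
    """Fire only when at least `min_breaches` of the last `window`
    samples exceed `threshold`, and never more often than once per
    `cooldown_s` seconds. All parameters are illustrative."""

    def __init__(self, threshold, window, min_breaches, cooldown_s):
        self.threshold = threshold
        self.min_breaches = min_breaches
        self.cooldown_s = cooldown_s
        self.samples = deque(maxlen=window)  # True where a sample breached
        self.last_fired = None

    def observe(self, value, now):
        """Record one sample taken at time `now` (seconds).
        Returns True if the alert should fire."""
        self.samples.append(value > self.threshold)
        sustained = sum(self.samples) >= self.min_breaches
        in_cooldown = (
            self.last_fired is not None
            and now - self.last_fired < self.cooldown_s
        )
        if sustained and not in_cooldown:
            self.last_fired = now
            return True
        return False
```

With `window=5` and `min_breaches=3`, a single spiking sample never fires, while three breaching samples within five do. The trade-off is added detection latency, so the window should be short relative to the response time target for the severity level.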
Hard | Technical
30 practiced
After a postmortem you discover several incidents labeled Sev-2 should have been Sev-1. Propose a set of process and tooling changes to prevent future misclassifications: mandatory severity review cadence, severity-thermostat metrics, automated guardrails that flag misclassifications, improved training, and a lightweight appeals/audit process. Explain how you'd measure improvement and assign accountability.
Hard | Technical
33 practiced
During a major incident, multiple teams begin making independent mitigation changes, resulting in duplicated and conflicting actions. How would you ensure a coordinated response in which a single incident commander directs actions, duplicated mitigation is prevented, and all actions are logged? Describe the process changes, tooling (ChatOps, locks, change coordination), and cultural practices you would implement.
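The locking and logging part of such a design can be sketched as a small coordinator object. This is a hedged, in-memory illustration only: `IncidentCoordinator`, its methods, and the tuple log format are all hypothetical, and a real deployment would back this with a ChatOps bot and durable storage.

```python
from dataclasses import dataclass, field


@dataclass
class IncidentCoordinator:
    """Tracks who may act on which target during one incident.

    Hypothetical sketch: every request is logged, approved or not.
    """
    commander: str
    claims: dict = field(default_factory=dict)  # target -> responder
    log: list = field(default_factory=list)     # (responder, target, action, outcome)

    def claim(self, responder, target, action, approved_by):
        """Request to perform `action` on `target`. Returns True if granted."""
        if approved_by != self.commander:
            self.log.append((responder, target, action, "rejected: not approved by commander"))
            return False
        holder = self.claims.get(target)
        if holder is not None and holder != responder:
            self.log.append((responder, target, action, f"rejected: held by {holder}"))
            return False
        self.claims[target] = responder
        self.log.append((responder, target, action, "approved"))
        return True

    def release(self, responder, target):
        """Give up a claim so another responder can act on the target."""
        if self.claims.get(target) == responder:
            del self.claims[target]
            self.log.append((responder, target, "release", "released"))
```

Logging rejected requests as well as approved ones is a deliberate choice in this sketch: it gives the postmortem a record of the conflicting actions that were prevented.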
Medium | System Design
36 practiced
You operate an alert pipeline that generates 20,000 alerts per day. Propose a scalable architecture to triage and classify alerts so that only high-severity incidents page the human on-call. Include components (ingestion, deduplication, enrichment with deploy and topology metadata, rules engine, optional ML classifier, routing) and data flow, and describe the tradeoffs between rule-based and ML-based classification in terms of reliability, transparency, and operational overhead.
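The deduplication and rules-engine stages can be sketched in a few lines. Everything here is an assumption made for illustration: the `Alert` fields, the `tier` value standing in for enrichment metadata, the fingerprint choice, and the 5% error-rate threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    service: str
    check: str
    error_rate: float
    tier: str  # hypothetical topology metadata attached during enrichment


def fingerprint(alert):
    """Alerts with the same service and check count as duplicates."""
    return (alert.service, alert.check)


def deduplicate(alerts):
    """Keep one alert per fingerprint, preferring the worst error rate."""
    worst = {}
    for alert in alerts:
        key = fingerprint(alert)
        if key not in worst or alert.error_rate > worst[key].error_rate:
            worst[key] = alert
    return list(worst.values())


# Ordered rules: the first matching predicate decides the action.
RULES = [
    (lambda a: a.tier == "critical" and a.error_rate >= 0.05, "page"),
    (lambda a: a.error_rate >= 0.05, "ticket"),
]


def route(alert):
    """Return "page", "ticket", or "log" for a single alert."""
    for predicate, action in RULES:
        if predicate(alert):
            return action
    return "log"
```

An ordered rule list like this is easy to audit, which is the transparency argument for rule-based classification. An ML classifier could replace or supplement `route`, at the cost of explainability and extra operational overhead.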
