Incident Classification and Severity Questions

This topic covers structured approaches to classifying incidents and assigning severity levels to drive appropriate response, escalation, and communication. It includes defining severity criteria based on customer impact, affected services, scope of impact, and regulatory concerns; mapping severity to response playbooks and on-call rotations; establishing escalation paths and communication cadences; defining service level objectives and response time targets; coordinating cross-functional responders; and creating runbooks and automated tooling to enforce the framework. Governance topics include reviewing and refining severity definitions based on post-incident analyses, training responders on the framework, and adjusting thresholds to reduce false positives and ensure consistent prioritization.

Easy · Technical

30 practiced

Explain what "incident severity" means in the context of Site Reliability Engineering. Compare and contrast "severity" vs "priority" with concrete examples. Define at least four severity levels (e.g., Sev-0/Sev-1/Sev-2/Sev-3), give objective criteria for each (customer impact, data loss, regulatory exposure, scope), and provide one short real-world example incident for each level to illustrate the difference.
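One way to make "objective criteria" concrete in an answer is to express them as a small decision function. The sketch below is illustrative only: the `Impact` fields, the `classify` function, and every threshold (25%, 1%) are assumptions chosen for the example, not an established standard.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    SEV0 = 0  # most severe
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3


@dataclass
class Impact:
    customers_affected_pct: float  # share of customers affected, 0-100
    data_loss: bool
    regulatory_exposure: bool
    core_service_down: bool


def classify(impact: Impact) -> Severity:
    """Map impact signals to a severity level (hypothetical thresholds)."""
    # Data loss or regulatory exposure outranks every other signal.
    if impact.data_loss or impact.regulatory_exposure:
        return Severity.SEV0
    # A core-service outage or broad customer impact is the next tier.
    if impact.core_service_down or impact.customers_affected_pct >= 25:
        return Severity.SEV1
    # Limited but measurable customer impact.
    if impact.customers_affected_pct >= 1:
        return Severity.SEV2
    return Severity.SEV3
```

Severity here depends only on measured impact. Priority, by contrast, would be a separate decision about what to work on first, so a Sev-3 issue blocking a launch could still be high priority.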

Medium · Technical

51 practiced

You notice a recurring pattern where brief traffic spikes from noisy clients cause Sev-1 alerts. Outline a plan to reduce these false positives while ensuring real incidents are still reliably surfaced. Include immediate mitigations (cooldowns, client throttling), metric smoothing or aggregation, topology-aware alerting (per-customer vs global), and longer-term architectural changes to reduce blast radius.
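Two of the mitigations named above, smoothing out brief spikes and distinguishing per-customer from global impact, can be sketched in a few lines. This is a minimal illustration under stated assumptions: `SustainedAlert`, `is_global`, and their parameters are hypothetical names, and real alerting systems usually express the same idea as a "for" duration or a multi-window rule.

```python
from collections import deque
from typing import Dict


class SustainedAlert:
    """Fire only when the threshold is breached for `required` consecutive samples."""

    def __init__(self, threshold: float, required: int):
        self.threshold = threshold
        # Keep only the most recent `required` breach flags.
        self.window = deque(maxlen=required)

    def observe(self, value: float) -> bool:
        self.window.append(value > self.threshold)
        # A single spike cannot fill the window, so it never fires on its own.
        return len(self.window) == self.window.maxlen and all(self.window)


def is_global(error_rate_by_customer: Dict[str, float],
              threshold: float, min_share: float) -> bool:
    """Treat a breach as global only if enough distinct customers are breaching."""
    if not error_rate_by_customer:
        return False
    breaching = sum(1 for rate in error_rate_by_customer.values() if rate > threshold)
    return breaching / len(error_rate_by_customer) >= min_share
```

With this shape, one noisy client that breaches the threshold would be routed as a per-customer issue, while a breach shared by many customers would still escalate.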

Hard · Technical

30 practiced

After a postmortem you discover several incidents labeled Sev-2 should have been Sev-1. Propose a set of process and tooling changes to prevent future misclassifications: mandatory severity review cadence, severity-thermostat metrics, automated guardrails that flag misclassifications, improved training, and a lightweight appeals/audit process. Explain how you'd measure improvement and assign accountability.
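Measuring improvement requires a number to track. One option, sketched below, is to compare the severity declared during the incident with the severity assigned at post-incident review. The `IncidentRecord` fields and both helper functions are hypothetical, and the sketch assumes lower numbers mean higher severity.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IncidentRecord:
    incident_id: str
    declared_sev: int  # severity chosen by the responder (0 = most severe)
    reviewed_sev: int  # severity assigned at post-incident review


def under_classified(records: List[IncidentRecord]) -> List[IncidentRecord]:
    """Return incidents declared as less severe than the review concluded."""
    return [r for r in records if r.declared_sev > r.reviewed_sev]


def misclassification_rate(records: List[IncidentRecord]) -> float:
    """Share of incidents where declared and reviewed severity disagree."""
    if not records:
        return 0.0
    mismatches = sum(1 for r in records if r.declared_sev != r.reviewed_sev)
    return mismatches / len(records)
```

Tracking the rate per team and per quarter would give the review cadence something concrete to act on, and the under-classified list is a natural input to training material.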

Hard · Technical

33 practiced

During a major incident, multiple teams begin making independent mitigation changes, resulting in duplicated and conflicting actions. How would you ensure a coordinated response in which a single incident commander directs actions, duplicated mitigation is prevented, and all actions are logged? Describe the process changes, tooling (chatops, locks, change-coordination), and cultural practices you would implement.
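The locking and logging ideas can be illustrated with a small in-memory coordinator. This is a sketch, not a real chatops tool: `IncidentCoordinator` and its methods are hypothetical, and a production version would need persistent storage and authenticated identities.

```python
import time
from dataclasses import dataclass, field


@dataclass
class IncidentCoordinator:
    commander: str
    locks: dict = field(default_factory=dict)  # target -> responder holding it
    log: list = field(default_factory=list)    # append-only action log

    def _record(self, actor, target, action, outcome):
        self.log.append({"ts": time.time(), "actor": actor, "target": target,
                         "action": action, "outcome": outcome})

    def request(self, actor, target, action, approved_by):
        """Grant a mitigation only if the commander approved it and the target is free."""
        if approved_by != self.commander:
            self._record(actor, target, action, "rejected: not approved by commander")
            return False
        holder = self.locks.get(target)
        if holder is not None and holder != actor:
            self._record(actor, target, action, f"rejected: locked by {holder}")
            return False
        self.locks[target] = actor
        self._record(actor, target, action, "granted")
        return True

    def release(self, actor, target):
        """Free a target so another responder can act on it."""
        if self.locks.get(target) == actor:
            del self.locks[target]
            self._record(actor, target, "release", "released")
```

Every request is logged whether it is granted or rejected, so the timeline for the postmortem includes attempted actions as well as completed ones.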

Medium · System Design

36 practiced

You operate an alert pipeline that generates 20,000 alerts per day. Propose a scalable architecture to triage and classify alerts such that only high-severity incidents page human on-call. Include components (ingestion, deduplication, enrichment with deploy & topology metadata, rules engine, optional ML classifier, routing), data flow, and describe tradeoffs between rule-based vs ML-based classification for reliability, transparency, and operational overhead.
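The deduplication, enrichment, and rule-based routing stages can be sketched as plain functions. Everything here is an assumption made for illustration: the alert fields (`service`, `check`, `customer_facing`, `error_rate`), the 5% threshold, and the three routing outcomes.

```python
import hashlib


def fingerprint(alert: dict) -> str:
    """Stable identity for an alert, used to collapse repeats."""
    key = f"{alert['service']}|{alert['check']}"
    return hashlib.sha256(key.encode()).hexdigest()


class Deduplicator:
    """Suppress alerts that repeat the same fingerprint within `window_s` seconds."""

    def __init__(self, window_s: float):
        self.window_s = window_s
        self.last_seen = {}  # fingerprint -> timestamp of the most recent occurrence

    def is_new(self, alert: dict, now: float) -> bool:
        fp = fingerprint(alert)
        last = self.last_seen.get(fp)
        self.last_seen[fp] = now
        return last is None or now - last > self.window_s


def route(alert: dict, recent_deploys: set) -> str:
    """Enrich with deploy metadata, then apply ordered rules (hypothetical thresholds)."""
    enriched = dict(alert, recently_deployed=alert["service"] in recent_deploys)
    if enriched["customer_facing"] and enriched["error_rate"] >= 0.05:
        return "page"    # high severity: wake a human
    if enriched["recently_deployed"]:
        return "ticket"  # likely deploy-related: follow up in working hours
    return "log"         # keep for analysis only
```

Rules like these are easy to audit and explain during an incident, which is the main argument for keeping them as the paging path even when an ML classifier is added alongside to suggest labels.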
