InterviewStack.io

Incident Management and Response Questions

Covers operational handling of production outages and service incidents across the full lifecycle, from preparation through detection, triage, containment, mitigation, recovery, and post-incident review. Interviewers assess monitoring and observability signals, alerting thresholds and on-call rotations, severity classification and escalation paths, incident command and coordination, runbooks and playbooks, immediate containment and mitigation techniques that minimize customer impact, restoration and recovery procedures, and evidence capture when relevant. Candidates should be able to describe root cause analysis practices, blameless post-incident reviews, tracking of remediation and follow-up actions, driving cross-functional ownership of fixes, and how incident learnings feed into long-term reliability improvements, tooling, and automation. Senior-level expectations include organizing incident response teams for production reliability, defining severity levels and escalation policies, balancing rapid decisions with risk management, and continuously improving processes, runbooks, and instrumentation.

Easy | Technical
Describe the essential components of a runbook for a failing scheduled ETL workflow (Airflow DAG). Include preconditions, step-by-step diagnostic commands/queries, safe mitigation actions, escalation contacts, and recovery verification steps. Provide at least five concrete actions or commands you would include and explain why.
Medium | Technical
Implement a Python function parse_airflow_failures(log_path: str) -> list that scans a large Airflow log file (hundreds of MB) and returns JSON summaries for the most recent 5 task failures: {timestamp, dag_id, task_id, attempt, stack_trace_snippet}. Assume failure blocks are prefixed with 'TASK_FAILED' and timestamps are ISO. Optimize for low memory usage.
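
One possible approach to the question above, offered as a minimal sketch and not a definitive answer: stream the file line by line and keep only the last five failures in a bounded deque, so memory does not grow with file size. The header layout matched by `HEADER` (fields `dag_id=`, `task_id=`, `attempt=` on the `TASK_FAILED` line) is a hypothetical format chosen for illustration, as are the blank-line block terminator and the `MAX_TRACE_LINES` cap. The sketch also assumes the log is appended chronologically, so the last five blocks in file order are the most recent, and it leaves the timestamp as an unvalidated string.

```python
import re
from collections import deque

# Hypothetical header layout (an assumption, not a real Airflow log format):
# TASK_FAILED 2024-05-01T12:00:00+00:00 dag_id=etl task_id=load attempt=2
HEADER = re.compile(
    r"^TASK_FAILED\s+(?P<timestamp>\S+)\s+dag_id=(?P<dag_id>\S+)"
    r"\s+task_id=(?P<task_id>\S+)\s+attempt=(?P<attempt>\d+)"
)
MAX_TRACE_LINES = 10  # assumed cap on the stack trace snippet


def _finish(recent, current, trace):
    """Attach the collected trace lines and store the finished failure."""
    if current is not None:
        current["stack_trace_snippet"] = "\n".join(trace)
        recent.append(current)


def parse_airflow_failures(log_path: str, limit: int = 5) -> list:
    # deque(maxlen=limit) silently drops the oldest entry when full,
    # so at most `limit` failures are held in memory at any time.
    recent = deque(maxlen=limit)
    current, trace = None, []

    with open(log_path, "r", encoding="utf-8", errors="replace") as fh:
        for raw in fh:  # iterating the file object reads one line at a time
            line = raw.rstrip("\n")
            match = HEADER.match(line)
            if match:
                # A new header closes any failure block still open.
                _finish(recent, current, trace)
                current = match.groupdict()
                current["attempt"] = int(current["attempt"])
                trace = []
            elif current is not None:
                if not line.strip():
                    # Assumed terminator: a blank line ends the block.
                    _finish(recent, current, trace)
                    current = None
                elif len(trace) < MAX_TRACE_LINES:
                    trace.append(line)

    _finish(recent, current, trace)  # block still open at end of file
    return list(recent)  # oldest first, JSON-serializable dicts
```

The returned dictionaries can be passed to `json.dumps` directly. If the real log is not in chronological order, the deque would need to be replaced with a bounded heap keyed on the parsed timestamp.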
Medium | System Design
Design an incident management dashboard for data engineering that displays active incidents, severity, impacted pipelines/datasets, SLO burn rates, recent alerts, on-call assignments, and quick links to runbooks. Describe the data sources, data model for incidents and SLOs, key components, and scaling considerations for 1,000 engineers and 10,000 pipelines.
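
As a starting point for the data model part of the question above, the following is a minimal sketch under stated assumptions. The class and field names (`SLO`, `Incident`, `impacted_pipelines`, `runbook_url`, and so on) are hypothetical and chosen for illustration. The burn rate uses the common definition of observed error rate divided by the error budget (1 minus the SLO target), so a value of 1.0 means the budget is being consumed exactly at the sustainable pace.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SLO:
    name: str
    target: float       # e.g. 0.999 means 99.9% of events must be good
    good_events: int    # good events observed in the evaluation window
    total_events: int   # all events observed in the evaluation window

    def burn_rate(self) -> float:
        """Observed error rate divided by the allowed error budget."""
        if not 0 < self.target < 1:
            raise ValueError("target must be strictly between 0 and 1")
        if self.total_events == 0:
            return 0.0  # no traffic, so no budget consumed
        error_rate = 1 - self.good_events / self.total_events
        error_budget = 1 - self.target
        return error_rate / error_budget


@dataclass
class Incident:
    incident_id: str
    severity: str                      # e.g. "SEV1"; the scale is an assumption
    status: str                        # e.g. "active", "mitigated", "resolved"
    impacted_pipelines: List[str] = field(default_factory=list)
    on_call: Optional[str] = None
    runbook_url: Optional[str] = None
```

For example, an SLO with a 0.999 target and 990 good events out of 1,000 has an error rate of 1% against a 0.1% budget, which is a burn rate of roughly 10.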
Hard | Technical
During a live outage with suspected data loss, discuss the trade-offs between immediate mitigation actions (e.g., stop pipeline, rollback) versus careful investigation (which may prolong customer impact). Present a decision framework that weighs customer impact, irreversibility of action, legal/compliance constraints, and confidence in rollback correctness.
Medium | Technical
Implement a Python generator dedupe_alerts(stream) that accepts a stream of alert events (each with timestamp, metric, tags, value) and yields only unique alerts within a 5-minute sliding window based on a fingerprint of metric+sorted tags. Ensure memory remains bounded and explain how you evict old entries.
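
One way to approach the question above, as a minimal sketch and not the only valid design: keep an `OrderedDict` from fingerprint to the timestamp of the last emitted alert, and evict from the front before handling each event. The sketch assumes alerts are dictionaries with a numeric `timestamp` in epoch seconds, a `metric` string, and a `tags` dictionary with hashable values, and that the stream arrives in non-decreasing timestamp order. It interprets "unique within a window" as suppressing repeats for five minutes after the last emitted alert with the same fingerprint; other interpretations are possible.

```python
from collections import OrderedDict

WINDOW_SECONDS = 300  # 5-minute window from the question


def fingerprint(alert):
    """Metric plus sorted tags, as a hashable tuple."""
    return (alert["metric"], tuple(sorted(alert["tags"].items())))


def dedupe_alerts(stream, window=WINDOW_SECONDS):
    # Entries are only inserted when an alert is emitted, and the stream
    # is assumed time-ordered, so insertion order equals timestamp order
    # and the oldest entry is always at the front.
    seen = OrderedDict()
    for alert in stream:
        now = alert["timestamp"]

        # Evict every fingerprint whose suppression window has expired.
        while seen:
            _, oldest_ts = next(iter(seen.items()))
            if now - oldest_ts < window:
                break
            seen.popitem(last=False)

        fp = fingerprint(alert)
        if fp in seen:
            continue  # duplicate inside the window, suppress it
        seen[fp] = now
        yield alert
```

Because expired entries are removed on every event, memory is bounded by the number of distinct fingerprints emitted within any five-minute span, not by the total length of the stream. An out-of-order stream would need a small reordering buffer in front of this generator.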
