Alerting Strategy and Incident Response Questions

Design alerting strategies and incident response practices that turn observability signals into actionable operations. Topics include alert design and classification, threshold-based versus anomaly-based detection, preventing alert fatigue, escalation and on-call flow, runbook and playbook design, integrating alerts with incident management, post-incident review and blameless postmortems, and how monitoring and observability improve incident detection and mean time to resolution. Also covers designing alerts for different domains and deciding which runbooks and context to give responders.

Hard | System Design

Design automated runbook actions for common ML incidents (for example: automated confidence-threshold rollback, model-version rollback, or reprocessing a bad batch) using orchestration tools (Airflow/Kubernetes Jobs). Describe safety checks, idempotency, RBAC, approval workflow, and audit logging requirements.
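
One way to make the idempotency, approval, and audit requirements concrete is the in-memory sketch below. It is an illustration under stated assumptions, not an Airflow or Kubernetes integration: `RollbackRunner`, its fields, and the incident and version names are all hypothetical, and a real system would persist the idempotency keys and audit log in durable storage.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RollbackRunner:
    """Hypothetical model-version rollback action with guard rails."""
    current_version: str
    known_good_versions: set          # safety check: allowed rollback targets
    approvers: set                    # RBAC: principals allowed to approve
    audit_log: list = field(default_factory=list)
    completed: dict = field(default_factory=dict)  # idempotency key -> result

    def _audit(self, key, requested_by, approved_by, outcome):
        # Every attempt is recorded, including rejections and replays.
        self.audit_log.append({
            "key": key,
            "requested_by": requested_by,
            "approved_by": approved_by,
            "outcome": outcome,
        })

    def rollback(self, incident_id: str, target_version: str,
                 requested_by: str, approved_by: Optional[str] = None) -> str:
        key = f"{incident_id}:{target_version}"

        # Idempotency: a retried job returns the stored result and changes nothing.
        if key in self.completed:
            self._audit(key, requested_by, approved_by, "replayed")
            return self.completed[key]

        if target_version not in self.known_good_versions:
            result = "rejected: target is not a known-good version"
        elif approved_by not in self.approvers or approved_by == requested_by:
            # Approval must come from an authorized second person.
            result = "rejected: missing or invalid approval"
        else:
            self.current_version = target_version
            result = "rolled_back"
            self.completed[key] = result  # only successes are cached

        self._audit(key, requested_by, approved_by, result)
        return result
```

Rejections are deliberately not cached, so the same request can succeed later once a valid approval is attached, while a successful rollback that is retried by the orchestrator becomes a no-op that still leaves an audit entry.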

Medium | Technical

Design a runbook for this incident: 'A nightly batch feature pipeline wrote null values for a key feature for a single customer segment, causing degraded model performance for that segment.' Include triage checks, short-term mitigation, reprocessing/backfill logic, and long-term preventative actions.
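
A first triage check for this scenario could be a per-segment null rate for the affected feature, which also scopes the backfill. The sketch below assumes rows are dictionaries with a `segment` key; the function names, the row shape, and the 1% tolerance are illustrative assumptions rather than part of the question.

```python
from collections import defaultdict


def null_rate_by_segment(rows, feature):
    """Return {segment: fraction of rows where `feature` is missing or None}."""
    totals = defaultdict(int)
    nulls = defaultdict(int)
    for row in rows:
        segment = row["segment"]
        totals[segment] += 1
        if row.get(feature) is None:
            nulls[segment] += 1
    return {segment: nulls[segment] / totals[segment] for segment in totals}


def segments_to_backfill(rows, feature, max_null_rate=0.01):
    """List segments whose null rate exceeds the (assumed) tolerance."""
    rates = null_rate_by_segment(rows, feature)
    return sorted(seg for seg, rate in rates.items() if rate > max_null_rate)
```

Running the same check against the previous night's output helps distinguish a new regression from a segment that has always been sparse.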

Medium | Technical

Write SQL (pseudo-SQL acceptable) to compute per-feature population histograms over the last 30 days and compare them to a 90-day baseline using Kullback-Leibler divergence. Table: features(model_id, feature_name, value, ts). Show handling of continuous features via binning and how to flag features with KL > threshold.

Easy | Technical

What structured logging and tracing information should a model-serving endpoint emit to allow a responder to diagnose prediction pipeline failures without exposing PII? Provide a list of fields and explain why each helps triage.
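
As a starting point, a log record along the following lines carries enough to correlate and triage a failed prediction while keeping raw feature values out of the log. The field set, the function name `build_prediction_log`, and the `hash_key` parameter are assumptions for this sketch. A keyed hash (HMAC) is used because a plain hash of low-entropy inputs can often be reversed by guessing.

```python
import hashlib
import hmac
import json
import time
import uuid


def build_prediction_log(model_name, model_version, features, latency_ms,
                         status, hash_key, error_type=None, trace_id=None):
    """Build a JSON-serializable log record that omits raw feature values."""
    # Canonical JSON so identical inputs always produce the same fingerprint.
    payload = json.dumps(features, sort_keys=True, default=str).encode("utf-8")
    return {
        "ts": time.time(),                         # when the request was handled
        "trace_id": trace_id or uuid.uuid4().hex,  # joins logs across services
        "model_name": model_name,
        "model_version": model_version,            # ties failures to a deploy
        "feature_names": sorted(features),         # schema drift shows up here
        "null_features": sorted(k for k, v in features.items() if v is None),
        "input_fingerprint": hmac.new(hash_key, payload, hashlib.sha256).hexdigest(),
        "latency_ms": latency_ms,
        "status": status,                          # e.g. "ok" or "error"
        "error_type": error_type,                  # exception class, not the message
    }
```

Logging the exception class rather than the message is a cautious default, since error messages frequently echo the offending input.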

Medium | Technical

Design an experiment to evaluate whether a new anomaly-detection alert reduces Mean Time To Detect (MTTD) and Mean Time To Recover (MTTR) for model-quality incidents. Outline A/B cohorts, metrics to collect, duration, and statistical checks you would use.
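
For the statistical check, detection times are usually skewed and incident counts small, so a permutation test is one reasonable option among several. The sketch below runs a one-sided test of whether the treatment cohort (new alert enabled) has a lower mean time to detect than the control cohort. The function name, the permutation count, and the fixed seed are assumptions for illustration.

```python
import random
from statistics import mean


def permutation_p_value(control, treatment, n_permutations=2000, seed=0):
    """One-sided permutation test for mean(control) - mean(treatment) > 0.

    Returns (observed_difference, p_value).
    """
    rng = random.Random(seed)  # fixed seed so the check is reproducible
    observed = mean(control) - mean(treatment)
    pooled = list(control) + list(treatment)
    n_control = len(control)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = mean(pooled[:n_control]) - mean(pooled[n_control:])
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    return observed, (hits + 1) / (n_permutations + 1)
```

The same function can be applied to recovery times. Because incidents are not randomized the way users are, it is worth stating in the design how cohorts were assigned (for example by service or by alternating weeks) and what confounders that leaves.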
