Covers the end-to-end process of investigating incidents and converting findings into durable program improvements. Candidates should be able to describe how to run structured post-incident reviews and root cause analyses that probe beyond the immediate failure to uncover underlying system, process, human, and governance causes. Topics include evidence collection, timeline reconstruction, causal analysis techniques, identification and prioritization of corrective actions, remediation tracking and verification, validating the effectiveness of fixes, communicating lessons learned across teams, and using incident data to inform risk assessments and policy or process changes. Emphasis should be placed on practical examples of preventing recurrence, balancing near-term containment with long-term fixes, and building a blameless culture that supports continuous improvement.
Medium · Technical
62 practiced
How would you define and implement SLOs and alerting for ML model accuracy and latency? Provide example SLO definitions (e.g., 'weekly upstream-labeled accuracy >= 90%', '99th-percentile inference latency < 200ms'), explain measurement windows and burn-rate calculation, and describe how to translate SLO violations into actionable alerts and paging rules with escalation thresholds.
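As a starting point for the burn-rate part of this question, here is a minimal sketch. It assumes the common definition of burn rate as the observed bad-event ratio divided by the error budget (1 minus the SLO target), and a two-window paging rule in which both a long and a short window must exceed the same threshold. The names `Window`, `burn_rate`, and `should_page` are hypothetical, and the threshold is a parameter to tune, not a recommendation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """Counts for one measurement window (hypothetical structure)."""
    bad: int    # events violating the SLI, e.g. requests slower than 200 ms
    total: int  # all events observed in the window


def burn_rate(window: Window, slo_target: float) -> float:
    """Observed bad-event ratio divided by the error budget (1 - target).

    A value of 1.0 means the budget is being spent exactly at the rate
    that would exhaust it at the end of the SLO period.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if window.total == 0:
        return 0.0  # assumption: no traffic means no budget is consumed
    error_budget = 1.0 - slo_target
    return (window.bad / window.total) / error_budget


def should_page(long_w: Window, short_w: Window,
                slo_target: float, threshold: float) -> bool:
    """Page only if both windows burn faster than the threshold.

    The long window shows the burn is significant; the short window
    shows it is still happening, which reduces stale pages.
    """
    return (burn_rate(long_w, slo_target) >= threshold
            and burn_rate(short_w, slo_target) >= threshold)
```

For example, with a latency SLO target of 0.99, a one-hour window containing 200 slow requests out of 1,000 gives a burn rate of about 20. Paired with a five-minute window that is also above the chosen threshold, `should_page` returns `True`; if the short window has recovered, it returns `False`.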
Hard · Technical
67 practiced
For ML systems in healthcare, discuss the trade-offs between automatic rollback (triggered by SLO violations) and manual human-in-the-loop gating. Cover regulatory reporting obligations, patient safety risk, speed of detection, risk of flapping rollbacks, auditing requirements, and propose a hybrid policy that minimizes patient risk while enabling timely mitigations.
Medium · Technical
53 practiced
You're on-call and notice inference latency for an online recommendation model increased 3x immediately after a model update. Describe step-by-step the evidence collection and timeline reconstruction you would perform: include which Kubernetes commands you'd run (e.g., kubectl describe events, kubectl top), Prometheus queries for latency and resource metrics, how you'd fetch recent inference logs, feature-store health checks, and how you'd compare recent experiment runs to previous ones to isolate the cause (model vs infra vs data).
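One way to illustrate the timeline-reconstruction step is to merge events from several sources into a single ordered list, expressed as offsets from the deployment time. The sketch below assumes the events have already been collected (for example from Kubernetes events, alert history, and inference logs) and carry timezone-aware timestamps. `Event` and `build_timeline` are hypothetical names, not part of any tool mentioned in the question.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    """One observation gathered during evidence collection (hypothetical)."""
    ts: datetime   # timezone-aware timestamp of the observation
    source: str    # where it came from, e.g. "kubernetes" or "prometheus"
    message: str   # short human-readable description


def build_timeline(events: list[Event],
                   deploy_ts: datetime) -> list[tuple[int, str, str]]:
    """Return (seconds_from_deploy, source, message), sorted by time.

    Negative offsets are events before the model update; positive
    offsets are events after it. Sorting is stable, so events with
    identical timestamps keep their input order.
    """
    ordered = sorted(events, key=lambda e: e.ts)
    return [
        (int((e.ts - deploy_ts).total_seconds()), e.source, e.message)
        for e in ordered
    ]
```

Reading the merged list top to bottom makes it easier to see whether the latency change follows the rollout directly or is preceded by an earlier infrastructure or data event.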
Hard · Technical
68 practiced
A production model trained over the past 6 months included label leakage that inflated offline metrics and led to poor live performance after deployment. Draft a comprehensive remediation and communication plan covering: immediate containment steps, identifying affected products/models, re-training strategy and timelines, replay/backfill plan if feasible, notifying internal and external stakeholders (customers, legal, regulators), and verification steps to declare the issue resolved.
Hard · System Design
60 practiced
Design an enterprise incident analysis platform for ML that ingests telemetry (metrics, logs, traces), stores model artifacts and dataset snapshots, reconstructs timelines, and supports automated RCA (anomaly correlation and causal hints). Scale requirements: 10,000 models, 100M predictions/day, 2TB/day telemetry. Describe architecture components (ingest, storage, query, RCA engine, UI), data schema, retention strategy, auth/ACL model, and how to support cross-team queries while controlling cost.