Production Incident Response and Diagnostics Questions
Covers structured practices, techniques, tooling, and decision making for detecting, triaging, mitigating, and learning from failures in live systems. Core skills include rapid incident triage, establishing normal baselines, gathering telemetry from logs, metrics, traces, and profilers, forming and testing hypotheses, reproducing or simulating failures, isolating root causes, and validating fixes. Candidates should know how to choose appropriate mitigations, such as rolling back, applying patches, throttling traffic, or scaling resources, and when to pursue each option.

The topic also includes coordination and communication during incidents: incident command, stakeholder updates, escalation, handoffs, and blameless postmortems. Emphasis is placed on building institutional knowledge through runbooks, automated diagnostics, improved monitoring and alerting, capacity planning, and systemic fixes to prevent recurrence.

Familiarity with common infrastructure failure modes and complex multi-system interactions is expected, for example cascading failures, resource exhaustion, networking and deployment issues, and configuration drift. Tooling and methods include log analysis, distributed tracing, profiling and debugging tools, cross-system correlation, and practices that reduce mean time to detection (MTTD) and mean time to resolution (MTTR).
Medium · Technical
Design a streaming/approximate quantile solution (e.g., t-digest or GK algorithm) to compute p50/p95/p99 latencies over a high-volume log stream where storing all raw latencies is infeasible. Describe the algorithm choice, memory bounds, error guarantees, how to merge sketches from multiple collectors, and how to expose alerts when tail latency exceeds thresholds.
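One possible answer sketch, in Python: rather than a full t-digest, a log-spaced bucket histogram (the idea behind DDSketch- and HDR-style sketches) gives bounded relative error alpha, roughly O(log(max/min)/alpha) counters regardless of stream length, and trivial merging by summing bucket counts. Class and parameter names here are illustrative, not a real library API:

```python
import math
from collections import defaultdict

class LogBucketSketch:
    """Mergeable quantile sketch over log-spaced buckets.

    Every value in bucket i lies within a factor gamma of the bucket edge,
    so a reported quantile is within ~alpha relative error of the truth.
    """

    def __init__(self, alpha=0.01):
        self.alpha = alpha
        self.gamma = (1 + alpha) / (1 - alpha)
        self.log_gamma = math.log(self.gamma)
        self.buckets = defaultdict(int)  # bucket index -> count
        self.count = 0

    def add(self, value):
        # Map value to a bucket index; non-positive values use a sentinel.
        idx = math.ceil(math.log(value) / self.log_gamma) if value > 0 else 0
        self.buckets[idx] += 1
        self.count += 1

    def merge(self, other):
        # Sketches from multiple collectors merge by summing bucket counts.
        for idx, c in other.buckets.items():
            self.buckets[idx] += c
        self.count += other.count

    def quantile(self, q):
        # Walk buckets in order until the cumulative count reaches q * total.
        target = q * self.count
        cumulative = 0
        for idx in sorted(self.buckets):
            cumulative += self.buckets[idx]
            if cumulative >= target:
                return self.gamma ** idx  # upper edge of the bucket
        return float("nan")
```

Each collector ships its sketch every flush interval; a central aggregator merges them and alerts when, say, `quantile(0.99)` exceeds the latency SLO.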
Hard · Technical
You have limited SRE resources and three concurrent incidents: (1) a P2 model-quality degradation affecting a small set of VIP customers, (2) a P1 outage affecting internal dashboards used by ops, and (3) a P3 backlog in training jobs delaying experiments. Propose a prioritization matrix with quantified criteria (e.g., revenue impact, number of customers affected, regulatory risk, risk of cascading failures), rank these incidents, and justify your prioritization and resource allocation.
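A minimal sketch of the kind of matrix an answer might propose. All weights and 0-10 criterion scores below are hypothetical placeholders; real values would come from revenue data, contracts, and compliance context:

```python
# Hypothetical weights; in practice these are agreed with the business.
WEIGHTS = {
    "revenue_impact": 0.35,
    "customers_affected": 0.20,
    "regulatory_risk": 0.15,
    "cascading_failure_risk": 0.30,
}

# Illustrative 0-10 scores for the three incidents in the question.
incidents = {
    "P2 VIP model-quality degradation": {
        "revenue_impact": 7, "customers_affected": 2,
        "regulatory_risk": 3, "cascading_failure_risk": 2,
    },
    "P1 internal dashboard outage": {
        "revenue_impact": 3, "customers_affected": 5,
        "regulatory_risk": 1, "cascading_failure_risk": 8,
    },
    "P3 training job backlog": {
        "revenue_impact": 2, "customers_affected": 1,
        "regulatory_risk": 1, "cascading_failure_risk": 1,
    },
}

def priority_score(scores, weights=WEIGHTS):
    # Weighted sum over criterion scores; higher means more urgent.
    return sum(weights[k] * scores[k] for k in weights)

ranked = sorted(incidents, key=lambda n: priority_score(incidents[n]), reverse=True)
```

With these particular numbers the dashboard outage ranks first because of its cascading-failure risk, which is exactly the kind of trade-off the justification should make explicit.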
Hard · Technical
How would you instrument distributed deep learning training jobs to collect telemetry useful for diagnosing slowdowns or hardware failures? Describe which metrics to collect (GPU utilization, memory histograms, kernel stalls, NVLink bandwidth, IO throughput), ideal sampling rates, how to aggregate and centralize telemetry, and how to set alerts that meaningfully indicate hardware issues vs model code inefficiencies.
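One triage heuristic an answer might include, sketched in Python: a single rank whose GPU utilization falls far below its peers suggests a sick host or device, while uniformly low utilization across all ranks points at model code or the input pipeline. The thresholds are illustrative, not tuned values:

```python
import statistics

def classify_slowdown(gpu_util_by_rank, util_floor=60.0, straggler_gap=30.0):
    """Rough triage over per-rank GPU utilization samples (percent).

    - One rank far below the median -> likely hardware (flag the rank).
    - All ranks uniformly low       -> likely model code or input pipeline.
    """
    median = statistics.median(gpu_util_by_rank.values())
    stragglers = sorted(r for r, u in gpu_util_by_rank.items()
                        if median - u > straggler_gap)
    if stragglers:
        return ("suspect_hardware", stragglers)
    if median < util_floor:
        return ("suspect_code_or_input", [])
    return ("healthy", [])
```

An alert built on this rule fires with the suspect rank attached, which shortcuts the "is it the model or the machine?" debate that otherwise dominates the first hour of such incidents.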
Hard · Technical
In a complex microservice mesh for AI pipelines, many services show correlated anomalies during incidents. Discuss methods to perform root-cause analysis that can attribute causation rather than mere correlation: include trace critical-path analysis, dependency graphs with weighted edges, probabilistic causal models, and heuristics to prune the search space. Provide practical steps you would take under time pressure.
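The dependency-graph pruning step can be sketched as a simple scoring heuristic (the scoring rule here is one illustrative choice, not a standard algorithm): an anomalous service that many other anomalous services transitively depend on, and that itself calls no anomalous service, is the strongest root-cause candidate.

```python
from collections import deque

def rank_root_cause_candidates(deps, anomalous):
    """deps maps each service to the services it calls.
    Returns anomalous services ordered from most to least suspicious."""
    # Reverse the edges: who depends on X?
    dependents = {}
    for s, targets in deps.items():
        for t in targets:
            dependents.setdefault(t, set()).add(s)

    def transitive_dependents(start):
        seen, queue = set(), deque([start])
        while queue:
            for d in dependents.get(queue.popleft(), ()):
                if d not in seen:
                    seen.add(d)
                    queue.append(d)
        return seen

    scored = []
    for s in anomalous:
        explained = transitive_dependents(s) & anomalous
        anomalous_upstream = set(deps.get(s, ())) & anomalous
        # Reward explaining many anomalies; penalize anomalous dependencies.
        scored.append((len(explained) - len(anomalous_upstream), s))
    return [s for _, s in sorted(scored, reverse=True)]
```

Under time pressure this collapses a mesh-wide alert storm to one or two services worth tracing in depth.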
Hard · Technical
A model's prediction quality slowly degrades during peak hours. Investigation suggests the system's outputs are fed back into upstream features (a feedback loop): recommendations change user behavior, which changes inputs, which changes predictions. Explain how to detect such feedback loops (instrumentation and experiments), measure causal impact, and design mitigation strategies such as delayed feedback windows, randomized holdouts, or ensemble guards.
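The randomized-holdout mitigation can be sketched as sticky, hash-based assignment (the salt and fraction are illustrative): holdout users receive a fixed baseline policy, so their behavior is never shaped by the model's own outputs, and a widening gap between holdout and treatment input distributions is direct evidence of a feedback loop.

```python
import hashlib

HOLDOUT_FRACTION = 0.02  # illustrative: 2% of users get the baseline policy

def in_holdout(user_id, salt="feedback-holdout-v1", fraction=HOLDOUT_FRACTION):
    """Deterministic, sticky assignment to the randomized holdout.

    Hashing user_id with a salt keeps assignment stable across requests
    and independent of any model-influenced feature.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    return int(digest[:8], 16) / 0xFFFFFFFF < fraction
```

Comparing the same input features between the two cohorts over time both detects the loop and estimates its causal magnitude; rotating the salt starts a fresh experiment.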