InterviewStack.io

Monitoring and Alerting Questions

This topic covers designing monitoring, observability, and alerting for systems with real-time or near-real-time requirements. Candidates should demonstrate how to select and instrument key metrics (end-to-end and per-stage latency, throughput, error rates, processing lag, queue lengths, resource usage), how to approach logging and distributed tracing, and how to track business and data-quality metrics. Cover alerting approaches, including threshold-based, baseline- and trend-based, and anomaly detection; designing alert thresholds to balance sensitivity against false positives; severity classification and escalation policies; incident response integration and runbook design; dashboards for different audiences and real-time BI considerations; SLOs and SLAs, error budgets, and cost trade-offs when collecting telemetry. For streaming systems, include strategies for detecting consumer lag, event loss, and late data, and approaches that enable rapid debugging and root cause analysis while avoiding alert fatigue.

Easy · Technical
You are onboarding a new near-real-time ETL pipeline that ingests events and updates dashboards. List and justify the key monitoring metrics you would instrument initially (include end-to-end latency, per-stage latency, throughput, error rates, processing lag, queue lengths, CPU/memory usage). For each metric state the recommended aggregation frequency (e.g., 1s / 10s / 1m), short vs long retention needs, and which audience(s) will use it (executive dashboard, data-ops, on-call alert).
Medium · Technical
Describe a practical approach for building a trend-based anomaly detection system for data-quality metrics (e.g., daily unique users, null-rate on a key column). Include how you'd build the baseline, handle seasonality, set alert thresholds, and evaluate false positive rate.
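One possible starting point for an answer is sketched below. It is an illustration under stated assumptions, not a reference solution: the metric is assumed to arrive as one value per day, seasonality is assumed to be weekly (`period=7`), and the baseline is the median of past values at the same phase of the cycle, with deviation measured by a robust z-score based on the median absolute deviation. The function names and the `3.5` cutoff are hypothetical choices. The `backtest_alert_rate` helper replays history to estimate how often the rule would have fired, which approximates the false positive rate only if that history is known to be incident-free.

```python
from statistics import median


def seasonal_peers(history, period=7):
    """Past values that share the seasonal phase of the next, unseen point."""
    phase = len(history) % period
    return [v for i, v in enumerate(history) if i % period == phase]


def is_anomalous(history, new_value, period=7, threshold=3.5):
    """Flag new_value when its robust z-score against same-phase history is large."""
    peers = seasonal_peers(history, period)
    if len(peers) < 3:
        return False  # too little history to form a baseline; stay quiet
    med = median(peers)
    mad = median(abs(v - med) for v in peers)
    if mad == 0:
        return new_value != med  # perfectly flat baseline: any change stands out
    # 0.6745 rescales the MAD so the score is comparable to a standard z-score.
    robust_z = 0.6745 * (new_value - med) / mad
    return abs(robust_z) > threshold


def backtest_alert_rate(series, warmup=21, **kwargs):
    """Fraction of points after warmup that would have alerted when replayed."""
    checked = range(warmup, len(series))
    alerts = sum(is_anomalous(series[:i], series[i], **kwargs) for i in checked)
    return alerts / len(checked)
```

Using the median and MAD rather than the mean and standard deviation keeps a single past outage from widening the baseline, at the cost of needing several cycles of history before the rule becomes useful.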
Hard · Technical
Design an algorithm to auto-tune alert thresholds for a metric using an exponentially weighted moving average (EWMA) or Holt-Winters seasonal model. Describe how you'd avoid alert storms due to transient noise and how you'd validate that automated threshold adjustments are safe.
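A minimal sketch of the EWMA variant is shown below, under several assumptions that are not part of the question: the alert is one-sided (high values only), the threshold is the EWMA mean plus `k` EWMA standard deviations, an alert fires only after `min_breaches` consecutive breaching samples, and the baseline is frozen while samples are breaching so that an incident does not raise its own threshold. The class name and all parameter defaults are hypothetical.

```python
class EwmaThreshold:
    """Adaptive upper threshold with a consecutive-breach debounce."""

    def __init__(self, alpha=0.1, k=3.0, min_breaches=3, warmup=10):
        self.alpha = alpha              # smoothing factor for mean and variance
        self.k = k                      # threshold width in standard deviations
        self.min_breaches = min_breaches
        self.warmup = warmup            # samples to observe before alerting
        self.mean = None
        self.var = 0.0
        self.seen = 0
        self.breaches = 0

    def upper_bound(self):
        return self.mean + self.k * self.var ** 0.5

    def update(self, x):
        """Consume one sample; return True when an alert should fire."""
        if self.mean is None:
            self.mean = float(x)
            self.seen = 1
            return False
        breached = self.seen >= self.warmup and x > self.upper_bound()
        if breached:
            self.breaches += 1          # baseline is left untouched on a breach
        else:
            self.breaches = 0
            diff = x - self.mean
            incr = self.alpha * diff
            self.mean += incr
            # Incremental exponentially weighted variance update.
            self.var = (1 - self.alpha) * (self.var + diff * incr)
        self.seen += 1
        return self.breaches >= self.min_breaches
```

The debounce suppresses single-sample spikes, which addresses alert storms from transient noise. Freezing the baseline has a known cost: a legitimate permanent level shift keeps alerting until someone resets the detector. One way to validate automated adjustments before trusting them is to run the detector in shadow mode alongside the existing static threshold and compare the alerts each would have raised.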
Medium · Technical
A dashboard shows a spike in P99 latency for a core query. Walk through a concrete triage process (logs, metrics, traces, SQL) to identify whether the cause is (a) upstream data skew, (b) compute resource saturation, or (c) a query plan regression. Include queries or commands you would run and what signals you'd expect to see.
Hard · Technical
Design a SQL + materialized-view based approach to detect regressions in a business KPI (e.g., daily active users) after a deployment. The system should minimize false positives and allow quick rollback decisions. Describe schema, queries, and how to present the signal to on-call and product teams.
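The core comparison in such a design could look roughly like the sketch below. It uses Python's built-in `sqlite3` module only so that the example is self-contained; SQLite has no materialized views, so the `daily_active_users` summary table, refreshed explicitly, stands in for one. The schema, table names, and function names are assumptions made for illustration. The baseline is the mean DAU over caller-chosen days, for example the same weekday in previous weeks, which is one way to keep weekly seasonality from producing false positives.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (user_id INTEGER, event_date TEXT);
-- Stand-in for a materialized view of distinct users per day.
CREATE TABLE daily_active_users (event_date TEXT PRIMARY KEY, dau INTEGER);
""")


def refresh_dau(conn):
    """Rebuild the summary table, as a materialized-view refresh would."""
    with conn:  # commit both statements together
        conn.execute("DELETE FROM daily_active_users")
        conn.execute("""
            INSERT INTO daily_active_users
            SELECT event_date, COUNT(DISTINCT user_id)
            FROM events
            GROUP BY event_date
        """)


def dau_ratio(conn, day, baseline_days):
    """DAU on `day` divided by mean DAU over baseline_days, or None if no baseline."""
    placeholders = ",".join("?" * len(baseline_days))
    (baseline,) = conn.execute(
        "SELECT AVG(dau) FROM daily_active_users "
        f"WHERE event_date IN ({placeholders})",
        list(baseline_days),
    ).fetchone()
    if not baseline:
        return None
    row = conn.execute(
        "SELECT dau FROM daily_active_users WHERE event_date = ?", (day,)
    ).fetchone()
    current = row[0] if row else 0
    return current / baseline
```

A ratio well below 1.0 after a deployment is the signal to surface. What cutoff justifies a rollback, and how many baseline days to use, are decisions the answer still needs to argue for.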
