InterviewStack.io

Monitoring and Alerting Questions

This topic covers designing monitoring, observability, and alerting for systems with real-time or near-real-time requirements. Candidates should demonstrate how to select and instrument key metrics (end-to-end and per-stage latency, throughput, error rates, processing lag, queue lengths, resource usage), how to plan logging and distributed tracing, and how to track business and data-quality metrics. Cover alerting approaches, including threshold-based, baseline- and trend-based, and anomaly detection; designing alert thresholds that balance sensitivity against false positives; severity classification and escalation policies; incident response integration and runbook design; dashboards for different audiences and real-time BI considerations; SLOs and SLAs, error budgets, and cost trade-offs when collecting telemetry. For streaming systems, include strategies for detecting consumer lag, event loss, and late data, and approaches that enable rapid debugging and root cause analysis while avoiding alert fatigue.
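As one illustration of how error budgets and alert thresholds can be tied together, here is a minimal sketch of a two-window burn-rate check. The function names, the 99.9% SLO target, and the 14.4 threshold are assumptions chosen for the example, not values taken from this page; requiring both a short and a long window to exceed the threshold is one common way to trade sensitivity against false positives.

```python
def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    A value of 1.0 means the error budget is being spent exactly on schedule.
    """
    if total_events == 0:
        return 0.0  # no traffic, so no budget is being consumed
    allowed_error_rate = 1.0 - slo_target
    return (bad_events / total_events) / allowed_error_rate


def should_page(short_window, long_window, slo_target=0.999, threshold=14.4):
    """Page only when BOTH windows burn faster than the threshold.

    Each window is a (bad_events, total_events) tuple. The short window
    makes the alert reset quickly; the long window filters brief spikes.
    The default target and threshold are illustrative assumptions.
    """
    return (
        burn_rate(*short_window, slo_target) >= threshold
        and burn_rate(*long_window, slo_target) >= threshold
    )
```

For example, `should_page((20, 1000), (150, 10000))` pages because both windows are burning budget at roughly 15x or more, while `should_page((20, 1000), (50, 10000))` does not, since the long window shows the spike is not sustained.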

Medium | Behavioral
64 practiced
Tell me about a time (or describe a hypothetical scenario) where you convinced product stakeholders to accept a change to SLOs, for example relaxing freshness requirements for non-critical features. Explain the data you presented, how you quantified trade-offs (business impact vs engineering cost), and how you negotiated the implementation and monitoring plan.
Medium | Technical
57 practiced
Explain sampling strategies for traces and logs in high-traffic data platforms: head-based sampling, tail-based sampling, and probabilistic sampling. For a mission-critical pipeline, recommend what should be sampled 100% (if anything), what should be sampled selectively, and suggest initial sampling rates or policies.
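To make the difference between the strategies concrete, the sketch below shows a deterministic probabilistic decision made at the head of a trace and a tail-based policy that runs once the trace has finished. `TraceSummary`, the 1000 ms slow threshold, and the 1% baseline rate are hypothetical choices for the example, not recommendations from this page.

```python
import hashlib
from dataclasses import dataclass


def head_sample(trace_id: str, rate: float) -> bool:
    """Head-based probabilistic sampling.

    Hashing the trace id gives every service the same keep/drop decision
    for a given trace without any coordination.
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 6 bytes to a float in [0, 1); 48 bits is exact in a double.
    bucket = int.from_bytes(digest[:6], "big") / 2**48
    return bucket < rate


@dataclass
class TraceSummary:
    """Hypothetical summary available after a trace has completed."""
    trace_id: str
    duration_ms: float
    has_error: bool


def tail_sample(trace: TraceSummary, slow_ms: float = 1000.0,
                baseline_rate: float = 0.01) -> bool:
    """Tail-based policy: keep every error and slow trace, plus a baseline."""
    if trace.has_error or trace.duration_ms >= slow_ms:
        return True
    return head_sample(trace.trace_id, baseline_rate)
```

The head decision is cheap but blind to outcomes; the tail decision sees errors and latency but requires buffering complete traces before deciding, which is the main cost trade-off to discuss.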
Medium | System Design
57 practiced
You're building a near-real-time BI dashboard for analysts that shows 'data freshness', 'last loaded partition timestamp', and 'rows ingested per minute' with sub-minute latency. Describe an architecture to support this dashboard: metric collection, aggregation/rollup frequency, a low-latency store for metrics, caching strategy, and techniques to avoid overloading the pipeline when many analysts query the dashboard simultaneously.
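One piece of the caching strategy can be sketched as a small time-to-live cache placed in front of the metrics store, so that many analysts loading the same panel within a few seconds trigger a single backend query. The class name and the injectable `clock` parameter are assumptions for illustration; the sketch is single-threaded and would need a lock or request coalescing in a real dashboard service.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Serve a cached value until it is older than ttl_seconds."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so the cache can be tested without sleeping
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get_or_compute(self, key: str, compute: Callable[[], Any]) -> Any:
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]  # still fresh: skip the backend query
        value = compute()  # stale or missing: query the metrics store once
        self._entries[key] = (now, value)
        return value
```

Choosing the TTL below the dashboard's latency target (for example, a few seconds for a sub-minute requirement) bounds staleness while capping query load at roughly one backend call per panel per TTL.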
Easy | Technical
91 practiced
Design three dashboards for the same pipeline: (1) an executive-level KPI dashboard, (2) an SRE/data-engineer operations dashboard, and (3) a data-quality dashboard for analysts. For each dashboard list six widgets/metrics, the refresh cadence, alerting links to runbooks, and a rationale for why each metric belongs on that dashboard.
Hard | Technical
70 practiced
A downstream consumer observes duplicates in a fact table. Producer telemetry shows no errors. Provide a detailed root-cause analysis plan: what metadata, metrics and logs you would collect to determine if duplicates were introduced upstream, in transport, or at the sink; write detection queries that can prove the source of duplicates; and propose corrective strategies (idempotent writes, dedup windows, upsert semantics, backfill) including trade-offs.
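A detection query of the kind asked for here might look like the following sketch, run against an in-memory SQLite table so it is self-contained. The table `fact_events` and the columns `event_id`, `src_partition`, `src_offset`, and `load_id` are hypothetical; the approach assumes each row carries the transport coordinates it was read from. Duplicates that share one source offset were most likely written twice at the sink or by a consumer replay, while duplicates with different offsets were most likely emitted more than once upstream.

```python
import sqlite3

CLASSIFY_DUPLICATES = """
SELECT event_id,
       COUNT(*) AS copies,
       COUNT(DISTINCT src_partition || ':' || src_offset) AS distinct_offsets,
       CASE
         WHEN COUNT(DISTINCT src_partition || ':' || src_offset) = 1
           THEN 'sink_or_consumer_replay'
         ELSE 'upstream_or_producer_retry'
       END AS likely_origin
FROM fact_events
GROUP BY event_id
HAVING COUNT(*) > 1
ORDER BY event_id
"""


def build_demo_table() -> sqlite3.Connection:
    """Create a tiny fact table with two kinds of duplicates (demo data only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE fact_events "
        "(event_id TEXT, src_partition INTEGER, src_offset INTEGER, load_id TEXT)"
    )
    conn.executemany(
        "INSERT INTO fact_events VALUES (?, ?, ?, ?)",
        [
            ("e1", 0, 10, "load-1"),  # same offset written by two loads
            ("e1", 0, 10, "load-2"),
            ("e2", 0, 11, "load-1"),  # same event at two different offsets
            ("e2", 0, 12, "load-1"),
            ("e3", 1, 5, "load-1"),   # not duplicated
        ],
    )
    return conn


def classify_duplicates(conn: sqlite3.Connection) -> list:
    """Return (event_id, copies, distinct_offsets, likely_origin) per duplicated key."""
    return conn.execute(CLASSIFY_DUPLICATES).fetchall()
```

On the demo data, `classify_duplicates(build_demo_table())` flags `e1` as a sink-side or replay duplicate and `e2` as an upstream duplicate. The classification is only a heuristic starting point: it cannot separate a consumer replay from a non-idempotent sink write without also comparing `load_id` values and commit logs.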
