InterviewStack.io

Monitoring and Alerting Questions

Designing monitoring, observability, and alerting for systems with real-time or near-real-time requirements. Candidates should demonstrate how to select and instrument key metrics (end-to-end and per-stage latency, throughput, error rates, processing lag, queue lengths, resource usage), design logging and distributed tracing strategies, and define business and data quality metrics. Cover alerting approaches including threshold-based, baseline- and trend-based, and anomaly detection; designing alert thresholds to balance sensitivity against false positives; severity classification and escalation policies; incident response integration and runbook design; dashboards for different audiences and real-time BI considerations; SLOs and SLAs, error budgets, and cost trade-offs when collecting telemetry. For streaming systems, include strategies for detecting consumer lag, event loss, and late data, and approaches that enable rapid debugging and root cause analysis while avoiding alert fatigue.

Medium · Technical
64 practiced
High-throughput services produce more logs and traces than can be stored economically. Describe sampling strategies (head-based, tail-based, adaptive) and pragmatic rules you would apply to preserve useful signals for BI root cause analysis while controlling cost.
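The three sampling strategies named in this question can be sketched as follows. This is a minimal illustration, not a reference answer: the function names, the trace dictionary fields (`trace_id`, `error`, `duration_ms`), and the default thresholds are all assumptions chosen for the example.

```python
import hashlib


def head_sample(trace_id: str, rate: float) -> bool:
    """Head-based: decide at trace start from a hash of the trace ID.

    Hashing (rather than random choice) means every service that sees the
    same trace ID reaches the same decision without coordination.
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < int(rate * 2**64)


def tail_sample(trace: dict, slow_ms: float = 2000.0,
                baseline_rate: float = 0.01) -> bool:
    """Tail-based: decide after the trace has completed.

    Always keep errors and slow traces (the signals most useful for root
    cause analysis); keep only a small baseline of ordinary traces.
    The thresholds here are hypothetical defaults.
    """
    if trace["error"]:
        return True
    if trace["duration_ms"] >= slow_ms:
        return True
    return head_sample(trace["trace_id"], baseline_rate)


def adaptive_rate(observed_per_sec: float, target_per_sec: float) -> float:
    """Adaptive: lower the sampling rate as traffic grows so that the
    number of kept traces per second stays near a fixed budget."""
    if observed_per_sec <= 0:
        return 1.0
    return min(1.0, target_per_sec / observed_per_sec)
```

Tail-based sampling requires buffering whole traces until they finish, so it costs more memory than head-based sampling; that trade-off is usually part of a complete answer.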
Easy · Technical
49 practiced
Explain the differences between logs, metrics, and distributed traces. For a BI data pipeline, provide one concrete example of a log entry you would emit, one metric you would expose (name, type, unit), and one trace/span you would create. Discuss the retention and cardinality trade-offs for each signal and the cost implications for a mid-sized BI org.
Hard · Technical
48 practiced
A key business metric shows a gradual drift following a schema migration. Outline a root cause analysis plan that uses data lineage, checkpoints, instrumentation, and historical snapshots to identify whether the migration, the transform logic, or a change in the source data is responsible. Provide the concrete queries and checks you'd run.
Hard · System Design
60 practiced
Propose a retention and aggregation strategy for metrics and logs for a BI org that balances queryability and cost. Include choices for hot vs warm vs cold tiers, rollups (e.g., 1s -> 1m -> 1h), downsampling, and materialization of common queries. Describe how you would implement rollup pipelines.
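One possible shape for the rollup step in this question is sketched below. It assumes raw points arrive as `(epoch_seconds, value)` pairs; the `Rollup` class and both function names are hypothetical. The key design choice is to store count, sum, min, and max per bucket rather than a precomputed average, so that coarser tiers can be derived from finer ones without averaging averages.

```python
from dataclasses import dataclass


@dataclass
class Rollup:
    """Mergeable summary of the points that fell into one time bucket."""
    count: int
    total: float
    minimum: float
    maximum: float

    @property
    def mean(self) -> float:
        return self.total / self.count

    def merge(self, other: "Rollup") -> None:
        """Fold another summary into this one (used for coarser tiers)."""
        self.count += other.count
        self.total += other.total
        self.minimum = min(self.minimum, other.minimum)
        self.maximum = max(self.maximum, other.maximum)


def rollup_points(points, bucket_seconds: int) -> dict:
    """Aggregate raw (epoch_seconds, value) points into fixed-width buckets,
    keyed by the bucket's start timestamp."""
    buckets = {}
    for ts, value in points:
        start = int(ts) - int(ts) % bucket_seconds
        summary = Rollup(1, value, value, value)
        if start in buckets:
            buckets[start].merge(summary)
        else:
            buckets[start] = summary
    return buckets


def merge_rollups(rollups: dict, bucket_seconds: int) -> dict:
    """Re-aggregate finer rollups (e.g. 1m) into coarser ones (e.g. 1h)."""
    coarse = {}
    for start, summary in rollups.items():
        key = start - start % bucket_seconds
        if key in coarse:
            coarse[key].merge(summary)
        else:
            coarse[key] = Rollup(summary.count, summary.total,
                                 summary.minimum, summary.maximum)
    return coarse
```

For example, `merge_rollups(rollup_points(points, 60), 3600)` produces hourly summaries from raw points via the one-minute tier. Percentiles are not mergeable this way; preserving them across tiers would need a sketch structure such as a histogram, which this example does not cover.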
Hard · Technical
53 practiced
Describe how you would instrument distributed tracing spans to compute per-stage and end-to-end latency for an ETL workflow where stages run across different services. Propose a span naming convention, minimum attributes, and the query/aggregation you would run to compute median and P99 per-stage latency.
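The aggregation this question asks for can be illustrated with a small in-memory sketch. It assumes spans have already been exported as dictionaries with `trace_id`, `name`, `start_ms`, and `end_ms` fields, and that names follow a hypothetical `etl.<pipeline>.<stage>` convention; none of this is tied to a specific tracing backend. Percentiles use the nearest-rank method.

```python
import math
from collections import defaultdict


def percentile(sorted_values, p: float):
    """Nearest-rank percentile of an ascending list, for p in (0, 100]."""
    rank = max(1, math.ceil(p / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def stage_latency(spans) -> dict:
    """Median and P99 duration per span name (one span name per stage)."""
    by_stage = defaultdict(list)
    for span in spans:
        by_stage[span["name"]].append(span["end_ms"] - span["start_ms"])
    result = {}
    for name, durations in by_stage.items():
        durations.sort()
        result[name] = {
            "p50": percentile(durations, 50),
            "p99": percentile(durations, 99),
        }
    return result


def end_to_end_latency(spans) -> dict:
    """Per trace: latest span end minus earliest span start."""
    bounds = {}
    for span in spans:
        lo, hi = bounds.get(span["trace_id"],
                            (span["start_ms"], span["end_ms"]))
        bounds[span["trace_id"]] = (min(lo, span["start_ms"]),
                                    max(hi, span["end_ms"]))
    return {trace_id: hi - lo for trace_id, (lo, hi) in bounds.items()}
```

Because the stages run on different services, end-to-end figures computed from raw timestamps are sensitive to clock skew between hosts; per-stage durations measured within a single service are not.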
