InterviewStack.io

Monitoring and Alerting Questions

This topic covers designing monitoring, observability, and alerting for systems with real-time or near-real-time requirements. Candidates should demonstrate how to select and instrument key metrics (end-to-end and per-stage latency, throughput, error rates, processing lag, queue lengths, resource usage), choose logging and distributed tracing strategies, and track business and data quality metrics. Cover alerting approaches, including threshold-based, baseline- and trend-based, and anomaly detection; designing alert thresholds that balance sensitivity against false positives; severity classification and escalation policies; incident response integration and runbook design; dashboards for different audiences and real-time BI considerations; SLOs and SLAs, error budgets, and the cost trade-offs of collecting telemetry. For streaming systems, include strategies for detecting consumer lag, event loss, and late data, and approaches that enable rapid debugging and root cause analysis while avoiding alert fatigue.

Easy | Technical
Describe the differences between threshold-based alerts, baseline/trend-based alerts, and anomaly-detection alerts. For a streaming pipeline whose throughput is measured in events per second, give one concrete example where each alert type is the best fit, including a typical detection window and the reasoning behind the choice.
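As a starting point for comparing the three alert types, here is a minimal Python sketch. The function names, the 50% drop ratio, and the z-score limit of 3 are illustrative assumptions, not values from any particular monitoring product.

```python
from statistics import mean, stdev


def threshold_alert(samples, limit):
    """Threshold-based: fire only if every sample in the window exceeds a fixed limit."""
    return len(samples) > 0 and all(s > limit for s in samples)


def baseline_alert(current, baseline, max_drop=0.5):
    """Baseline-based: fire if the current value falls more than max_drop below a
    reference value (for example, the same hour last week)."""
    return baseline > 0 and current < baseline * (1 - max_drop)


def anomaly_alert(history, current, z_limit=3.0):
    """Anomaly detection (simple z-score): fire if the current value lies more than
    z_limit standard deviations away from the mean of recent history."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    sigma = stdev(history)
    if sigma == 0:
        return current != history[0]
    return abs(current - mean(history)) / sigma > z_limit


if __name__ == "__main__":
    # Consumer lag (seconds) sampled over a short window, hard limit of 300 s.
    print(threshold_alert([310, 320, 305], 300))
    # Events per second now vs. the same time last week.
    print(baseline_alert(400, 1000))
    # Error count per minute compared with recent history.
    print(anomaly_alert([100, 102, 98, 101, 99], 150))
```

The threshold check suits hard limits such as a lag SLO, the baseline check suits metrics with predictable daily or weekly cycles, and the z-score check suits metrics with no obvious fixed limit. A production detector would usually account for seasonality rather than rely on a plain z-score.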
Medium | Technical
Write a Prometheus alerting rule (YAML) that fires when, for any consumer_group, the Kafka consumer lag averaged across its partitions exceeds 300 seconds for more than 5 minutes. Assume a Prometheus metric named kafka_consumer_lag_seconds with labels {consumer_group, partition}. Include grouping, annotations with runbook_url, and a severity label.
Hard | System Design
Design alert suppression and deduplication rules to prevent alert storms during cascading failures, such as under-provisioned Kafka brokers that cause lag to spike in many consumer groups. The rules should surface a representative set of alerts (one per region or per service), keep context for root-cause analysis, and avoid hiding unrelated new alerts. Describe algorithms and practical implementations in common alerting platforms (Prometheus Alertmanager, Datadog, PagerDuty).
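One way to reason about the grouping and suppression logic is the Python sketch below. It assumes alerts are plain dictionaries with hypothetical `alertname`, `region`, and `service` keys, and the alert names `KafkaBrokerSaturated` and `ConsumerLagHigh` are made up for illustration. The idea loosely mirrors Alertmanager's `group_by` and `inhibit_rules`, but it is not that tool's implementation.

```python
from collections import defaultdict


def summarize(alerts, cause="KafkaBrokerSaturated", symptom="ConsumerLagHigh"):
    """Collapse an alert storm into one notification per (region, alertname).

    Symptom alerts are marked suppressed only in regions where the suspected
    root-cause alert is also firing; everything else passes through untouched.
    """
    # Regions where the suspected root cause is currently firing.
    cause_regions = {a["region"] for a in alerts if a["alertname"] == cause}

    # Deduplicate: group alerts that share a region and an alert name.
    groups = defaultdict(list)
    for a in alerts:
        groups[(a["region"], a["alertname"])].append(a)

    notifications = []
    for (region, name), members in sorted(groups.items()):
        notifications.append({
            "region": region,
            "alertname": name,
            "count": len(members),
            # Keep the affected services as context for root-cause analysis.
            "services": sorted({m["service"] for m in members}),
            "suppressed": name == symptom and region in cause_regions,
        })
    return notifications


if __name__ == "__main__":
    storm = [
        {"alertname": "KafkaBrokerSaturated", "region": "us-east", "service": "kafka"},
        {"alertname": "ConsumerLagHigh", "region": "us-east", "service": "billing"},
        {"alertname": "ConsumerLagHigh", "region": "us-east", "service": "search"},
        {"alertname": "ConsumerLagHigh", "region": "eu-west", "service": "billing"},
        {"alertname": "DiskFull", "region": "us-east", "service": "db"},
    ]
    for n in summarize(storm):
        print(n)
```

Suppressed groups are kept in the output rather than dropped, so responders can still see which services were affected. Lag alerts in a region without a firing broker alert, and unrelated alerts such as the hypothetical `DiskFull`, are never suppressed.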
Medium | Technical
Design an observability pipeline using OpenTelemetry collectors to instrument a heterogeneous stack: Kafka producers, Spark streaming jobs, and a downstream BigQuery warehouse. Describe the flow for traces, metrics, and logs from app instrumentation through collectors to long-term storage (Prometheus remote_write, OTLP exporter, cloud logging), including sampling configuration and how to support cross-telemetry correlation for root cause analysis.
Medium | Technical
During a planned data migration, telemetry shows elevated CPU and increased end-to-end latency. Alerts are noisy and distracting. Propose a safe plan to temporarily adjust alerting rules and notification routing during the migration so critical signals remain visible, noise is minimized, and nothing important is missed. Explain how you'd communicate and roll back these changes.
