
Monitoring and Alerting Questions

Designing monitoring, observability, and alerting for systems with real-time or near-real-time requirements. Candidates should demonstrate how to select and instrument key metrics (end-to-end and per-stage latency, throughput, error rates, processing lag, queue lengths, resource usage), apply logging and distributed tracing strategies, and track business and data quality metrics. Cover alerting approaches including threshold-based, baseline- and trend-based, and anomaly detection; designing alert thresholds to balance sensitivity against false positives; severity classification and escalation policies; incident response integration and runbook design; dashboards for different audiences and real-time BI considerations; SLOs and SLAs, error budgets, and cost trade-offs when collecting telemetry. For streaming systems, include strategies for detecting consumer lag, event loss, and late data, and approaches that enable rapid debugging and root cause analysis while avoiding alert fatigue.
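As a small illustration of the SLO and error-budget arithmetic mentioned above, the following is a minimal sketch. The function names (`error_budget_remaining`, `burn_rate`) and the request-count framing are assumptions made for illustration, not part of any specific monitoring product.

```python
def _check_target(slo_target: float) -> None:
    # The SLO target is a success ratio such as 0.999; 1.0 would leave no budget.
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")


def error_budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    _check_target(slo_target)
    if total == 0:
        return 1.0  # no traffic, so nothing has been spent
    allowed_failures = (1.0 - slo_target) * total
    return 1.0 - failed / allowed_failures


def burn_rate(slo_target: float, total: int, failed: int) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    A value of 1.0 means the budget is being spent exactly at the allowed pace.
    """
    _check_target(slo_target)
    if total == 0:
        return 0.0
    return (failed / total) / (1.0 - slo_target)
```

For example, with a 99.9% target, 1,000,000 requests, and 250 failures, roughly three quarters of the budget remains and the burn rate is about 0.25. Alerting on burn rate over more than one window is one common way to trade sensitivity against false positives.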

Hard | Technical
54 practiced
Propose an end-to-end plan to build an ML-based anomaly detection system for monitoring business metrics (e.g., revenue, purchase-rate). Cover label creation, feature engineering, model selection, evaluation metrics, productionization, and strategies to detect and mitigate model drift.
Medium | Technical
49 practiced
For BI telemetry pipelines, propose a tagging and naming convention for metrics, logs, and traces that supports multi-team usage and efficient querying. Include examples and rules to limit cardinality and to make ownership discoverable.
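One possible shape for such a convention, sketched as a validator. Everything here is a hypothetical example rather than an established standard: the `<team>.<pipeline>.<metric>_<unit>` pattern, the unit suffixes, the allowed and required tag keys, and the function name `validate_metric`.

```python
import re

# Hypothetical convention: <team>.<pipeline>.<metric>_<unit>, all lowercase.
METRIC_NAME = re.compile(
    r"^[a-z][a-z0-9_]*\.[a-z][a-z0-9_]*\.[a-z][a-z0-9_]*_(seconds|bytes|total|ratio)$"
)

# Assumed tag policy: a small closed set of low-cardinality keys.
ALLOWED_TAGS = {"team", "env", "pipeline", "region"}
# "team" makes ownership discoverable; "env" separates prod from test data.
REQUIRED_TAGS = {"team", "env"}


def validate_metric(name: str, tags: dict[str, str]) -> list[str]:
    """Return a list of convention violations (empty when the metric is valid)."""
    problems = []
    if not METRIC_NAME.match(name):
        problems.append(f"name {name!r} does not match <team>.<pipeline>.<metric>_<unit>")
    missing = REQUIRED_TAGS - set(tags)
    if missing:
        problems.append(f"missing required tags: {sorted(missing)}")
    unknown = set(tags) - ALLOWED_TAGS
    if unknown:
        # Unlisted keys (user_id, request_id, ...) are rejected to cap cardinality.
        problems.append(f"tags not in the allow-list: {sorted(unknown)}")
    return problems
```

Under these assumed rules, `dataops.orders_ingest.lag_seconds` tagged with `team` and `env` passes, while `OrdersLag` tagged with `user_id` is reported for its name, its missing required tags, and its unlisted tag key. A check like this could run in CI or at the ingestion gateway so violations are caught before they reach the metrics store.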
Hard | Technical
58 practiced
Design runbooks that support automated playbooks for common BI incidents (e.g., ingestion stopped, pipeline lag, metric regressions). What structure, checks, and automation hooks would you include so runbooks can be executed both manually and by automation tooling?
Easy | Technical
91 practiced
Design two dashboard layouts (describe textually) for the same KPI: one intended for executives and one for data-ops engineers. For each layout include the key widgets, time ranges, refresh intervals, drill-down affordances, and how alerts should be surfaced or linked to runbooks.
Hard | System Design
60 practiced
Propose a retention and aggregation strategy for metrics and logs for a BI org that balances queryability and cost. Include choices for hot vs warm vs cold tiers, rollups (e.g., 1s -> 1m -> 1h), downsampling, and materialization of common queries. Describe how you would implement rollup pipelines.
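To make the rollup idea concrete, here is a minimal in-memory sketch of a 1s -> 1m -> 1h chain. It assumes integer epoch-second timestamps and stores mergeable aggregates (count, sum, min, max) so that coarser tiers can be built from finer ones without re-reading raw points. The names `Bucket`, `rollup`, and `rerollup` are illustrative, and a real pipeline would also need to handle late data and persistence.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Bucket:
    """Mergeable aggregate for one time window."""
    count: int = 0
    total: float = 0.0
    minimum: float = float("inf")
    maximum: float = float("-inf")

    def add(self, value: float) -> None:
        self.count += 1
        self.total += value
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    def merge(self, other: "Bucket") -> None:
        self.count += other.count
        self.total += other.total
        self.minimum = min(self.minimum, other.minimum)
        self.maximum = max(self.maximum, other.maximum)

    @property
    def mean(self) -> float:
        return self.total / self.count if self.count else float("nan")


def rollup(points, width_s: int) -> dict:
    """Aggregate (epoch_seconds, value) points into windows of width_s seconds."""
    buckets = defaultdict(Bucket)
    for ts, value in points:
        buckets[ts - ts % width_s].add(value)  # key is the window start
    return dict(buckets)


def rerollup(buckets: dict, width_s: int) -> dict:
    """Build a coarser tier from an existing tier of buckets."""
    out = defaultdict(Bucket)
    for start, bucket in buckets.items():
        out[start - start % width_s].merge(bucket)
    return dict(out)
```

A possible usage: `minute = rollup(raw_points, 60)` followed by `hourly = rerollup(minute, 3600)`. Storing sums and counts rather than precomputed means is what keeps the averages exact across tiers; percentiles would need a mergeable sketch structure instead, which this example does not cover.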
