InterviewStack.io

Monitoring and Alerting Questions

Designing monitoring, observability, and alerting for systems with real-time or near-real-time requirements. Candidates should demonstrate how to select and instrument key metrics (end-to-end and per-stage latency, throughput, error rates, processing lag, queue lengths, resource usage), logging and distributed tracing strategies, and business and data-quality metrics. Topics cover alerting approaches including threshold-based, baseline- and trend-based, and anomaly detection; designing alert thresholds to balance sensitivity against false positives; severity classification and escalation policies; incident-response integration and runbook design; dashboards for different audiences and real-time BI considerations; and SLOs and SLAs, error budgets, and cost trade-offs when collecting telemetry. For streaming systems, this includes strategies for detecting consumer lag, event loss, and late data, and approaches that enable rapid debugging and root-cause analysis while avoiding alert fatigue.

Hard · System Design
Design a monitoring and alerting system for an architecture processing 1M events/sec across multi-region Kafka clusters, Kubernetes consumers, and a BigQuery warehouse. Requirements: detect consumer lag, cross-region replication lag, event loss, and schema drift, and support per-tenant SLOs. Describe the architecture, metric/log/trace flows, aggregation and retention, alerting severity tiers, and automated remediation options to meet Mean Time To Detect (MTTD) and Mean Time To Repair (MTTR) objectives.
Easy · Technical
Define an SLO and an SLA for 'feature table freshness' used by online models. Give a concrete SLO example (e.g., 95% of partitions < 5 minutes lag over a 30-day window), explain how the SLA relates to the SLO, and describe how the team should measure and report an error budget for this SLO.
Medium · Technical
Write Python pseudocode or Spark Structured Streaming logic that computes consumer lag per partition by querying the latest broker offsets and consumer committed offsets, then calculates p50 and p95 lag across partitions and emits these percentiles as metrics to a metrics sink. Assume you can call an admin API to get offsets; focus on efficient offset polling and avoiding hot loops.
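One possible shape of an answer, as a hedged sketch: `fetch_latest`, `fetch_committed`, and `emit` are hypothetical stand-ins for the broker admin API and the metrics sink, and the fixed polling interval is one simple way to honour the no-hot-loop constraint:

```python
import time
from typing import Dict

def percentile(sorted_vals, q):
    """Nearest-rank percentile over a pre-sorted list."""
    if not sorted_vals:
        return 0
    return sorted_vals[round(q * (len(sorted_vals) - 1))]

def compute_lag_percentiles(latest: Dict[int, int],
                            committed: Dict[int, int]) -> Dict[str, int]:
    """Per-partition lag = latest broker offset - committed consumer offset."""
    lags = sorted(max(0, latest[p] - committed.get(p, 0)) for p in latest)
    return {"lag_p50": percentile(lags, 0.50),
            "lag_p95": percentile(lags, 0.95)}

def poll_loop(fetch_latest, fetch_committed, emit, interval_s: float = 30.0):
    """One batched admin call per cycle, then sleep: no hot loop."""
    while True:
        emit(compute_lag_percentiles(fetch_latest(), fetch_committed()))
        time.sleep(interval_s)  # back off between polls

# Example with static offsets standing in for the admin API:
stats = compute_lag_percentiles({0: 100, 1: 200, 2: 300},
                                {0: 90, 1: 150, 2: 100})
```

Batching all partitions into a single admin call per cycle, rather than one request per partition, is the main efficiency point the question is probing for.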
Hard · System Design
Design an automated framework to validate monitoring rules and runbooks before they are promoted to production. Requirements: simulate incidents (lag spikes, partial data loss, schema changes), verify that expected alerts fire and route correctly, test that runbook steps (e.g., restart job, trigger backfill) execute successfully or provide correct guidance, and integrate with CI. Describe the simulator, metric injection approach, test harness, and failure criteria.
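A toy slice of such a harness, as a sketch: a threshold rule evaluated against an injected lag-spike series. The `AlertRule` shape, the `for_points` debounce, and the synthetic numbers are invented here for illustration:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class AlertRule:
    metric: str
    threshold: float
    for_points: int  # consecutive points above threshold before firing

def evaluate(rule: AlertRule, series: List[float]) -> bool:
    """Return True if the rule would fire on the injected metric series."""
    streak = 0
    for value in series:
        streak = streak + 1 if value > rule.threshold else 0
        if streak >= rule.for_points:
            return True
    return False

# Simulated incident: consumer lag spikes from ~100 to ~50,000 messages.
lag_spike = [100, 120, 110, 48_000, 52_000, 51_000, 50_500]
rule = AlertRule(metric="consumer_lag_max", threshold=10_000, for_points=3)
fired_on_incident = evaluate(rule, lag_spike)

# Failure criterion in CI: the same rule must stay silent on a healthy baseline.
baseline = [100, 120, 90, 110, 130]
fired_on_baseline = evaluate(rule, baseline)
```

In CI, each promoted rule would run against both an incident fixture (must fire) and a baseline fixture (must not fire), which covers the two failure criteria the question names: missed alerts and false positives.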
Easy · Technical
Describe how you would instrument a Spark Structured Streaming job to capture key observability signals: per-stage latency, input and output counts, watermark delays, processing error counts, and resource usage (CPU/memory). Include where to place instrumentation (source, after transformations, sinks), which metric types to use (counter, gauge, histogram), and how to export those metrics to a system like Prometheus or a cloud metrics backend.
