Network Monitoring and Observability Questions

Covers strategies and tooling for observing network health and performance. Topics include active health checks versus passive telemetry; what to measure at the interface and flow level; flow-based telemetry such as NetFlow and sFlow, and export formats such as Internet Protocol Flow Information Export (IPFIX); Simple Network Management Protocol (SNMP) based metrics; metrics hierarchy and granularity; retention and aggregation considerations; alerting strategy to manage signal-to-noise and avoid alert fatigue; dashboards and status pages; runbooks and incident playbooks; topology and capacity planning; and common observability platforms and integrations such as Prometheus, the Elastic Stack, Splunk, or cloud-native alternatives. Interviews evaluate the ability to design what to monitor, how to collect it, and how to turn telemetry into reliable operational signals.

Medium | System Design

Describe how you would integrate network telemetry alerts with PagerDuty (or similar) and automated runbooks. Include alert enrichment (context), deduplication, escalation policies, safe automated remediation examples (e.g., BGP session restart), runbook invocation, and mechanisms to avoid remediation loops.
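
One way to make the deduplication and loop-avoidance parts of an answer concrete is a stable deduplication key plus a rate-limited remediation guard. The sketch below is illustrative only: `dedup_key`, `RemediationGuard`, and the device, check, and peer names are hypothetical, and the attempt limits are assumptions rather than recommended values. PagerDuty's Events API v2 accepts a `dedup_key` field, so a key built this way could be passed there.

```python
import hashlib
import time
from collections import defaultdict, deque
from typing import Optional


def dedup_key(device: str, check: str, peer: str = "") -> str:
    """Build a stable key so repeated firings of one condition map to one incident."""
    raw = f"{device}|{check}|{peer}"
    return hashlib.sha256(raw.encode()).hexdigest()[:16]


class RemediationGuard:
    """Allow at most max_attempts automated actions per key within window_s seconds."""

    def __init__(self, max_attempts: int = 2, window_s: float = 3600.0):
        self.max_attempts = max_attempts
        self.window_s = window_s
        self._attempts = defaultdict(deque)  # key -> timestamps of past attempts

    def allow(self, key: str, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        attempts = self._attempts[key]
        # Forget attempts that have aged out of the sliding window.
        while attempts and now - attempts[0] >= self.window_s:
            attempts.popleft()
        if len(attempts) >= self.max_attempts:
            return False  # budget exhausted: stop automating and escalate to a human
        attempts.append(now)
        return True


guard = RemediationGuard(max_attempts=2, window_s=3600)
key = dedup_key("edge-1", "bgp_session_down", "192.0.2.1")
if guard.allow(key):
    print("run the automated runbook step for", key)
else:
    print("escalate to on-call for", key)
```

When the guard refuses, the alert goes to a person instead of triggering another restart, which is what breaks a remediation loop on a flapping session.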

Medium | Technical

Describe how to detect asymmetric routing using passive flow records, traceroutes, packet-capture evidence, and ARP/NDP tables. Explain what telemetry you'd collect to confirm asymmetry, how you'd alert on it, and how it can impact troubleshooting and SLO calculations.
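
For the passive flow record part, one possible approach is to group records into bidirectional conversations and compare which exporters saw each direction. The sketch below assumes a simplified, hypothetical `Flow` record (exporter name plus 5-tuple), not a real NetFlow or IPFIX schema, and it only flags conversations where both directions were observed somewhere.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Set, Tuple


class Flow(NamedTuple):
    exporter: str  # device that exported the record
    src: str
    dst: str
    sport: int
    dport: int
    proto: int


def find_asymmetric(flows: Iterable[Flow]) -> Dict[tuple, Tuple[Set[str], Set[str]]]:
    """Return conversations whose two directions were seen by different exporter sets."""
    seen = defaultdict(lambda: (set(), set()))
    for f in flows:
        a, b = (f.src, f.sport), (f.dst, f.dport)
        forward = a <= b  # canonical ordering so both directions share one key
        lo, hi = (a, b) if forward else (b, a)
        seen[(f.proto, lo, hi)][0 if forward else 1].add(f.exporter)
    return {
        key: dirs
        for key, dirs in seen.items()
        if dirs[0] and dirs[1] and dirs[0] != dirs[1]
    }


flows = [
    Flow("rtr-a", "10.0.0.1", "10.0.1.1", 40000, 443, 6),
    Flow("rtr-b", "10.0.1.1", "10.0.0.1", 443, 40000, 6),  # return path via rtr-b
    Flow("rtr-a", "10.0.0.2", "10.0.1.1", 40001, 443, 6),
    Flow("rtr-a", "10.0.1.1", "10.0.0.2", 443, 40001, 6),  # symmetric via rtr-a
]
for key, (fwd, rev) in find_asymmetric(flows).items():
    print(key, "forward via", sorted(fwd), "reverse via", sorted(rev))
```

A conversation with only one direction visible is also a hint of asymmetry, but it can equally be a sampling or export gap, so it would need confirmation from traceroutes or packet captures before alerting.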

Easy | Technical

Explain the differences between active health checks and passive telemetry for network monitoring. Provide concrete examples (e.g., HTTP/TCP synthetic probes vs NetFlow/SNMP collectors), and discuss trade-offs in latency to detection, measurement overhead, coverage, and the kinds of failures each method is best at revealing.
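
The contrast can be illustrated with one small function for each style. This is a minimal sketch under stated assumptions: `tcp_probe` is a hypothetical active check that generates its own traffic, and `counter_rate_bps` is a hypothetical passive calculation over two polled octet counter readings (for example SNMP `ifHCInOctets`), assuming at most one counter wrap between polls.

```python
import socket
import time
from typing import Optional, Tuple


def tcp_probe(host: str, port: int, timeout_s: float = 2.0) -> Tuple[bool, Optional[float]]:
    """Active check: open a TCP connection and report (success, connect time in ms)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True, (time.monotonic() - start) * 1000.0
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False, None


def counter_rate_bps(prev_octets: int, curr_octets: int, interval_s: float,
                     counter_bits: int = 64) -> float:
    """Passive metric: bits per second from two octet counter readings."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    # Modulo arithmetic handles a single counter wrap between the two polls.
    delta = (curr_octets - prev_octets) % (2 ** counter_bits)
    return delta * 8 / interval_s


print(counter_rate_bps(0, 1_250_000, 10))  # 1,250,000 octets in 10 s
```

The probe adds traffic and only covers the paths it exercises, but it can detect a failure even when no user traffic is flowing. The counter calculation adds no traffic and covers every polled interface, but its detection time is bounded by the polling interval.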

Medium | System Design

Design a Grafana dashboard for a network operations center (NOC) that provides first-response situational awareness. List the panels, key KPIs (packet loss, latency at P50/P95/P99, top talkers, interface errors), drilldowns, alert annotations, and topology overlays, and explain how to display incident status and runbook links for 24/7 operations.
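
The latency and loss KPIs named above are typically computed by the data source rather than by the dashboard itself. As a rough illustration of what those panels summarize, the sketch below computes nearest-rank percentiles and a loss percentage from raw probe results. The function names and sample values are hypothetical, and a real backend may use a different percentile method, such as interpolation or histogram buckets.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: int) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def loss_pct(sent: int, received: int) -> float:
    """Packet loss as a percentage of probes sent."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    return 100 * (sent - received) / sent


rtts_ms = [12.0, 11.5, 13.2, 40.1, 12.4, 11.9, 12.8, 95.0, 12.1, 12.3]
for p in (50, 95, 99):
    print(f"P{p} latency: {percentile(rtts_ms, p)} ms")
print(f"loss: {loss_pct(sent=1000, received=990)} %")
```

Showing P50 alongside P95 and P99 matters because a healthy median can hide a degraded tail, which is often what first responders need to see.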

Hard | Technical

Create a detailed alerting policy for a critical network fabric SLO: 99.99% packet delivery within 100ms. Define detection windows and aggregation rules, alert severity levels tied to error budget burn, notification and escalation flows, automated mitigation steps, and how SREs should respond at each severity.
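
Tying severity to error budget burn can be sketched as a multi-window burn-rate check. The example below assumes a "bad" packet is one that is lost or delivered after 100 ms, and it borrows the commonly cited thresholds from SRE practice for a 30-day SLO period (burn rate 14.4 over 1 h and 5 min to page, 6 over 6 h and 30 min to ticket). Those thresholds, the window lengths, and the function names are assumptions for illustration, not values given in the question.

```python
ERROR_BUDGET_RATIO = 0.0001  # 1 - 0.9999: fraction of packets allowed to be bad

PAGE_BURN = 14.4   # assumed: burns about 2% of a 30-day budget in 1 hour
TICKET_BURN = 6.0  # assumed: burns about 5% of a 30-day budget in 6 hours


def burn_rate(bad_packets: int, total_packets: int) -> float:
    """How many times faster than 'exactly on budget' the error budget is burning."""
    if total_packets == 0:
        return 0.0
    return (bad_packets / total_packets) / ERROR_BUDGET_RATIO


def classify(burn_1h: float, burn_5m: float, burn_6h: float, burn_30m: float) -> str:
    """Require both a long and a short window so alerts clear soon after recovery."""
    if burn_1h >= PAGE_BURN and burn_5m >= PAGE_BURN:
        return "page"
    if burn_6h >= TICKET_BURN and burn_30m >= TICKET_BURN:
        return "ticket"
    return "ok"


# Hypothetical counts: 20 bad packets out of 10,000 in both the 1 h and 5 min windows.
fast = burn_rate(20, 10_000)
slow = burn_rate(8, 10_000)
print(classify(burn_1h=fast, burn_5m=fast, burn_6h=slow, burn_30m=slow))
```

The short window in each pair keeps an alert from firing long after the fabric has recovered, and the long window keeps brief blips from paging anyone.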
