InterviewStack.io LogoInterviewStack.io

Monitoring Tools and Observability Questions

Covers hands on familiarity with modern monitoring and observability platforms and the practices for instrumenting and operating production systems. Candidates should be able to describe one or more tools such as Prometheus, Grafana, Datadog, CloudWatch, and explain how to write queries, design dashboards, and configure alerts. Include understanding of metrics collection, time series databases, log aggregation, distributed tracing, and common query languages used by these platforms. Also cover integrating monitoring with incident management systems such as PagerDuty and Opsgenie, defining service level indicators and objectives, setting alerting thresholds to reduce noise, and using dashboards and alerts to troubleshoot performance and availability issues.

HardSystem Design
73 practiced
Architect a multi-region observability pipeline for a global SaaS company ingesting 1B metrics/day and 10TB logs/day across 3 regions. Requirements: cross-region queryability, low-latency dashboards, disaster recovery, and cost controls. Describe ingestion layer, TSDB/log store choices, replication, query federation, and operational trade-offs.
HardTechnical
91 practiced
Write a PromQL expression (or expressions) that detects sudden cardinality growth by measuring the rate of increase of unique series for metric 'http_requests_total' grouped by label set (job, instance, endpoint, method, status) over the last 5 minutes. Explain how the expressions work and their limitations in noisy environments.
EasyTechnical
91 practiced
A client asks you to estimate monthly monitoring costs. List the quantitative inputs you need to produce a reliable estimate (e.g., number of hosts, metrics per host, custom metric cardinality, logs events/sec, average span size and spans per request, retention windows) and explain briefly how each factor drives ingestion and storage costs.
EasyTechnical
74 practiced
Explain metric cardinality, why it is harmful for Prometheus and other TSDBs, and provide three concrete strategies you would recommend to a client to control cardinality (with examples such as label whitelisting, relabeling, or bucketing). For each strategy state the trade-off.
HardSystem Design
83 practiced
Design a resilient OpenTelemetry Collector deployment for edge services with intermittent connectivity. Include local buffering strategies, disk usage limits, batching, retry and back-off policies, telemetry prioritization for limited bandwidth, and considerations for delivery semantics (at-least-once vs best-effort).

Unlock Full Question Bank

Get access to hundreds of Monitoring Tools and Observability interview questions and detailed answers.

Sign in to Continue

Join thousands of developers preparing for their dream job.