InterviewStack.io

Data Pipeline Monitoring and Observability Questions

Focuses on designing monitoring and observability specifically for data pipelines and streaming workflows. Key areas include instrumenting pipeline stages; tracking health and business-level metrics such as latency, throughput, volume, and error rates; detecting anomalies and backpressure; ensuring data quality and completeness; implementing lineage and impact analysis for upstream failures; setting service-level objectives and alerts for pipeline health; and enabling rapid debugging and recovery using logs, metrics, traces, and lineage data. Also covers tooling choices for pipeline telemetry, alert routing and escalation, and operational runbooks.

Easy · Technical
21 practiced
Implement (describe or write) a simple instrumentation approach in Python to measure processing duration and error count for a transformation function using the prometheus_client library. Describe label choices to avoid high-cardinality explosion and how you would expose these metrics from a fleet of workers.
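One way to approach this question, sketched here with stdlib stand-ins so it runs anywhere: wrap the transformation in a decorator that records duration and error counts keyed by a small, fixed label set. In production you would use `prometheus_client.Histogram` and `Counter` with the same labels and call `start_http_server()` in each worker so Prometheus can scrape every worker's `/metrics` endpoint; the dict-based metrics and the `normalize` stage below are illustrative assumptions, not a prescribed answer.

```python
import time
from collections import defaultdict
from functools import wraps

# Stdlib stand-ins for prometheus_client's Histogram and Counter.
DURATION_SECONDS = defaultdict(list)   # label tuple -> observed durations
ERRORS_TOTAL = defaultdict(int)        # label tuple -> error count

def instrumented(stage):
    """Record duration and errors, labeled only by stage name.

    Keeping labels to a small fixed set (stage, maybe outcome) avoids the
    high-cardinality explosion you would get from labeling by file name,
    record id, or timestamp.
    """
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            try:
                return fn(*args, **kwargs)
            except Exception:
                ERRORS_TOTAL[(stage,)] += 1
                raise
            finally:
                DURATION_SECONDS[(stage,)].append(time.monotonic() - start)
        return wrapper
    return decorator

@instrumented("normalize")
def normalize(record):
    # Example transformation stage: lowercase all keys.
    return {k.lower(): v for k, v in record.items()}
```

With a real Prometheus setup, each worker exposes its own scrape endpoint and a `worker`/`instance` label is added by the scraper, not by the application code, which keeps application-side cardinality bounded.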
Easy · Technical
28 practiced
Explain what data lineage is and describe three concrete ways lineage improves observability for data pipelines, such as impact analysis, debugging, and compliance. Provide example queries or API calls you would want from a lineage service when an upstream table is corrupted.
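The impact-analysis part of this question boils down to a reachability query over the lineage graph. A minimal sketch, assuming an invented in-memory edge map (a real lineage service, e.g. one backed by OpenLineage metadata, would expose an equivalent "list downstream datasets" API; all table names here are hypothetical):

```python
from collections import deque

# Hypothetical lineage edges: upstream table -> tables derived from it.
LINEAGE = {
    "raw.orders": ["staging.orders_clean"],
    "staging.orders_clean": ["marts.daily_revenue", "marts.customer_ltv"],
    "marts.daily_revenue": ["dashboards.exec_kpis"],
}

def downstream_impact(table):
    """BFS the lineage graph to find everything affected when `table`
    is corrupted -- the core of an impact-analysis query."""
    impacted, queue = set(), deque([table])
    while queue:
        for child in LINEAGE.get(queue.popleft(), []):
            if child not in impacted:
                impacted.add(child)
                queue.append(child)
    return sorted(impacted)
```

Calling `downstream_impact("raw.orders")` returns every dataset transitively derived from the corrupted table, which is exactly the set you would quarantine, backfill, or notify owners about.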
Medium · Technical
29 practiced
Compare options for telemetry storage: Prometheus for short-term metrics, Thanos/Cortex for long-term metrics, ELK for logs, and Tempo/Jaeger for traces. For each option discuss retention, query latency, cost profile, cardinality constraints, and typical use cases within a data pipeline observability architecture.
Medium · System Design
26 practiced
Design a monitoring and observability plan for a nightly ETL pipeline that must process 10 TB/day across 5,000 files and meet a 06:00 completion SLO. Describe the set of dashboards, essential metrics (per-file and aggregated), alert rules, runbook steps, and how you would validate the monitoring end-to-end during deployment.
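One alert rule worth discussing for the completion SLO is predictive rather than reactive: extrapolate per-file throughput to a projected finish time and fire well before 06:00 if the projection slips past the deadline. A small sketch, with all timestamps and counts invented for illustration:

```python
from datetime import datetime, timedelta

def projected_finish(start, now, files_done, files_total):
    """Linear extrapolation of completion time from per-file throughput.

    Useful as an early-warning alert: fire if the projection passes the
    06:00 SLO long before the deadline itself is actually missed.
    """
    if files_done == 0:
        return None  # no throughput signal yet; treat as its own alert
    elapsed = (now - start).total_seconds()
    rate = files_done / elapsed                    # files per second
    remaining = (files_total - files_done) / rate  # seconds left
    return now + timedelta(seconds=remaining)

# Illustrative run: started 01:00, it is now 03:00, 2,500 of 5,000 files done.
start = datetime(2024, 5, 1, 1, 0)
now = datetime(2024, 5, 1, 3, 0)
eta = projected_finish(start, now, files_done=2500, files_total=5000)
deadline = datetime(2024, 5, 1, 6, 0)
slo_at_risk = eta is None or eta > deadline
```

A linear projection is deliberately simple; in an interview it is worth noting its assumption of constant throughput and when a percentile-based or historical-baseline projection would be safer.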
Medium · Technical
21 practiced
Given these two monitoring tables, consumer_offsets(topic, partition, group_id, offset, committed_at) and topic_end_offsets(topic, partition, log_end_offset, recorded_at), write an ANSI SQL query to compute per-topic and per-group total lag and the maximum partition lag. Show sample output columns: topic, group_id, total_lag, max_partition_lag.
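The join/aggregate shape that question is after can be exercised end to end with an in-memory SQLite database standing in for the warehouse; the sample offsets below are invented, and `partition`/`offset` are double-quoted since they collide with SQL keywords:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE consumer_offsets(
    topic TEXT, "partition" INTEGER, group_id TEXT,
    "offset" INTEGER, committed_at TEXT);
CREATE TABLE topic_end_offsets(
    topic TEXT, "partition" INTEGER,
    log_end_offset INTEGER, recorded_at TEXT);
INSERT INTO consumer_offsets VALUES
  ('clicks', 0, 'etl', 100, '2024-05-01T00:00:00'),
  ('clicks', 1, 'etl', 150, '2024-05-01T00:00:00');
INSERT INTO topic_end_offsets VALUES
  ('clicks', 0, 180, '2024-05-01T00:00:00'),
  ('clicks', 1, 160, '2024-05-01T00:00:00');
""")

# Per-partition lag is log_end_offset - committed offset; aggregate it
# per (topic, group) for total_lag and max_partition_lag.
rows = conn.execute("""
SELECT c.topic,
       c.group_id,
       SUM(e.log_end_offset - c."offset") AS total_lag,
       MAX(e.log_end_offset - c."offset") AS max_partition_lag
FROM consumer_offsets c
JOIN topic_end_offsets e
  ON c.topic = e.topic AND c."partition" = e."partition"
GROUP BY c.topic, c.group_id
""").fetchall()
```

With the sample data, partition 0 lags by 80 and partition 1 by 10, so the query yields `total_lag = 90` and `max_partition_lag = 80` for `('clicks', 'etl')`. A fuller answer would also handle partitions with no committed offset yet (a LEFT JOIN from `topic_end_offsets`) and stale `committed_at` timestamps.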
