InterviewStack.io

Data Pipeline Monitoring and Observability Questions

Focuses on designing monitoring and observability specifically for data pipelines and streaming workflows. Key areas include instrumenting pipeline stages; tracking health and business-level metrics such as latency, throughput, volume, and error rates; detecting anomalies and backpressure; ensuring data quality and completeness; implementing lineage and impact analysis for upstream failures; setting service-level objectives and alerts for pipeline health; and enabling rapid debugging and recovery using logs, metrics, traces, and lineage data. Also covers tooling choices for pipeline telemetry, alert routing and escalation, and operational runbooks.

Easy · Technical
26 practiced
Write a SQL query (ANSI SQL) that calculates daily completeness per data_source and partition_date for a table of ingested events. The table schema:
events(event_id STRING, data_source STRING, occurred_at TIMESTAMP, partition_date DATE, payload JSON)
Completeness is defined as: number_of_events_ingested / expected_events_for_data_source_on_that_date. Assume there is a table expected_counts(data_source STRING, partition_date DATE, expected_count BIGINT). Return rows where completeness < 0.95.
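
One possible shape of an answer, exercised here against SQLite via Python's sqlite3 module so the join and threshold logic can be checked end to end. The tables and sample rows are illustrative, and SQLite's TEXT/INTEGER types stand in for the STRING/TIMESTAMP/BIGINT types in the question:

```python
import sqlite3

# Illustrative in-memory tables mirroring the schema in the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (
    event_id TEXT, data_source TEXT, occurred_at TEXT,
    partition_date TEXT, payload TEXT
);
CREATE TABLE expected_counts (
    data_source TEXT, partition_date TEXT, expected_count INTEGER
);
""")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?, ?)",
    # 3 of 3 expected events ingested for source 'a', 2 of 10 for source 'b'
    [("e1", "a", "t", "2024-01-01", "{}"),
     ("e2", "a", "t", "2024-01-01", "{}"),
     ("e3", "a", "t", "2024-01-01", "{}"),
     ("e4", "b", "t", "2024-01-01", "{}"),
     ("e5", "b", "t", "2024-01-01", "{}")],
)
conn.executemany(
    "INSERT INTO expected_counts VALUES (?, ?, ?)",
    [("a", "2024-01-01", 3), ("b", "2024-01-01", 10)],
)

# Aggregate ingested counts per (data_source, partition_date), join against
# expected_counts, and keep partitions under the 0.95 completeness threshold.
COMPLETENESS_SQL = """
SELECT e.data_source,
       e.partition_date,
       COUNT(*) * 1.0 / x.expected_count AS completeness
FROM events e
JOIN expected_counts x
  ON x.data_source = e.data_source
 AND x.partition_date = e.partition_date
GROUP BY e.data_source, e.partition_date, x.expected_count
HAVING COUNT(*) * 1.0 / x.expected_count < 0.95
"""

rows = conn.execute(COMPLETENESS_SQL).fetchall()
print(rows)  # only source 'b' is incomplete: 2/10 = 0.2
```

One caveat worth raising in an interview: the inner join silently drops any (data_source, partition_date) with zero ingested rows. Driving the query from expected_counts with a LEFT JOIN makes completely missing partitions surface as completeness 0.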
Easy · Technical
26 practiced
Explain the difference between logs, metrics, and traces in the context of data pipeline observability. Give one concrete example of when you would rely primarily on each one to investigate a late-arriving record scenario in a multi-stage streaming job.
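To make the three signals concrete, here is a stdlib-only sketch of one pipeline stage emitting all of them; the stage name "enrich", the metric names, and the process function are hypothetical, and the dict-based metric store and span stand in for a real metrics backend and tracing SDK:

```python
import json
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("pipeline")

# Toy in-process metric store standing in for a real metrics backend.
metrics = {"stage_lag_seconds": 0.0, "records_processed_total": 0}

def process(record, trace_id):
    # Trace: the trace_id ties this stage's span to spans emitted by the
    # stages before and after it, so one late record can be followed end to end.
    span = {"trace_id": trace_id, "stage": "enrich", "start": time.time()}

    # Metrics: aggregates answer "is the whole stage falling behind?"
    metrics["stage_lag_seconds"] = time.time() - record["occurred_at"]
    metrics["records_processed_total"] += 1

    # Log: the per-record detail you search once metrics and traces have
    # localized the problem to this stage.
    log.info(json.dumps({"event_id": record["event_id"],
                         "trace_id": trace_id,
                         "lag_s": round(metrics["stage_lag_seconds"], 3)}))
    span["end"] = time.time()
    return span

# A record whose event time is two minutes old, i.e. late-arriving.
span = process({"event_id": "e42", "occurred_at": time.time() - 120},
               str(uuid.uuid4()))
```

In the late-arrival scenario, the lag metric tells you which stage is behind, the trace tells you where that one record spent its time, and the structured log gives you the record-level evidence.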
Medium · Technical
29 practiced
You observe that an alert on high error rate is firing frequently but most pages are resolved by restarting a downstream worker. As an SRE, how would you investigate to find the root cause and what metrics or telemetry would you add to avoid relying on restarts as the default fix?
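One kind of telemetry that often explains "restart fixes it": saturation of an internal queue in the downstream worker. A minimal sketch, assuming a hypothetical InstrumentedQueue wrapper, of the gauges that would surface this before the error rate spikes:

```python
from collections import deque

# Hypothetical instrumentation for a downstream worker whose symptom is a
# high error rate but whose real problem may be a saturated internal queue.
class InstrumentedQueue:
    def __init__(self, maxlen):
        self.q = deque(maxlen=maxlen)
        self.dropped_total = 0       # backpressure signal: overflow drops
        self.enqueued_total = 0

    def put(self, item):
        if len(self.q) == self.q.maxlen:
            self.dropped_total += 1  # alert on rate of change of this counter
            return False
        self.q.append(item)
        self.enqueued_total += 1
        return True

    def gauges(self):
        # Exported as gauges so dashboards show saturation building up,
        # instead of the restart resetting all evidence to zero.
        return {"queue_depth": len(self.q),
                "queue_fill_ratio": len(self.q) / self.q.maxlen,
                "dropped_total": self.dropped_total}

q = InstrumentedQueue(maxlen=4)
for i in range(6):
    q.put(i)
print(q.gauges())  # fill ratio 1.0 with drops: saturation, not transient errors
```

A fill ratio pinned at 1.0 plus a climbing drop counter points at a throughput or leak problem in the worker, which is why a restart "resolves" the page without fixing anything.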
Easy · Technical
24 practiced
Draft the essential sections of a runbook for the common failure: a streaming checkpoint failure for a Flink job that causes the job to restart frequently. Include steps for detection, immediate mitigations, triage, data loss risk assessment, and post-incident actions.
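For the detection section, a runbook often links a small script that inspects the job's checkpoint statistics (Flink exposes these via its REST API under /jobs/&lt;job-id&gt;/checkpoints) and decides whether the failure pattern crosses a paging threshold. The payload below is a trimmed, illustrative example rather than a full Flink response, and the threshold is an assumption:

```python
# Sketch of a detection helper: given checkpoint stats (shaped like a trimmed
# Flink /jobs/<job-id>/checkpoints response), return the reasons to page.
def checkpoint_alert(stats, max_failed=3):
    counts = stats["counts"]
    latest_failed = stats.get("latest", {}).get("failed")
    reasons = []
    if counts["failed"] >= max_failed:
        reasons.append(f"{counts['failed']} failed checkpoints")
    if latest_failed and latest_failed.get("failure_message"):
        reasons.append(f"latest failure: {latest_failed['failure_message']}")
    return reasons

# Illustrative payload: repeated failures plus a timeout message, the typical
# signature of the checkpoint-failure-driven restart loop in this runbook.
sample = {
    "counts": {"completed": 12, "failed": 4, "in_progress": 0},
    "latest": {"failed": {"failure_message":
                          "Checkpoint expired before completing"}},
}
print(checkpoint_alert(sample))
```

The failure message itself feeds the triage section: a timeout points toward state size or backpressure, while an I/O error points toward the checkpoint storage backend.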
Medium · System Design
28 practiced
Design monitoring and alerting for a nightly batch ETL that ingests 5 TB/day into a data warehouse. Requirements: detect missing partitions, slow job completion, and disproportionate resource consumption. Describe the metrics, alert thresholds, retention, dashboard widgets, and escalation paths you would implement.
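The two data-driven checks in that design (missing partitions and slow completion) reduce to a small nightly evaluation. A sketch under assumed thresholds (a 06:00 completion SLO and a 7-day partition lookback, both illustrative, as is the function name):

```python
from datetime import date, timedelta

# Hypothetical nightly check: compare the partitions the warehouse reports
# as loaded against the expected lookback window, and check the job's
# completion time against its SLO. Thresholds here are illustrative.
def nightly_etl_alerts(loaded_partitions, run_date, completion_hour,
                       lookback_days=7, slo_completion_hour=6):
    alerts = []
    expected = {run_date - timedelta(days=d)
                for d in range(1, lookback_days + 1)}
    missing = sorted(expected - set(loaded_partitions))
    if missing:
        alerts.append(("missing_partitions", missing))
    if completion_hour is None:
        alerts.append(("job_not_finished", None))      # page: hard SLO breach
    elif completion_hour > slo_completion_hour:
        alerts.append(("slow_completion", completion_hour))  # ticket, not page
    return alerts

# Partitions 2024-05-04 .. 2024-05-09 loaded; 2024-05-03 is missing,
# and the job finished at 08:00 against a 06:00 SLO.
loaded = [date(2024, 5, d) for d in range(4, 10)]
print(nightly_etl_alerts(loaded, date(2024, 5, 10), completion_hour=8))
```

Resource-consumption checks do not fit this shape; those are better expressed as anomaly rules on CPU/IO metrics relative to a trailing baseline in the metrics backend.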
