InterviewStack.io

Data Quality Debugging and Root Cause Analysis Questions

Focuses on investigative approaches and operational practices used when data or metrics are incorrect. Includes techniques for triage and root cause analysis such as comparing to historical baselines, segmenting data by dimensions, validating upstream sources and joins, replaying pipeline stages, checking pipeline timing and delays, and isolating the impact of schema changes. Candidates should discuss systematic debugging workflows, test and verification strategies, how to reproduce issues reliably, how to form and test hypotheses, and how to prioritize fixes and communicate when incidents affect downstream consumers.

Medium · Technical
A stream of events from one source is producing duplicates only for a subset of partitions. Describe instrumentation and debugging steps you would implement to isolate whether the issue is producer retries, broker replays, consumer rebalances, or an at-least-once sink. Include log fields and metrics you would add.
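One instrumentation the question is probing for is a per-partition duplicate detector built from the log fields it mentions. The sketch below is broker-agnostic, and the field names (`producer_seq`, `consumer_group_generation`) are illustrative assumptions: if a duplicated event_id reappears with the same producer sequence number, a producer retry or broker replay is the likelier cause; if it reappears under a new consumer-group generation, suspect a rebalance or an at-least-once sink re-emitting.

```python
from collections import Counter, defaultdict

def duplicate_report(records):
    """records: iterable of dicts carrying the log fields to add, e.g.
    {"partition": 0, "event_id": "...", "producer_seq": 7,
     "consumer_group_generation": 1}.  (Field names are illustrative.)
    Returns, per partition, a duplicate rate plus the distinct
    (producer_seq, generation) pairs each duplicated event_id was seen
    under -- the evidence that separates the four suspect causes."""
    seen = defaultdict(Counter)    # partition -> event_id -> count
    evidence = defaultdict(set)    # (partition, event_id) -> {(seq, gen)}
    for r in records:
        seen[r["partition"]][r["event_id"]] += 1
        evidence[(r["partition"], r["event_id"])].add(
            (r.get("producer_seq"), r.get("consumer_group_generation")))
    report = {}
    for p, counts in seen.items():
        dups = {eid: c for eid, c in counts.items() if c > 1}
        total = sum(counts.values())
        report[p] = {
            # fraction of records in this partition that were redundant copies
            "dup_rate": sum(c - 1 for c in dups.values()) / total,
            "dups": {eid: sorted(evidence[(p, eid)]) for eid in dups},
        }
    return report
```

In production the same counters would be emitted as metrics tagged by partition, so a spike confined to a subset of partitions is visible immediately.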
Hard · Technical
Implement an algorithm (describe in Python pseudocode) that computes lateness of events using event_ts and ingestion_ts, aggregates percent-late per 1-minute tumbling window with a watermark tolerance of N seconds, and emits alerts when percent-late exceeds a threshold. Assume out-of-order arrivals and bounded lateness.
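A minimal single-process sketch of the algorithm the question asks for, assuming events arrive as `(event_ts, ingestion_ts)` pairs in ingestion order and that the watermark is driven by ingestion time minus the tolerance N. An event is counted late when its lateness (`ingestion_ts - event_ts`) exceeds N, and a window is finalized (and possibly alerted on) only once the watermark has passed its end, which is how bounded out-of-order arrival is handled.

```python
from collections import defaultdict

WINDOW = 60             # 1-minute tumbling windows, in seconds
TOLERANCE_N = 5         # watermark tolerance: "late" means lateness > N
ALERT_THRESHOLD = 0.10  # alert when more than 10% of a window is late

def percent_late_windows(events, tolerance=TOLERANCE_N, window=WINDOW,
                         threshold=ALERT_THRESHOLD):
    """events: iterable of (event_ts, ingestion_ts) epoch seconds, in
    ingestion order.  Returns [(window_start, pct_late, alert), ...]."""
    counts = defaultdict(lambda: [0, 0])  # window_start -> [total, late]
    watermark = float("-inf")
    results = []

    def flush(upto):
        # finalize windows whose end is at or before the watermark;
        # the tolerance is already folded into the watermark itself
        for ws in sorted(w for w in counts if w + window <= upto):
            total, late = counts.pop(ws)
            pct = late / total
            results.append((ws, round(pct, 3), pct > threshold))

    for event_ts, ingestion_ts in events:
        watermark = max(watermark, ingestion_ts - tolerance)
        ws = int(event_ts // window) * window   # tumbling-window start
        counts[ws][0] += 1
        if ingestion_ts - event_ts > tolerance:
            counts[ws][1] += 1
        flush(watermark)

    flush(float("inf"))  # end of stream: close all remaining windows
    return results
```

Events arriving after their window is flushed would be dropped here; a real stream processor would route them to a side output for auditing rather than silently discarding them.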
Medium · Technical
An ML feature suddenly contains nulls for many users after a nightly job. Describe a practical debugging sequence to isolate whether the nulls were introduced by schema changes upstream, a transformation bug, a timing/regional delay, or storage corruption. Include quick checks and ways to reproduce the issue reliably.
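One of the quick checks this question is after is segmenting the null rate by a candidate dimension, since the shape of the breakdown points at the cause: a spike confined to one region or source suggests a timing/regional delay or a single bad upstream partition, while a uniform spike across all segments points at a schema change or a transformation bug. A minimal sketch over in-memory rows (the `region`/`score` names in the usage are hypothetical):

```python
def null_rate_by_dimension(rows, feature, dimension):
    """rows: iterable of dicts.  Returns {dimension_value: null_rate}
    for the given feature, so a null spike can be localized to the
    segment(s) where it actually occurs."""
    totals, nulls = {}, {}
    for row in rows:
        key = row.get(dimension)
        totals[key] = totals.get(key, 0) + 1
        if row.get(feature) is None:
            nulls[key] = nulls.get(key, 0) + 1
    return {k: nulls.get(k, 0) / t for k, t in totals.items()}
```

Running the same check against the job's input snapshot and its output localizes the nulls to before or after the transformation, which is the first fork in the debugging sequence.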
Easy · Technical
Given a table events(event_id UUID PRIMARY KEY, user_id BIGINT, event_type TEXT, event_ts TIMESTAMP, source TEXT), write a SQL query to compute the daily null rate for user_id over the last 60 days grouped by source, and flag days where null rate > mean + 3*stddev computed over the preceding 28-day window. Explain assumptions about small sample sizes and how you'd avoid false positives.
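The flagging logic the question describes can be sketched in Python for one source, assuming daily aggregates `(day, null_count, total_count)` have already been computed (in SQL this would be the window-function step). The `min_days` guard addresses the small-sample concern: with too little history, the mean and stddev of the baseline window are unstable and almost any fluctuation would flag.

```python
from statistics import mean, pstdev

def flag_null_rate_anomalies(daily, lookback=28, k=3.0, min_days=7):
    """daily: [(day, null_count, total_count), ...] sorted by day, for a
    single source.  Flags days whose null rate exceeds mean + k*stddev
    of the preceding `lookback` days.  min_days suppresses flags until
    enough baseline history exists (a small-sample false-positive guard)."""
    rates = [(d, (n / t) if t else 0.0) for d, n, t in daily]
    flagged = []
    for i, (d, r) in enumerate(rates):
        window = [x for _, x in rates[max(0, i - lookback):i]]
        if len(window) < min_days:
            continue  # baseline too short for a trustworthy threshold
        mu, sigma = mean(window), pstdev(window)
        if r > mu + k * sigma:
            flagged.append(d)
    return flagged
```

Note that a perfectly stable baseline has stddev 0, so any increase at all would flag; adding a small absolute floor to the threshold is a common refinement against that false-positive mode.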
Hard · System Design
Design a system to version datasets and enable reproducible training: include storage choices for snapshots, metadata catalog, hooks for capturing training inputs, and a mechanism to rehydrate a dataset for retraining while controlling storage costs.
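A core building block in such a design is a content-addressed snapshot manifest: the catalog stores only file hashes, so unchanged files are deduplicated across snapshots (controlling storage cost), and the snapshot id is derived deterministically from the contents, so rehydrating the same id always yields the same training inputs. A minimal sketch (the manifest schema here is an illustrative assumption, not any particular catalog's format):

```python
import hashlib
import json

def snapshot_manifest(files, parent=None):
    """files: {logical_path: bytes}.  Returns a manifest whose
    snapshot_id is a hash over the sorted per-file content hashes,
    with an optional parent pointer forming a lineage chain."""
    entries = {path: hashlib.sha256(data).hexdigest()
               for path, data in files.items()}
    snapshot_id = hashlib.sha256(
        json.dumps(entries, sort_keys=True).encode()).hexdigest()
    return {"snapshot_id": snapshot_id, "parent": parent, "files": entries}
```

A training run would record the snapshot_id alongside its model artifacts; retraining rehydrates by resolving each file hash against the object store, fetching only blobs not already cached.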
