
Data Quality and Anomaly Detection Questions

Focuses on identifying, diagnosing, and preventing data issues that produce misleading or incorrect metrics. Topics include spotting duplicates, missing values, schema drift, logical inconsistencies, extreme outliers caused by instrumentation bugs, data latency and pipeline failures, and reconciliation differences between sources. Covers validation strategies such as data tests, checksums, row counts, data contracts, invariants, and automated alerting for quality metrics like completeness, accuracy, and timeliness. Also addresses investigation workflows to determine whether anomalies are data problems versus true business signals, documenting remediation steps, and collaborating with engineering and product teams to fix upstream causes.

Hard · System Design
91 practiced
Design an enterprise-grade data quality platform to monitor 1000+ tables across both batch and streaming pipelines. Your design should cover metadata ingestion, rule catalog and execution, lineage, time-series metrics store, alerting/escalation, integration with ticketing, and multi-tenant concerns. Assume streaming throughput up to 10M events/sec and batch data ~2 TB/day.
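One building block of such a platform is the rule catalog and executor. A minimal sketch (all names here are hypothetical, and a real executor would push computation down to the warehouse rather than pull rows into memory):

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical minimal rule catalog: each rule names a table, a metric
# to compute over its rows, and a predicate that decides pass/fail.
@dataclass
class Rule:
    table: str
    name: str
    metric: Callable[[list[dict]], float]   # computes a quality metric
    passes: Callable[[float], bool]         # threshold check on that metric

def run_rules(catalog: list[Rule], tables: dict[str, list[dict]]):
    """Execute every rule against its table; return (table, rule, value, ok) tuples."""
    results = []
    for rule in catalog:
        rows = tables.get(rule.table, [])
        value = rule.metric(rows)
        results.append((rule.table, rule.name, value, rule.passes(value)))
    return results

# Example rule: null rate on orders.customer_id must stay under 1%.
catalog = [
    Rule(
        table="orders",
        name="customer_id_null_rate",
        metric=lambda rows: sum(r["customer_id"] is None for r in rows) / max(len(rows), 1),
        passes=lambda v: v < 0.01,
    )
]
tables = {"orders": [{"customer_id": 1}, {"customer_id": None}, {"customer_id": 2}]}
print(run_rules(catalog, tables))
```

In a production design the rule results would be written to the time-series metrics store and failures routed to the alerting/escalation layer.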
Easy · Technical
80 practiced
What metrics and indicators would you use to detect data latency for daily batch pipelines? Propose a simple alerting rule for: 'daily orders table not updated within 2 hours of expected arrival'. Explain measurement methods (latest ingest timestamp, row counts, max event timestamp) and brief pros/cons.
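One possible answer, sketched as code: compare the table's latest ingest timestamp against the expected arrival time plus a 2-hour grace window. The function and parameter names below are illustrative, not from any particular scheduler:

```python
from datetime import datetime, timedelta, timezone

def orders_table_is_late(latest_ingest: datetime,
                         expected_arrival: datetime,
                         now: datetime,
                         grace: timedelta = timedelta(hours=2)) -> bool:
    """Alert condition: the expected daily load has not landed
    and the grace window after expected arrival has elapsed."""
    return latest_ingest < expected_arrival and now > expected_arrival + grace

# The 10:00 UTC load hasn't landed by 12:30 -> alert fires.
expected = datetime(2024, 5, 1, 10, 0, tzinfo=timezone.utc)
last_seen = datetime(2024, 4, 30, 10, 5, tzinfo=timezone.utc)
print(orders_table_is_late(last_seen, expected, datetime(2024, 5, 1, 12, 30, tzinfo=timezone.utc)))
```

Passing `now` explicitly (rather than calling `datetime.now()` inside) keeps the rule deterministic and easy to test.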
Easy · Technical
92 practiced
Given a table `orders(order_id, customer_id, created_at, total_amount)`, write a SQL query (ANSI SQL) to identify likely duplicate orders where the same customer_id and total_amount appear within 1 minute of each other. Explain how you would reduce false positives and scale this check for a large historical table.
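One common answer is a self-join on customer and amount with a 60-second window. The sketch below runs it against an in-memory SQLite table; SQLite's `julianday()` stands in for ANSI interval arithmetic, so the date math would differ on other engines:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE orders (
    order_id INTEGER, customer_id INTEGER, created_at TEXT, total_amount REAL)""")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, 100, "2024-05-01 10:00:00", 49.99),
        (2, 100, "2024-05-01 10:00:30", 49.99),  # likely duplicate of order 1
        (3, 100, "2024-05-01 12:00:00", 49.99),  # same amount, far apart in time
        (4, 200, "2024-05-01 10:00:10", 19.50),
    ],
)

# Self-join: same customer, same amount, distinct orders, within 60 seconds.
# a.order_id < b.order_id reports each candidate pair only once.
dupes = conn.execute("""
    SELECT a.order_id, b.order_id
    FROM orders a
    JOIN orders b
      ON a.customer_id = b.customer_id
     AND a.total_amount = b.total_amount
     AND a.order_id < b.order_id
     AND ABS(julianday(a.created_at) - julianday(b.created_at)) * 86400.0 <= 60
""").fetchall()
print(dupes)  # -> [(1, 2)]
```

To scale this on a large table, you would typically partition by date and customer so the join only compares nearby rows, and add attributes (e.g. item list hash) to cut false positives.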
Hard · Technical
73 practiced
Propose an automated remediation framework for common data-quality issues: missing partitions, negative amounts, and duplicate events. For each issue describe automatic fix logic (if safe), backfill strategy, risk assessment, audit logging, and criteria for when to require human review instead of auto-fixing.
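The triage decision at the heart of such a framework can be sketched as a small policy function. Everything here (issue names, the row-count threshold, the action labels) is a hypothetical illustration of the pattern, not a prescribed implementation:

```python
# Auto-fix only when the fix is idempotent/reversible and the blast
# radius is small; otherwise route the issue to human review.
SAFE_AUTO_FIXES = {
    "duplicate_events":  "dedupe_by_event_id",  # idempotent; reversible via audit log
    "missing_partition": "trigger_backfill",    # re-runs an idempotent load
}

def remediation_action(issue_type: str, affected_rows: int,
                       auto_fix_row_limit: int = 10_000) -> str:
    """Return the action to take for a detected data-quality issue."""
    if issue_type in SAFE_AUTO_FIXES and affected_rows <= auto_fix_row_limit:
        return SAFE_AUTO_FIXES[issue_type]
    # Negative amounts land here: the correct fix depends on business
    # semantics (refund? bug?), so a human must decide.
    return "open_ticket_for_human_review"

print(remediation_action("duplicate_events", 500))   # -> dedupe_by_event_id
print(remediation_action("negative_amounts", 3))     # -> open_ticket_for_human_review
```

In a full answer, every branch (including auto-fixes) would also emit an audit-log entry recording what was changed and why.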
Easy · Technical
75 practiced
Explain what a data contract is in plain language. Show a compact example of a data contract for a `product_catalog` topic that specifies field names, types, required vs optional, and one invariant (e.g., price >= 0). Explain how this contract helps downstream analysts.
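A compact answer might express the contract as a schema plus invariants, with a validator that downstream consumers can run. The field set and validator below are an assumed example shape, not a standard format:

```python
# Hypothetical data contract for a `product_catalog` topic:
# field names, types, required vs optional, and one invariant.
PRODUCT_CATALOG_CONTRACT = {
    "fields": {
        "product_id": {"type": str,   "required": True},
        "name":       {"type": str,   "required": True},
        "price":      {"type": float, "required": True},
        "category":   {"type": str,   "required": False},
    },
    "invariants": [("price_non_negative", lambda rec: rec["price"] >= 0)],
}

def validate(record: dict, contract: dict) -> list[str]:
    """Return a list of contract violations (empty means the record conforms)."""
    errors = []
    for field, spec in contract["fields"].items():
        if field not in record:
            if spec["required"]:
                errors.append(f"missing required field: {field}")
            continue
        if not isinstance(record[field], spec["type"]):
            errors.append(f"wrong type for {field}")
    for name, check in contract["invariants"]:
        try:
            if not check(record):
                errors.append(f"invariant violated: {name}")
        except KeyError:
            pass  # a missing field was already reported above
    return errors

print(validate({"product_id": "p1", "name": "Mug", "price": -2.0},
               PRODUCT_CATALOG_CONTRACT))
```

For downstream analysts, the value is that violations surface at the producer boundary, before bad records reach dashboards.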
