Data Validation and Anomaly Detection Questions
Techniques for validating data quality and detecting anomalies using SQL: identifying nulls and missing values, finding duplicates and orphan records, range checks, sanity checks across aggregates, distribution checks, outlier-detection heuristics, reconciliation queries across systems, and building SQL-based alerts and integrity checks. Includes strategies for writing repeatable validation queries, comparing row counts and sums across pipelines, and documenting assumptions for investigative analysis.
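For example, a repeatable reconciliation query comparing row counts and sums across two stages of a pipeline might look like the following sketch (Postgres-style SQL; the table and column names `source_orders`, `warehouse_orders`, `order_date`, and `amount` are placeholders):

```sql
-- Compare per-day row counts and amount totals between an upstream
-- table and its downstream copy; any row returned is a discrepancy.
WITH src AS (
    SELECT order_date, COUNT(*) AS row_count, SUM(amount) AS total_amount
    FROM source_orders
    GROUP BY order_date
),
tgt AS (
    SELECT order_date, COUNT(*) AS row_count, SUM(amount) AS total_amount
    FROM warehouse_orders
    GROUP BY order_date
)
SELECT
    COALESCE(src.order_date, tgt.order_date) AS order_date,
    src.row_count    AS source_rows,
    tgt.row_count    AS target_rows,
    src.total_amount AS source_amount,
    tgt.total_amount AS target_amount
FROM src
FULL OUTER JOIN tgt ON src.order_date = tgt.order_date
WHERE src.row_count IS DISTINCT FROM tgt.row_count
   OR src.total_amount IS DISTINCT FROM tgt.total_amount;
```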
Medium · Technical
Explain how you'd define and compute a Data Quality Index (DQI) that aggregates multiple checks (null rate, duplicate rate, schema drift, freshness) for a dataset. What weighting scheme would you choose, how would you surface the DQI over time, and how would you set thresholds for action?
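One possible shape of such an index, assuming a hypothetical `dq_check_results` table that upstream checks populate with one row per table per run (numeric rates, a boolean `schema_drift_flag`, and `hours_since_load`), is a weighted sum of normalized scores; the weights below are illustrative only:

```sql
-- DQI in [0, 1]: higher is better. Weights should be agreed with the
-- consumers of the data, not hard-coded by the platform team.
SELECT
    table_name,
    run_date,
      0.35 * (1 - LEAST(null_rate, 1.0))
    + 0.35 * (1 - LEAST(duplicate_rate, 1.0))
    + 0.15 * (CASE WHEN schema_drift_flag THEN 0 ELSE 1 END)
    + 0.15 * (CASE WHEN hours_since_load <= 24 THEN 1 ELSE 0 END) AS dqi
FROM dq_check_results
ORDER BY table_name, run_date;
```

Surfacing the DQI over time is then a matter of charting `dqi` by `run_date` and alerting when it drops below an agreed threshold or degrades sharply between runs.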
Medium · Technical
You suspect duplicate user records in a `users` table with columns (user_id PK, email, full_name, phone, created_at). Write a SQL query (using window functions) to identify suspected duplicate groups by email (case-insensitive), and also flag cases where the email differs but other fields match closely (e.g., same full_name and same phone). Provide output columns: group_id, representative_user_id, duplicate_count, sample_user_ids.
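A sketch of the email-grouping half of this question, in Postgres-style SQL (the fuzzy `full_name`/`phone` match would be layered on separately), might look like:

```sql
-- Group users by lower-cased email and report groups with more than
-- one member, using window functions for the per-group statistics.
WITH grouped AS (
    SELECT
        user_id,
        LOWER(email) AS email_key,
        COUNT(*)     OVER (PARTITION BY LOWER(email)) AS duplicate_count,
        MIN(user_id) OVER (PARTITION BY LOWER(email)) AS representative_user_id
    FROM users
)
SELECT
    DENSE_RANK() OVER (ORDER BY email_key)          AS group_id,
    representative_user_id,
    duplicate_count,
    STRING_AGG(user_id::text, ',' ORDER BY user_id) AS sample_user_ids
FROM grouped
WHERE duplicate_count > 1
GROUP BY email_key, representative_user_id, duplicate_count;
```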
Medium · Technical
Implement a Spark (PySpark) job that deduplicates a large orders dataset on `order_id`, keeping the row with the latest `updated_at`. Provide the essential transformation code (assume the DataFrame API), explain partitioning and shuffle considerations, and describe how to validate the deduplication using checksums or counts before and after.
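A minimal sketch of the dedup transformation, assuming an `orders` DataFrame with `order_id` and `updated_at` columns and placeholder storage paths:

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dedupe_orders").getOrCreate()
orders = spark.read.parquet("s3://example-bucket/orders/")  # placeholder path

# Rank rows within each order_id by recency; row_number() keeps exactly
# one row per key even when updated_at values tie.
w = Window.partitionBy("order_id").orderBy(F.col("updated_at").desc())
deduped = (
    orders
    .withColumn("rn", F.row_number().over(w))
    .filter(F.col("rn") == 1)
    .drop("rn")
)

# Cheap validation: distinct keys going in should equal rows coming out.
assert orders.select("order_id").distinct().count() == deduped.count()

deduped.write.mode("overwrite").parquet("s3://example-bucket/orders_deduped/")
```

The window partitioning shuffles the dataset by `order_id`, which dominates the job's cost on large inputs; that shuffle and the key skew it exposes are the main partitioning considerations to discuss.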
Medium · Technical
Design a small test harness for data quality checks that runs locally or in CI: what components (sample data, assertions, runners, reporting) would you include, and how would you mock external data sources? Provide an example test case that asserts null rate < 1% on a column for a sample dataset.
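One way the example test case might be written, using pytest and pandas with a checked-in fixture file standing in for the mocked source (the path and column name are placeholders):

```python
import pandas as pd


def null_rate(df: pd.DataFrame, column: str) -> float:
    """Fraction of rows where `column` is null."""
    return df[column].isna().mean()


def test_email_null_rate_below_one_percent():
    # In CI this small fixture is versioned with the tests and stands
    # in for the real (mocked) data source.
    sample = pd.read_csv("tests/fixtures/users_sample.csv")
    assert null_rate(sample, "email") < 0.01
```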
Hard · Technical
You operate a pipeline that switches from batch to micro-batch ingestion. How would you ensure data quality checks continue to provide meaningful alerts with smaller, frequent batches? Discuss windowing, aggregation latency, and strategies to avoid alert flapping.
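For instance, one common anti-flapping tactic is to alert on a rolling aggregate of micro-batch results rather than on each batch; a rough Postgres-style sketch, assuming a hypothetical `batch_check_results` table with (batch_ts, rows_loaded, null_rows):

```sql
-- Rolling one-hour null rate across micro-batches; alert only when
-- this aggregate breaches its threshold, not on any single batch.
SELECT
    batch_ts,
    SUM(null_rows) OVER w * 1.0
        / NULLIF(SUM(rows_loaded) OVER w, 0) AS rolling_null_rate
FROM batch_check_results
WINDOW w AS (
    ORDER BY batch_ts
    RANGE BETWEEN INTERVAL '1 hour' PRECEDING AND CURRENT ROW
);
```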