Data Validation and Anomaly Detection Questions
Techniques for validating data quality and detecting anomalies using SQL: identifying nulls and missing values, finding duplicates and orphan records, range checks, sanity checks across aggregates, distribution checks, outlier-detection heuristics, reconciliation queries across systems, and building SQL-based alerts and integrity checks. Includes strategies for writing repeatable validation queries, comparing row counts and sums across pipelines, and documenting assumptions for investigative analysis.
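For example, a repeatable reconciliation query comparing row counts and sums across two stages of a pipeline might look like the following sketch (Postgres-style SQL; the table and column names `source_orders`, `warehouse_orders`, `order_date`, and `amount` are placeholders):

```sql
-- Compare per-day row counts and amount totals between an upstream
-- table and its downstream copy; any row returned is a discrepancy.
WITH src AS (
    SELECT order_date, COUNT(*) AS row_count, SUM(amount) AS total_amount
    FROM source_orders
    GROUP BY order_date
),
tgt AS (
    SELECT order_date, COUNT(*) AS row_count, SUM(amount) AS total_amount
    FROM warehouse_orders
    GROUP BY order_date
)
SELECT
    COALESCE(src.order_date, tgt.order_date) AS order_date,
    src.row_count    AS source_rows,
    tgt.row_count    AS target_rows,
    src.total_amount AS source_amount,
    tgt.total_amount AS target_amount
FROM src
FULL OUTER JOIN tgt ON src.order_date = tgt.order_date
WHERE src.row_count IS DISTINCT FROM tgt.row_count
   OR src.total_amount IS DISTINCT FROM tgt.total_amount;
```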
Medium · Technical
Explain how you'd define and compute a Data Quality Index (DQI) that aggregates multiple checks (null rate, duplicate rate, schema drift, freshness) for a dataset. What weighting scheme would you choose, how would you surface the DQI over time, and how would you set thresholds for action?
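One possible shape of such an index, assuming a hypothetical `dq_check_results` table that upstream checks populate with one row per table per run (numeric rates, a boolean `schema_drift_flag`, and `hours_since_load`), is a weighted sum of normalized scores; the weights below are illustrative only:

```sql
-- DQI in [0, 1]: higher is better. Weights should be agreed with the
-- consumers of the data, not hard-coded by the platform team.
SELECT
    table_name,
    run_date,
      0.35 * (1 - LEAST(null_rate, 1.0))
    + 0.35 * (1 - LEAST(duplicate_rate, 1.0))
    + 0.15 * (CASE WHEN schema_drift_flag THEN 0 ELSE 1 END)
    + 0.15 * (CASE WHEN hours_since_load <= 24 THEN 1 ELSE 0 END) AS dqi
FROM dq_check_results
ORDER BY table_name, run_date;
```

Surfacing the DQI over time is then a matter of charting `dqi` by `run_date` and alerting when it drops below an agreed threshold or degrades sharply between runs.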
Medium · Technical
You suspect duplicate user records in a `users` table with columns (user_id PK, email, full_name, phone, created_at). Write a SQL query (using window functions) to identify suspected duplicate groups by email (case-insensitive), and also flag cases where the email differs but other fields match closely (e.g., same full_name and same phone). Provide output columns: group_id, representative_user_id, duplicate_count, sample_user_ids.
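A sketch of the email-grouping half of this question, in Postgres-style SQL (the fuzzy `full_name`/`phone` match would be layered on separately), might look like:

```sql
-- Group users by lower-cased email and report groups with more than
-- one member, using window functions for the per-group statistics.
WITH grouped AS (
    SELECT
        user_id,
        LOWER(email) AS email_key,
        COUNT(*)     OVER (PARTITION BY LOWER(email)) AS duplicate_count,
        MIN(user_id) OVER (PARTITION BY LOWER(email)) AS representative_user_id
    FROM users
)
SELECT
    DENSE_RANK() OVER (ORDER BY email_key)          AS group_id,
    representative_user_id,
    duplicate_count,
    STRING_AGG(user_id::text, ',' ORDER BY user_id) AS sample_user_ids
FROM grouped
WHERE duplicate_count > 1
GROUP BY email_key, representative_user_id, duplicate_count;
```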
Medium · Technical
Implement a Spark (PySpark) job that deduplicates a large orders dataset on `order_id`, keeping the row with the latest `updated_at`. Provide the essential transformation code (assume the DataFrame API), explain partitioning and shuffle considerations, and describe how to validate the deduplication using checksums or counts before and after.
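A minimal sketch of the dedup transformation, assuming an `orders` DataFrame with `order_id` and `updated_at` columns and placeholder storage paths:

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dedupe_orders").getOrCreate()
orders = spark.read.parquet("s3://example-bucket/orders/")  # placeholder path

# Rank rows within each order_id by recency; row_number() keeps exactly
# one row per key even when updated_at values tie.
w = Window.partitionBy("order_id").orderBy(F.col("updated_at").desc())
deduped = (
    orders
    .withColumn("rn", F.row_number().over(w))
    .filter(F.col("rn") == 1)
    .drop("rn")
)

# Cheap validation: distinct keys going in should equal rows coming out.
assert orders.select("order_id").distinct().count() == deduped.count()

deduped.write.mode("overwrite").parquet("s3://example-bucket/orders_deduped/")
```

The window partitioning shuffles the dataset by `order_id`, which dominates the job's cost on large inputs; that shuffle and the key skew it exposes are the main partitioning considerations to discuss.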
Medium · Technical
Design a small test harness for data quality checks that runs locally or in CI: what components (sample data, assertions, runners, reporting) would you include, and how would you mock external data sources? Provide an example test case that asserts null rate < 1% on a column for a sample dataset.
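One way the example test case might be written, using pytest and pandas with a checked-in fixture file standing in for the mocked source (the path and column name are placeholders):

```python
import pandas as pd


def null_rate(df: pd.DataFrame, column: str) -> float:
    """Fraction of rows where `column` is null."""
    return df[column].isna().mean()


def test_email_null_rate_below_one_percent():
    # In CI this small fixture is versioned with the tests and stands
    # in for the real (mocked) data source.
    sample = pd.read_csv("tests/fixtures/users_sample.csv")
    assert null_rate(sample, "email") < 0.01
```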
Hard · Technical
You operate a pipeline that switches from batch to micro-batch ingestion. How would you ensure data quality checks continue to provide meaningful alerts with smaller, frequent batches? Discuss windowing, aggregation latency, and strategies to avoid alert flapping.
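For instance, one common anti-flapping tactic is to alert on a rolling aggregate of micro-batch results rather than on each batch; a rough Postgres-style sketch, assuming a hypothetical `batch_check_results` table with (batch_ts, rows_loaded, null_rows):

```sql
-- Rolling one-hour null rate across micro-batches; alert only when
-- this aggregate breaches its threshold, not on any single batch.
SELECT
    batch_ts,
    SUM(null_rows) OVER w * 1.0
        / NULLIF(SUM(rows_loaded) OVER w, 0) AS rolling_null_rate
FROM batch_check_results
WINDOW w AS (
    ORDER BY batch_ts
    RANGE BETWEEN INTERVAL '1 hour' PRECEDING AND CURRENT ROW
);
```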