Data Validation and Anomaly Detection Questions
Techniques for validating data quality and detecting anomalies using SQL: identifying nulls and missing values, finding duplicates and orphan records, range checks, sanity checks across aggregates, distribution checks, outlier detection heuristics, reconciliation queries across systems, and building SQL based alerts and integrity checks. Includes strategies for writing repeatable validation queries, comparing row counts and sums across pipelines, and documenting assumptions for investigative analysis.
MediumTechnical
0 practiced
You suspect duplicate user records in a `users` table with columns (user_id PK, email, full_name, created_at). Write a SQL query (use window functions) to identify suspected duplicate groups by email (case-insensitive) and also flag cases where email differs but other fields match closely (e.g., same full_name and same phone). Provide output: group_id, representative_user_id, duplicate_count, sample_user_ids.
EasyTechnical
0 practiced
Write a SQL query to detect sudden changes in distribution between two months for a categorical column `payment_method`. Report the percentage point change per category and flag categories with absolute change > 10 percentage points. Assume table `payments(payment_id, dt, payment_method)`.
EasyTechnical
0 practiced
Write a SQL query to detect rows with inconsistent timestamps across systems: compare `source_events(event_id, ts)` to `warehouse_events(event_id, ts_ingested)` and report events where ABS(EXTRACT(EPOCH FROM ts - ts_ingested)) > 3600 (1 hour). Also propose how to handle timezone and clock skew issues.
MediumTechnical
0 practiced
Provide a plan to version and document data validation rules (SQL checks) used by your team. What metadata should be stored (owner, severity, rationale, thresholds), how to integrate with code review, and how to track changes over time and rollbacks?
MediumTechnical
0 practiced
Create a SQL-based integrity check that ensures referential integrity for a denormalized analytics table: `orders_flat(order_id, customer_id, customer_email, customer_signup_date)` where customer_* originates from `customers`. Describe how to detect stale denormalized attributes and produce a reconciliation report.
Unlock Full Question Bank
Get access to hundreds of Data Validation and Anomaly Detection interview questions and detailed answers.
Sign in to ContinueJoin thousands of developers preparing for their dream job.