Data Validation and Anomaly Detection Questions
Techniques for validating data quality and detecting anomalies using SQL: identifying NULLs and missing values, finding duplicates and orphan records, range checks, sanity checks across aggregates, distribution checks, outlier-detection heuristics, reconciliation queries across systems, and building SQL-based alerts and integrity checks. Also covers strategies for writing repeatable validation queries, comparing row counts and sums across pipelines, and documenting assumptions for investigative analysis.
Medium · Technical
You need to reconcile transactions across systems A and B where amounts are stored in different currencies and rounding rules differ. Propose an SQL-based reconciliation approach that matches by transaction_id when present, falls back to fuzzy matching (user_id + date + amount within tolerance), applies historical FX conversion, accounts for rounding tolerances, and emits unmatched records and summary deltas.
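A sketch of one possible answer, in PostgreSQL. The table names txn_a/txn_b, their surrogate keys a_pk/b_pk, and the fx_rates(currency, rate_date, rate) lookup are assumptions for illustration; the 0.01 post-conversion tolerance stands in for whatever the rounding rules justify:

-- Normalize both sides to a common currency with the historical rate
-- for the transaction date, rounding once after conversion.
WITH a_norm AS (
  SELECT a.a_pk, a.transaction_id, a.user_id,
         a.occurred_at::date AS txn_date,
         ROUND(a.amount * fx.rate, 2) AS amount_conv
  FROM txn_a a
  JOIN fx_rates fx ON fx.currency = a.currency
                  AND fx.rate_date = a.occurred_at::date
),
b_norm AS (
  SELECT b.b_pk, b.transaction_id, b.user_id,
         b.occurred_at::date AS txn_date,
         ROUND(b.amount * fx.rate, 2) AS amount_conv
  FROM txn_b b
  JOIN fx_rates fx ON fx.currency = b.currency
                  AND fx.rate_date = b.occurred_at::date
),
-- Pass 1: exact match on transaction_id, amounts within tolerance.
-- NULL transaction_ids never join here and fall through to pass 2.
exact AS (
  SELECT a.a_pk, b.b_pk
  FROM a_norm a
  JOIN b_norm b USING (transaction_id)
  WHERE ABS(a.amount_conv - b.amount_conv) <= 0.01
),
-- Pass 2: fuzzy match on (user_id, date, amount within tolerance);
-- DISTINCT ON keeps the closest B candidate per unmatched A row.
fuzzy AS (
  SELECT DISTINCT ON (a.a_pk) a.a_pk, b.b_pk
  FROM a_norm a
  JOIN b_norm b ON b.user_id = a.user_id
               AND b.txn_date = a.txn_date
               AND ABS(b.amount_conv - a.amount_conv) <= 0.01
  WHERE NOT EXISTS (SELECT 1 FROM exact e WHERE e.a_pk = a.a_pk)
  ORDER BY a.a_pk, ABS(b.amount_conv - a.amount_conv)
)
-- Unmatched A-side records; a symmetric query covers the B side, and
-- aggregating over exact/fuzzy yields the summary deltas.
SELECT a.*
FROM a_norm a
WHERE NOT EXISTS (SELECT 1 FROM exact e WHERE e.a_pk = a.a_pk)
  AND NOT EXISTS (SELECT 1 FROM fuzzy f WHERE f.a_pk = a.a_pk);

Note the fuzzy pass is greedy: two A rows can claim the same B row. A production version would also de-duplicate on the B side, e.g. rank candidate pairs and take each B row at most once.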
Easy · Technical
Write a PostgreSQL query to identify columns in the events table that have more than 5% NULL or missing values over the last 30 days (occurred_at >= current_date - 30). Return: column_name, null_count, total_count, null_percentage. Table schema:
events(
event_id bigint PRIMARY KEY,
user_id bigint,
event_type text,
amount numeric,
occurred_at timestamp
)
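One shape the answer could take, using PostgreSQL's COUNT(*) FILTER plus a VALUES unpivot. occurred_at is omitted from the checks because rows where it is NULL cannot pass the date filter in the first place:

WITH counts AS (
  SELECT COUNT(*) AS total_count,
         COUNT(*) FILTER (WHERE user_id    IS NULL) AS user_id_nulls,
         COUNT(*) FILTER (WHERE event_type IS NULL) AS event_type_nulls,
         COUNT(*) FILTER (WHERE amount     IS NULL) AS amount_nulls
  FROM events
  WHERE occurred_at >= current_date - 30
)
SELECT v.column_name,
       v.null_count,
       c.total_count,
       ROUND(100.0 * v.null_count / NULLIF(c.total_count, 0), 2)
         AS null_percentage
FROM counts c
CROSS JOIN LATERAL (VALUES
  ('user_id',    c.user_id_nulls),
  ('event_type', c.event_type_nulls),
  ('amount',     c.amount_nulls)
) AS v(column_name, null_count)
WHERE 100.0 * v.null_count / NULLIF(c.total_count, 0) > 5;

Listing columns by hand is fine for a fixed schema; to cover arbitrary tables, the same query can be generated from information_schema.columns.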
Medium · Technical
Write a SQL query that flags days with unusually high transaction counts using a rolling 14-day mean and standard deviation. Table: transactions(transaction_id, occurred_at date). Detect days where count > mean + 3 * stddev computed over the prior 14 days (exclude current day from baseline). Return date, count, rolling_mean, rolling_stddev, z_score, is_spike.
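A sketch under the stated frame. It assumes days with zero transactions may be absent from the table; a calendar join via generate_series would make the ROWS window align with true calendar days:

WITH daily AS (
  SELECT occurred_at AS day, COUNT(*) AS cnt
  FROM transactions
  GROUP BY occurred_at
),
stats AS (
  SELECT day, cnt,
         AVG(cnt)         OVER w AS rolling_mean,
         STDDEV_SAMP(cnt) OVER w AS rolling_stddev
  FROM daily
  -- Baseline is the prior 14 rows only; the current day is excluded,
  -- so a spike does not inflate its own baseline.
  WINDOW w AS (ORDER BY day ROWS BETWEEN 14 PRECEDING AND 1 PRECEDING)
)
SELECT day AS date,
       cnt AS txn_count,
       rolling_mean,
       rolling_stddev,
       (cnt - rolling_mean) / NULLIF(rolling_stddev, 0) AS z_score,
       (cnt > rolling_mean + 3 * rolling_stddev)        AS is_spike
FROM stats
ORDER BY day;

The first days lack a full baseline, so their stddev (and hence z_score and is_spike) comes back NULL; whether to suppress or surface those rows is a policy choice worth stating in the answer.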
Medium · Technical
Implement a Python function detect_mad_outliers(values: List[float], threshold: float = 3.5) -> List[int] that returns indices of outliers using Median Absolute Deviation (MAD). The function should ignore NaNs in its computations but return indices relative to the original list, and handle small sample sizes gracefully. Include a short docstring describing complexity and behavior.
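One possible implementation using only the standard library; the 0.6745 constant is the usual scaling that makes MAD comparable to a standard deviation under normality (the modified z-score of Iglewicz and Hoaglin):

import math
from typing import List

def detect_mad_outliers(values: List[float], threshold: float = 3.5) -> List[int]:
    """Return indices of outliers via the MAD-based modified z-score.

    NaNs are ignored when computing the median and MAD, but returned
    indices refer to positions in the original list. O(n log n) time
    (sorting), O(n) space. Returns [] when fewer than 3 non-NaN values
    exist, or when MAD == 0 (no spread to measure deviations against).
    """
    clean = [(i, v) for i, v in enumerate(values) if not math.isnan(v)]
    if len(clean) < 3:
        return []

    def median(xs: List[float]) -> float:
        s = sorted(xs)
        mid = len(s) // 2
        return s[mid] if len(s) % 2 else (s[mid - 1] + s[mid]) / 2.0

    med = median([v for _, v in clean])
    mad = median([abs(v - med) for _, v in clean])
    if mad == 0:
        return []

    # Flag values whose modified z-score exceeds the threshold.
    return [i for i, v in clean if abs(0.6745 * (v - med) / mad) > threshold]

Treating MAD == 0 as "no outliers" is one convention; an equally defensible one flags every value unequal to the median, and the docstring is the place to commit to either.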
Hard · Technical
In distributed pipelines, different consumers can process events in different orders, producing divergent aggregates. Propose reconciliation algorithms that are tolerant to ordering differences and suitable for large datasets. Discuss the use of event-time windows, commutative/associative aggregations, watermarking, and compensation records for corrections.
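As a concrete sketch of the order-insensitive comparison: bucket each consumer's output into event-time windows, aggregate with commutative/associative functions (COUNT, SUM), and only compare windows older than a watermark. The table names events_consumer_a/events_consumer_b and the two-hour watermark are assumptions for illustration:

WITH a AS (
  SELECT date_trunc('hour', event_time) AS window_start,
         COUNT(*) AS cnt, SUM(amount) AS total
  FROM events_consumer_a
  GROUP BY 1
),
b AS (
  SELECT date_trunc('hour', event_time) AS window_start,
         COUNT(*) AS cnt, SUM(amount) AS total
  FROM events_consumer_b
  GROUP BY 1
)
SELECT window_start,
       a.cnt   AS cnt_a,   b.cnt   AS cnt_b,
       COALESCE(a.total, 0) - COALESCE(b.total, 0) AS sum_delta
FROM a
FULL OUTER JOIN b USING (window_start)
-- Windows newer than the watermark may still receive late events on
-- either side, so they are not compared yet.
WHERE window_start < now() - interval '2 hours'
  AND (a.cnt IS DISTINCT FROM b.cnt
       OR a.total IS DISTINCT FROM b.total)
ORDER BY window_start;

Because COUNT and SUM are commutative and associative, ordering differences within a window cannot produce a delta; any row this query returns points to missing or duplicated events, which a compensation record can then correct downstream.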