InterviewStack.io

Data Validation and Anomaly Detection Questions

Techniques for validating data quality and detecting anomalies using SQL: identifying nulls and missing values, finding duplicates and orphan records, range checks, sanity checks across aggregates, distribution checks, outlier detection heuristics, reconciliation queries across systems, and building SQL-based alerts and integrity checks. Includes strategies for writing repeatable validation queries, comparing row counts and sums across pipelines, and documenting assumptions for investigative analysis.

Medium | Technical
Create an ETL test plan for a new daily pipeline that ingests transaction CSVs into a dimensional schema. Include unit tests, integration tests, regression tests, acceptance criteria, sample data fixtures, and how to automate these tests in CI/CD so that merges cannot be deployed if a data-quality test fails.
Easy | Technical
Describe how z-score based outlier detection works for a numeric column and outline a simple SQL implementation to compute z-scores and flag outliers. State key assumptions behind z-score detection, when it is appropriate, and at least two reasons you might prefer an IQR-based method instead.
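As a starting point for the SQL part of this question, here is a minimal sketch that runs a z-score query against an in-memory SQLite database from Python. The `readings` table, its columns, the sample values, and the 2.5 threshold are all hypothetical. SQLite has no built-in standard deviation aggregate, so the sketch derives the population variance as E[x^2] - E[x]^2 and registers a `sqrt` function from Python; a warehouse with `STDDEV_POP` could use that instead.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
# Not every SQLite build ships math functions, so register sqrt explicitly.
conn.create_function("sqrt", 1, math.sqrt)

# Hypothetical table and sample data: nine ordinary readings and one extreme one.
conn.execute("CREATE TABLE readings (id INTEGER PRIMARY KEY, value REAL)")
values = [10, 11, 9, 10, 12, 10, 11, 9, 10, 60]
conn.executemany("INSERT INTO readings (value) VALUES (?)", [(v,) for v in values])

ZSCORE_SQL = """
WITH stats AS (
    SELECT AVG(value) AS mean,
           -- population standard deviation; MAX(..., 0) guards against
           -- a tiny negative variance caused by floating-point rounding
           sqrt(MAX(AVG(value * value) - AVG(value) * AVG(value), 0)) AS sd
    FROM readings
)
SELECT r.id, r.value, (r.value - s.mean) / s.sd AS z
FROM readings AS r
CROSS JOIN stats AS s
WHERE s.sd > 0                                   -- a constant column has no z-scores
  AND ABS((r.value - s.mean) / s.sd) > :threshold
ORDER BY r.id
"""

outliers = conn.execute(ZSCORE_SQL, {"threshold": 2.5}).fetchall()
for row_id, value, z in outliers:
    print(row_id, value, round(z, 3))
```

With population standard deviation, no z-score in a sample of n rows can exceed sqrt(n - 1), so in this ten-row example the conventional threshold of 3 would flag nothing: the extreme value inflates the very mean and standard deviation it is measured against. That masking effect is one reason an IQR-based method may be preferable on small or heavily skewed data.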
Medium | Technical
Write a PostgreSQL query to flag outliers in sales.sale_amount using the IQR method. Compute Q1 and Q3, derive IQR, and then select rows where sale_amount < Q1 - 1.5 * IQR or sale_amount > Q3 + 1.5 * IQR. Also show how you would modify the query to compute IQR per product category and how to handle small partitions with fewer than 30 rows.
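One possible shape for an answer is sketched below. The two SQL strings are PostgreSQL and are not executed here; they assume `sales` is a table with a `sale_amount` column, and the `category` column in the second query is a hypothetical name. The small Python helper mirrors the linear interpolation that `percentile_cont` performs, so the fence logic can be checked locally without a database. Skipping categories with fewer than 30 rows is only one way to handle small partitions; falling back to global bounds is another.

```python
import math

# PostgreSQL sketch (not executed here): global IQR fences.
IQR_OUTLIER_SQL = """
WITH bounds AS (
    SELECT percentile_cont(0.25) WITHIN GROUP (ORDER BY sale_amount) AS q1,
           percentile_cont(0.75) WITHIN GROUP (ORDER BY sale_amount) AS q3
    FROM sales
)
SELECT s.*
FROM sales AS s
CROSS JOIN bounds AS b
WHERE s.sale_amount < b.q1 - 1.5 * (b.q3 - b.q1)
   OR s.sale_amount > b.q3 + 1.5 * (b.q3 - b.q1);
"""

# PostgreSQL sketch (not executed here): per-category fences.
# `category` is an assumed column name; small categories are simply skipped.
IQR_PER_CATEGORY_SQL = """
WITH bounds AS (
    SELECT category,
           percentile_cont(0.25) WITHIN GROUP (ORDER BY sale_amount) AS q1,
           percentile_cont(0.75) WITHIN GROUP (ORDER BY sale_amount) AS q3
    FROM sales
    GROUP BY category
    HAVING COUNT(*) >= 30
)
SELECT s.*
FROM sales AS s
JOIN bounds AS b ON b.category = s.category
WHERE s.sale_amount < b.q1 - 1.5 * (b.q3 - b.q1)
   OR s.sale_amount > b.q3 + 1.5 * (b.q3 - b.q1);
"""


def percentile_cont(values, p):
    """Linearly interpolated percentile, as percentile_cont computes it."""
    ordered = sorted(values)
    pos = p * (len(ordered) - 1)          # zero-based fractional position
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = percentile_cont(values, 0.25)
    q3 = percentile_cont(values, 0.75)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


sample = [10, 12, 11, 13, 12, 11, 10, 500]   # hypothetical sale amounts
print(iqr_outliers(sample))
```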
Medium | Technical
Your analytics warehouse disables foreign key enforcement for performance. How would you validate referential integrity between orders and customers for a 100M-row orders table periodically? Describe SQL patterns, sampling strategies, incremental checks, and performance optimizations to detect or estimate the number of broken references efficiently.
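A common pattern here is an anti-join restricted to rows loaded since the last check. The sketch below runs it on an in-memory SQLite database so it can be tried locally; the `customer_id` join key, the `loaded_at` column, and the watermark values are assumptions rather than part of the question. It illustrates the query shape only and says nothing about performance at 100M rows, where partition pruning or sampling would matter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schemas and fixtures; no foreign key is declared, as in the scenario.
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER,
    loaded_at   TEXT            -- ISO dates, so string comparison orders correctly
);
INSERT INTO customers VALUES (1), (2), (3);
INSERT INTO orders VALUES
    (100, 1,    '2024-01-01'),
    (101, 2,    '2024-01-01'),
    (102, 9,    '2024-01-02'),  -- orphan: customer 9 does not exist
    (103, NULL, '2024-01-02'),  -- missing key: a null check, not an orphan
    (104, 7,    '2024-01-03');  -- orphan: customer 7 does not exist
""")

ORPHAN_SQL = """
SELECT o.order_id, o.customer_id
FROM orders AS o
WHERE o.customer_id IS NOT NULL
  AND o.loaded_at > :watermark          -- only rows newer than the last check
  AND NOT EXISTS (
      SELECT 1 FROM customers AS c WHERE c.customer_id = o.customer_id
  )
ORDER BY o.order_id
"""


def find_orphans(connection, watermark):
    """Return (order_id, customer_id) pairs with no matching customer."""
    return connection.execute(ORPHAN_SQL, {"watermark": watermark}).fetchall()


full_scan = find_orphans(conn, "")               # empty watermark: check everything
incremental = find_orphans(conn, "2024-01-02")   # only rows loaded after this date
print(full_scan, incremental)
```

An incremental check like this misses orders whose customer row is deleted later, so a periodic full or sampled scan would still be needed alongside it.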
Hard | Technical
You must reconcile financial flows across three related tables: payments(payment_id, invoice_id, amount), invoices(invoice_id, amount_due), and ledger_entries(entry_id, invoice_id, amount). Each source is updated with a different lag and may contain duplicates. Describe an approach and SQL patterns to reconcile these tables, allow for timing differences, compute net mismatches, and produce a prioritized action list to resolve discrepancies.
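The core of one possible approach (deduplicate, aggregate per invoice, compare, rank by size of mismatch) can be sketched as follows, again on an in-memory SQLite database. The fixture rows are hypothetical, amounts are stored as integers to avoid floating-point comparisons, and duplicates are assumed to be exact repeated rows. Timing differences are not handled: the schemas as given carry no timestamps, so a grace window would need a load or event time column that this sketch does not invent.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Table and column names follow the question; the rows are made-up fixtures.
conn.executescript("""
CREATE TABLE invoices (invoice_id INTEGER, amount_due INTEGER);
CREATE TABLE payments (payment_id INTEGER, invoice_id INTEGER, amount INTEGER);
CREATE TABLE ledger_entries (entry_id INTEGER, invoice_id INTEGER, amount INTEGER);
INSERT INTO invoices VALUES (1, 100), (2, 50), (3, 75);
INSERT INTO payments VALUES (10, 1, 100), (10, 1, 100), (11, 2, 30);  -- payment 10 is duplicated
INSERT INTO ledger_entries VALUES (20, 1, 100), (21, 2, 50);
""")

RECONCILE_SQL = """
WITH paid AS (
    SELECT d.invoice_id, SUM(d.amount) AS total_paid
    FROM (SELECT DISTINCT payment_id, invoice_id, amount FROM payments) AS d
    GROUP BY d.invoice_id
),
booked AS (
    SELECT d.invoice_id, SUM(d.amount) AS total_booked
    FROM (SELECT DISTINCT entry_id, invoice_id, amount FROM ledger_entries) AS d
    GROUP BY d.invoice_id
),
diff AS (
    SELECT i.invoice_id,
           i.amount_due - COALESCE(p.total_paid, 0)                 AS unpaid,
           COALESCE(p.total_paid, 0) - COALESCE(b.total_booked, 0)  AS unbooked
    FROM invoices AS i
    LEFT JOIN paid   AS p ON p.invoice_id = i.invoice_id
    LEFT JOIN booked AS b ON b.invoice_id = i.invoice_id
)
SELECT invoice_id, unpaid, unbooked
FROM diff
WHERE unpaid <> 0 OR unbooked <> 0
ORDER BY ABS(unpaid) + ABS(unbooked) DESC, invoice_id   -- largest mismatch first
"""

action_list = conn.execute(RECONCILE_SQL).fetchall()
for invoice_id, unpaid, unbooked in action_list:
    print(invoice_id, unpaid, unbooked)
```

Because the query starts from `invoices`, payments or ledger entries that reference an unknown invoice would not appear; a separate orphan check would be needed to catch those.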