Data Validation and Anomaly Detection Questions
Techniques for validating data quality and detecting anomalies using SQL: identifying nulls and missing values, finding duplicates and orphan records, range checks, sanity checks across aggregates, distribution checks, outlier detection heuristics, reconciliation queries across systems, and building SQL-based alerts and integrity checks. Includes strategies for writing repeatable validation queries, comparing row counts and sums across pipelines, and documenting assumptions for investigative analysis.
Easy | Technical
You have a Kafka topic of events that should carry a sequential event_id per user. Describe a lightweight SQL-based approach (for ksqlDB or BigQuery streaming SQL) to detect gaps in event_id per user in near real-time. Explain how you would handle out-of-order arrivals and late data, and how you would choose a tolerable lateness watermark.
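One way to reason about this question is to prototype the core logic in plain SQL first. The sketch below is an assumption-laden illustration, not ksqlDB or BigQuery syntax: it runs against SQLite through Python's `sqlite3`, and the table `events(user_id, event_id, event_ts)`, the epoch-second timestamps, and the function `find_gaps` are all hypothetical. It orders by event_id rather than arrival order, collapses redelivered events, and only reports a gap once the event after the gap is older than the allowed lateness.

```python
import sqlite3

# Hypothetical table: events(user_id, event_id, event_ts), event_ts in epoch seconds.
GAP_SQL = """
WITH deduped AS (
    -- Collapse redelivered events; keep the earliest timestamp seen.
    SELECT user_id, event_id, MIN(event_ts) AS event_ts
    FROM events
    GROUP BY user_id, event_id
),
ordered AS (
    -- Order by event_id, not arrival order, so out-of-order delivery is harmless.
    SELECT user_id, event_id, event_ts,
           LAG(event_id) OVER (PARTITION BY user_id ORDER BY event_id) AS prev_event_id
    FROM deduped
)
SELECT user_id, prev_event_id + 1 AS gap_start, event_id - 1 AS gap_end
FROM ordered
WHERE event_id - prev_event_id > 1
  -- Report a gap only when the event after it is older than the watermark,
  -- so the missing events have had the full lateness allowance to arrive.
  AND event_ts <= :watermark
ORDER BY user_id, gap_start
"""


def find_gaps(conn, now_ts, allowed_lateness_s):
    """Return (user_id, gap_start, gap_end) rows for gaps past the watermark."""
    watermark = now_ts - allowed_lateness_s
    return conn.execute(GAP_SQL, {"watermark": watermark}).fetchall()
```

The lateness allowance trades alert latency against false positives. A reasonable starting point, assuming arrival delays are measured, is a high percentile of the observed delay; a late event that fills a gap simply makes the gap disappear on the next run.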
Hard | Technical
You must run heavy validation queries nightly over petabytes of partitioned data. Propose optimizations to reduce cost and runtime: partition pruning, incremental checks, sampling strategies, approximate aggregates, precomputed metadata, and separation of fast/slow checks. Provide example SQL/pseudocode and discuss trade-offs around accuracy and cost.
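As one illustration of combining precomputed metadata, incremental checks, and the fast/slow split, the sketch below compares per-partition row counts and sums for recent partitions only. Everything named here is an assumption made for the example: the metadata table `partition_stats(table_name, partition_date, row_count, amount_sum)`, which each pipeline is presumed to write when it finishes a partition, and the function `mismatched_partitions`. It runs on SQLite via Python's `sqlite3` purely so the SQL can be exercised locally.

```python
import sqlite3

# Hypothetical metadata table, written by each pipeline per finished partition:
#   partition_stats(table_name, partition_date, row_count, amount_sum)
MISMATCH_SQL = """
SELECT s.partition_date,
       s.row_count  AS source_rows,
       t.row_count  AS target_rows,
       s.amount_sum AS source_sum,
       t.amount_sum AS target_sum
FROM partition_stats AS s
LEFT JOIN partition_stats AS t
       ON t.partition_date = s.partition_date
      AND t.table_name = :target
WHERE s.table_name = :source
  AND s.partition_date >= :since          -- incremental: recent partitions only
  AND (t.row_count IS NULL                -- partition missing downstream
       OR t.row_count <> s.row_count
       OR ABS(t.amount_sum - s.amount_sum) > :tolerance)
ORDER BY s.partition_date
"""


def mismatched_partitions(conn, source, target, since, tolerance=0.01):
    """Fast check: list partitions whose cheap statistics disagree."""
    params = {"source": source, "target": target,
              "since": since, "tolerance": tolerance}
    return conn.execute(MISMATCH_SQL, params).fetchall()
```

Only the partitions this returns would be passed to the slow, full-scan checks. The trade-off is that the check is only as trustworthy as the metadata: a pipeline that writes wrong statistics consistently on both sides goes unnoticed, so an occasional sampled full comparison is still worth scheduling.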
Hard | Technical
Propose a governance model and a data contract specification that enforces validation checkpoints between producer and consumer teams. Include contract fields (name, type, constraints), versioning policy, compatibility rules, validation tooling, and how to handle contract violations in CI and production.
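A contract and its compatibility rule can be made concrete with a few lines of code. The following is a minimal sketch under stated assumptions, not a reference to any existing contract tool: the classes `ContractField` and `Contract`, the "MAJOR.MINOR" version string, and the functions `breaking_changes` and `ci_gate` are all hypothetical. It takes the consumer's point of view, treating removed fields, changed types, and required fields becoming nullable as breaking, and it lets a breaking change through CI only when the major version is bumped.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContractField:
    name: str
    type: str
    nullable: bool = True


@dataclass(frozen=True)
class Contract:
    name: str
    version: str      # assumed "MAJOR.MINOR" format
    fields: tuple     # tuple of ContractField


def breaking_changes(old, new):
    """List changes that could break an existing consumer of `old`."""
    new_by_name = {f.name: f for f in new.fields}
    problems = []
    for f in old.fields:
        g = new_by_name.get(f.name)
        if g is None:
            problems.append(f"removed field: {f.name}")
        elif g.type != f.type:
            problems.append(f"type changed: {f.name} {f.type} -> {g.type}")
        elif g.nullable and not f.nullable:
            problems.append(f"became nullable: {f.name}")
    return problems


def ci_gate(old, new):
    """Pass if nothing breaks, or if the major version was bumped."""
    old_major = int(old.version.split(".")[0])
    new_major = int(new.version.split(".")[0])
    return not breaking_changes(old, new) or new_major > old_major
```

Adding a new field is treated as compatible here; a stricter policy could be layered on without changing the structure.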
Hard | Technical
Total revenue for a region increased 300% yesterday. Provide a prioritized set of investigative SQL queries and sampling strategies you would run to narrow down the cause. Include checks against raw events, hourly aggregates, product SKUs, promotions table joins, timezone and currency issues, and sampling a subset of suspicious rows for manual inspection. Show short example snippets or pseudocode.
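A prioritized investigation can be expressed as an ordered list of small queries, cheapest and most likely culprits first. The sketch below is illustrative only: the table `orders_raw(order_id, region, sku, currency, amount, occurred_at)` and the function `investigate` are assumptions, and the SQL runs on SQLite via Python's `sqlite3`. It covers three of the checks named above (duplicate loads, a currency breakdown, and per-SKU change against the previous day); timezone and promotion joins would follow the same pattern.

```python
import sqlite3

# Hypothetical table: orders_raw(order_id, region, sku, currency, amount, occurred_at)
CHECKS = [
    # 1. Duplicate loads are a common cause of a sudden multiple of revenue.
    ("duplicate_orders", """
        SELECT order_id, COUNT(*) AS copies
        FROM orders_raw
        WHERE region = :region AND date(occurred_at) = :day
        GROUP BY order_id
        HAVING COUNT(*) > 1
        ORDER BY copies DESC
    """),
    # 2. An unexpected currency suggests a missing conversion step.
    ("revenue_by_currency", """
        SELECT currency, COUNT(*) AS n_orders, SUM(amount) AS revenue
        FROM orders_raw
        WHERE region = :region AND date(occurred_at) = :day
        GROUP BY currency
        ORDER BY revenue DESC
    """),
    # 3. Which SKUs account for the increase versus the previous day?
    ("sku_delta_vs_prev_day", """
        SELECT sku, today, prev_day
        FROM (
            SELECT sku,
                   SUM(CASE WHEN date(occurred_at) = :day
                            THEN amount ELSE 0 END) AS today,
                   SUM(CASE WHEN date(occurred_at) = date(:day, '-1 day')
                            THEN amount ELSE 0 END) AS prev_day
            FROM orders_raw
            WHERE region = :region
              AND date(occurred_at) IN (:day, date(:day, '-1 day'))
            GROUP BY sku
        )
        ORDER BY today - prev_day DESC
        LIMIT 10
    """),
]


def investigate(conn, region, day):
    """Run each check in priority order; return {check_name: rows}."""
    params = {"region": region, "day": day}
    return {name: conn.execute(sql, params).fetchall() for name, sql in CHECKS}
```

Rows surfaced by the first non-empty check are natural candidates for the manual sample.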
Easy | Technical
Write a SQL query to sanity-check daily revenue for the last 30 days. Given orders(order_id, user_id, amount numeric, occurred_at date), compute for each date: total_revenue, 7_day_moving_avg, and pct_change_vs_moving_avg, and flag dates where pct_change_vs_moving_avg is above 30% or below -30%. Use window functions and explain your assumptions about missing days.
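A possible shape for this query is sketched below, run on SQLite through Python's `sqlite3` so it can be tried locally. Several choices are assumptions rather than requirements of the question: the moving average covers the 7 days before each date and excludes the date itself, so a spike does not dilute its own baseline; missing days are filled with zero revenue from a generated calendar, so the row-based window really spans 7 calendar days; the column is named `moving_avg_7d` because unquoted SQL identifiers cannot start with a digit; and the `:as_of` parameter and the function `daily_revenue_flags` are hypothetical.

```python
import sqlite3

SANITY_SQL = """
WITH RECURSIVE calendar(day) AS (
    -- 30 reported days plus 7 days of warm-up for the moving average.
    SELECT date(:as_of, '-36 days')
    UNION ALL
    SELECT date(day, '+1 day') FROM calendar WHERE day < :as_of
),
daily AS (
    -- Days with no orders appear with zero revenue instead of vanishing.
    SELECT c.day, COALESCE(SUM(o.amount), 0) AS total_revenue
    FROM calendar AS c
    LEFT JOIN orders AS o ON o.occurred_at = c.day
    GROUP BY c.day
),
averaged AS (
    SELECT day, total_revenue,
           AVG(total_revenue) OVER (
               ORDER BY day ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING
           ) AS moving_avg_7d
    FROM daily
),
scored AS (
    -- pct_change stays NULL when the baseline is zero or missing.
    SELECT day, total_revenue, moving_avg_7d,
           CASE WHEN moving_avg_7d > 0
                THEN 100.0 * (total_revenue - moving_avg_7d) / moving_avg_7d
           END AS pct_change_vs_moving_avg
    FROM averaged
)
SELECT day, total_revenue, moving_avg_7d, pct_change_vs_moving_avg,
       CASE WHEN ABS(pct_change_vs_moving_avg) > 30 THEN 1 ELSE 0 END AS flagged
FROM scored
WHERE day > date(:as_of, '-30 days')
ORDER BY day
"""


def daily_revenue_flags(conn, as_of):
    """Return one row per day for the 30 days ending at `as_of` (YYYY-MM-DD)."""
    return conn.execute(SANITY_SQL, {"as_of": as_of}).fetchall()
```

Under these assumptions a day with a zero baseline is never flagged, because the percentage change is undefined; whether that case should raise its own alert is a judgment call worth stating in an answer.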