Data Cleaning and Quality Validation in SQL Questions

Handle NULL values, duplicates, and data type issues within queries. Implement data validation checks (row counts, value distributions, date ranges). Practice identifying and documenting data quality issues that impact analysis reliability.

Medium | Technical

Design SQL-based alert rules for critical data quality checks and categorize severity. For example: null_rate(order_date) > 5% = HIGH, row_count drift > 2% = MEDIUM, ingestion lag > 60 minutes = HIGH. Provide sample SQL that computes the current status for these three rules against an orders ingestion metadata table, and describe when an alert should escalate from email to on-call paging.
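
One possible shape for the status query is sketched below. Everything about the metadata table is an assumption made for illustration: the table name `orders_ingestion_log`, its columns, the use of epoch seconds for load times, and passing the current time in as the `:now_epoch` parameter are all hypothetical. The SQL is run through Python's built-in sqlite3 module only so the sketch can be executed locally; a real warehouse would use its own timestamp functions and parameter syntax.

```python
import sqlite3

# Hypothetical metadata table: one row per load of the orders table.
# Thresholds mirror the rules in the question (5%, 2%, 60 minutes).
ALERT_STATUS_SQL = """
WITH ranked AS (
    SELECT loaded_at_epoch, row_count, null_order_date_count,
           ROW_NUMBER() OVER (ORDER BY loaded_at_epoch DESC) AS rn
    FROM orders_ingestion_log
),
latest AS (SELECT * FROM ranked WHERE rn = 1),
previous AS (SELECT * FROM ranked WHERE rn = 2),
metrics AS (
    SELECT 100.0 * l.null_order_date_count / NULLIF(l.row_count, 0) AS null_rate_pct,
           100.0 * ABS(l.row_count - p.row_count) / NULLIF(p.row_count, 0) AS row_drift_pct,
           (:now_epoch - l.loaded_at_epoch) / 60.0 AS lag_minutes
    FROM latest AS l
    LEFT JOIN previous AS p ON 1 = 1
),
checks AS (
    SELECT 'null_rate_order_date_pct' AS rule, 'HIGH' AS severity,
           null_rate_pct AS metric_value, 5.0 AS threshold
    FROM metrics
    UNION ALL
    SELECT 'row_count_drift_pct', 'MEDIUM', row_drift_pct, 2.0 FROM metrics
    UNION ALL
    SELECT 'ingestion_lag_minutes', 'HIGH', lag_minutes, 60.0 FROM metrics
)
SELECT rule, severity, metric_value, threshold,
       CASE WHEN metric_value IS NULL THEN 'UNKNOWN'
            WHEN metric_value > threshold THEN 'ALERT'
            ELSE 'OK'
       END AS status
FROM checks
"""


def evaluate_rules(conn, now_epoch):
    """Return {rule: (severity, status)} for the most recent load."""
    rows = conn.execute(ALERT_STATUS_SQL, {"now_epoch": now_epoch}).fetchall()
    return {rule: (severity, status) for rule, severity, _v, _t, status in rows}


def make_demo_ingestion_log():
    """Build an in-memory table with two illustrative loads, one hour apart."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders_ingestion_log ("
        "loaded_at_epoch INTEGER, row_count INTEGER, null_order_date_count INTEGER)"
    )
    conn.executemany(
        "INSERT INTO orders_ingestion_log VALUES (?, ?, ?)",
        [
            (1_700_000_000, 1000, 10),  # previous load
            (1_700_003_600, 1030, 31),  # latest load: about 3% NULLs, 3% row drift
        ],
    )
    return conn


if __name__ == "__main__":
    # "Now" is 90 minutes after the latest load, so the lag rule fires.
    for rule, result in evaluate_rules(make_demo_ingestion_log(), 1_700_009_000).items():
        print(rule, result)
```

For escalation, one reasonable policy (an assumption, not the only answer) is to page on-call when a HIGH rule is in ALERT and to send email for MEDIUM, escalating a MEDIUM alert to paging only if it persists across several consecutive runs.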

Medium | Technical

Create a query that computes the daily null rate of the column 'event_value' in an events table, then computes a 7-day rolling average of that daily rate using window functions. Table:

events(event_date DATE, event_value NUMERIC)

Return columns: event_date, null_rate, rolling_null_rate_7d. Use standard SQL (PostgreSQL syntax ok).
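
A minimal sketch of one way to answer this follows. The query uses only standard SQL (a CTE, conditional aggregation, and a `ROWS` window frame), so it should also run on PostgreSQL; it is executed here through Python's built-in sqlite3 module purely so it can be tried locally. It assumes every calendar day has at least one row, because a `ROWS BETWEEN 6 PRECEDING` frame counts rows rather than days. The rolling figure is the unweighted average of the daily rates. The helper names and sample data are illustrative.

```python
import sqlite3

# Step 1 (daily): share of NULL event_value per day.
# Step 2: average of the daily rates over the current row and the 6 before it.
ROLLING_NULL_RATE_SQL = """
WITH daily AS (
    SELECT event_date,
           AVG(CASE WHEN event_value IS NULL THEN 1.0 ELSE 0.0 END) AS null_rate
    FROM events
    GROUP BY event_date
)
SELECT event_date,
       null_rate,
       AVG(null_rate) OVER (
           ORDER BY event_date
           ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
       ) AS rolling_null_rate_7d
FROM daily
ORDER BY event_date
"""


def rolling_null_rates(conn):
    """Return (event_date, null_rate, rolling_null_rate_7d) rows."""
    return conn.execute(ROLLING_NULL_RATE_SQL).fetchall()


def make_demo_events():
    """Build a small in-memory events table with known null rates."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (event_date DATE, event_value NUMERIC)")
    conn.executemany(
        "INSERT INTO events VALUES (?, ?)",
        [
            ("2024-01-01", 1), ("2024-01-01", None),     # 1 of 2 NULL
            ("2024-01-02", 2), ("2024-01-02", 3),        # 0 of 2 NULL
            ("2024-01-03", None), ("2024-01-03", None),  # 2 of 2 NULL
        ],
    )
    return conn


if __name__ == "__main__":
    for row in rolling_null_rates(make_demo_events()):
        print(row)
```

If days can be missing, one option is to join against a generated calendar of dates first so that each frame really spans seven days.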

Medium | Technical

Event timestamps arrive as strings with timezone offsets from multiple producers, e.g., '2024-10-05T13:45:00-07:00' or '2024-10-06 21:00:00 UTC'. Write SQL (Postgres or BigQuery) to parse the timestamp strings and normalize them to TIMESTAMP WITH TIME ZONE (UTC). Also write a query to find rows where parsing fails, for manual inspection.

Hard | System Design

Design a fault-tolerant, scalable SQL-first data quality framework for a cloud data warehouse (e.g., BigQuery or Snowflake) that must run checks across 1000 tables nightly within 2 hours. Describe the architecture (orchestration, storage for results, templates for checks), how checks are defined and parametrized in SQL, how to optimize compute cost, and how to store historical DQ metrics for trending.

Easy | Technical

Given a logs table where duplicates are defined by (user_id, event_type, event_date), design a SQL statement to deduplicate the table, keeping only the row with the most recent updated_at. Table:

user_events(id BIGINT, user_id INT, event_type TEXT, event_date DATE, updated_at TIMESTAMP)

Provide a DELETE using a window function (PostgreSQL syntax) that removes duplicates while keeping the canonical row.
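
One way to write such a DELETE is sketched below. It ranks rows within each (user_id, event_type, event_date) group with `ROW_NUMBER()` and deletes everything except rank 1. The statement is intended to be valid in PostgreSQL as well, but it is run here through Python's built-in sqlite3 module so it can be tested locally. Two assumptions are made: `id` is unique, and `updated_at` is never NULL (PostgreSQL sorts NULLs first under `DESC`, which would otherwise keep a NULL row). Ties on `updated_at` are broken by the highest `id`, which is an arbitrary choice. The helper names and sample rows are illustrative.

```python
import sqlite3

# Rank rows inside each duplicate group, newest first, and delete rank > 1.
DEDUP_SQL = """
DELETE FROM user_events
WHERE id IN (
    SELECT id
    FROM (
        SELECT id,
               ROW_NUMBER() OVER (
                   PARTITION BY user_id, event_type, event_date
                   ORDER BY updated_at DESC, id DESC
               ) AS rn
        FROM user_events
    ) AS ranked
    WHERE rn > 1
)
"""


def deduplicate(conn):
    """Delete non-canonical duplicates and return how many rows were removed."""
    cur = conn.execute(DEDUP_SQL)
    conn.commit()
    return cur.rowcount


def make_demo_user_events():
    """Build an in-memory user_events table containing one duplicate group."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE user_events ("
        "id BIGINT, user_id INT, event_type TEXT, event_date DATE, updated_at TIMESTAMP)"
    )
    conn.executemany(
        "INSERT INTO user_events VALUES (?, ?, ?, ?, ?)",
        [
            (1, 10, "click", "2024-01-01", "2024-01-01 08:00:00"),  # older duplicate
            (2, 10, "click", "2024-01-01", "2024-01-01 09:00:00"),  # ties with id 5
            (3, 10, "view", "2024-01-01", "2024-01-01 07:00:00"),   # unique
            (4, 11, "click", "2024-01-01", "2024-01-01 06:00:00"),  # unique
            (5, 10, "click", "2024-01-01", "2024-01-01 09:00:00"),  # kept: highest id on tie
        ],
    )
    return conn


if __name__ == "__main__":
    conn = make_demo_user_events()
    print("deleted:", deduplicate(conn))
    print(conn.execute("SELECT id FROM user_events ORDER BY id").fetchall())
```

On a large table it may be worth running the inner SELECT on its own first to review which rows would be removed before executing the DELETE.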
