Data Cleaning and Quality Validation in SQL Questions
Handle NULL values, duplicates, and data type issues within queries. Implement data validation checks (row counts, value distributions, date ranges). Practice identifying and documenting data quality issues that impact analysis reliability.
Medium · Technical
You need to find likely duplicate customer rows using fuzzy matching on name and email. Given the customers table listed below, write a PostgreSQL query using pg_trgm similarity() or levenshtein() to return candidate pairs with name similarity >= 0.8 OR email similarity >= 0.9. Describe the pros and cons of this approach and how you would tune the thresholds to balance precision and recall for deduplication.
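One way to approach this is a self-join with a.id < b.id so each candidate pair is returned once. The sketch below runs that join in SQLite through Python so it can be executed without a PostgreSQL server. The Python similarity() function is a stand-in that only approximates pg_trgm (lowercased word trigrams, shared over total distinct), and the sample rows are invented for illustration. In PostgreSQL the same SELECT would call the extension's own similarity() after CREATE EXTENSION pg_trgm.

```python
import re
import sqlite3


def trigrams(text):
    """Word trigrams, roughly as pg_trgm builds them: lowercase,
    split on non-alphanumerics, pad each word with two leading
    spaces and one trailing space."""
    grams = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = f"  {word} "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a, b):
    """Stand-in for pg_trgm similarity(): shared trigrams / distinct trigrams."""
    if a is None or b is None:
        return None
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


conn = sqlite3.connect(":memory:")
# Register the stand-in so the SQL text matches what you would run in PostgreSQL.
conn.create_function("similarity", 2, similarity)
conn.executescript("""
CREATE TABLE customers (id INT, name TEXT, email TEXT, phone TEXT);
-- Sample rows are made up for this sketch.
INSERT INTO customers VALUES
  (1, 'Jonathan Smith', 'jon.smith@example.com',    '555-0101'),
  (2, 'Jonathon Smith', 'Jon.Smith@Example.com',    '555-0101'),
  (3, 'Maria Garcia',   'maria.garcia@example.com', '555-0102'),
  (4, 'Maria  Garcia',  'mgarcia@example.net',      NULL),
  (5, 'Wei Chen',       'wei.chen@example.org',     '555-0103');
""")

CANDIDATE_PAIRS_SQL = """
SELECT a.id AS id_a,
       b.id AS id_b,
       similarity(a.name, b.name)   AS name_sim,
       similarity(a.email, b.email) AS email_sim
FROM customers AS a
JOIN customers AS b
  ON a.id < b.id                      -- each unordered pair once, no self-pairs
WHERE similarity(a.name, b.name) >= 0.8
   OR similarity(a.email, b.email) >= 0.9
ORDER BY a.id, b.id
"""

candidate_pairs = conn.execute(CANDIDATE_PAIRS_SQL).fetchall()
for row in candidate_pairs:
    print(row)
```

The full self-join compares every pair, so it grows quadratically with table size. For threshold tuning, one option is to label a sample of returned pairs by hand and track how precision and recall move as the two cutoffs change.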
customers(id INT, name TEXT, email TEXT, phone TEXT)
Easy · Technical
Given the users table listed below, write a single SQL query (in PostgreSQL or standard SQL) that returns the rows where any critical analytics column (email or signup_date) is NULL. Include a column that indicates which critical columns are NULL (e.g., 'email', 'signup_date', or 'email,signup_date'). Also return the total count of problematic rows, as a separate row or column, so an analyst can quickly see the magnitude of the problem.
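A minimal sketch of one possible answer: a CASE expression labels which critical columns are NULL, and COUNT(*) OVER () attaches the total number of problematic rows to every returned row. It is run here in SQLite through Python for convenience, and the sample rows are invented. The SELECT itself uses only standard SQL and should carry over to PostgreSQL unchanged.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (
    user_id     INTEGER PRIMARY KEY,
    email       TEXT,
    signup_date TIMESTAMP,
    country     TEXT
);
-- Sample rows are made up for this sketch.
INSERT INTO users VALUES
  (1, 'a@example.com', '2024-01-05 10:00:00', 'US'),
  (2, NULL,            '2024-01-06 11:30:00', 'DE'),
  (3, 'c@example.com', NULL,                  'FR'),
  (4, NULL,            NULL,                  NULL);
""")

NULL_AUDIT_SQL = """
SELECT user_id,
       CASE
           WHEN email IS NULL AND signup_date IS NULL THEN 'email,signup_date'
           WHEN email IS NULL                         THEN 'email'
           ELSE 'signup_date'   -- WHERE guarantees at least one column is NULL
       END AS null_columns,
       COUNT(*) OVER () AS total_problem_rows  -- same total on every row
FROM users
WHERE email IS NULL
   OR signup_date IS NULL
ORDER BY user_id
"""

null_audit_rows = conn.execute(NULL_AUDIT_SQL).fetchall()
for row in null_audit_rows:
    print(row)
```

The window function is evaluated after the WHERE filter, so the total counts only the problematic rows rather than the whole table.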
users(user_id INTEGER PRIMARY KEY, email TEXT, signup_date TIMESTAMP, country TEXT)
Medium · Technical
Explain how to implement row-level data lineage using only SQL constructs: capture source_file, source_row_id, ingestion_batch_id in staging, propagate these fields through transformations, and provide a sample SQL pattern that joins a transformed analytics row back to its originating source rows for root cause analysis. Use example table names: staging.raw_events and analytics.daily_events.
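The pattern can be sketched as follows, assuming analytics.daily_events is a daily aggregate by event type. Because an aggregate row has many source rows, this sketch writes a bridge table in the same transformation step that maps each aggregate key to the lineage columns. The bridge table name analytics.daily_events_lineage, the event_type and event_ts columns, and the sample rows are all hypothetical. SQLite attached databases stand in for the staging and analytics schemas.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Attached in-memory databases play the role of the two schemas.
conn.execute("ATTACH DATABASE ':memory:' AS staging")
conn.execute("ATTACH DATABASE ':memory:' AS analytics")

conn.executescript("""
CREATE TABLE staging.raw_events (
    source_file        TEXT,
    source_row_id      INTEGER,
    ingestion_batch_id TEXT,
    event_type         TEXT,   -- hypothetical payload column
    event_ts           TEXT    -- hypothetical payload column
);
INSERT INTO staging.raw_events VALUES
  ('events_a.csv', 1, 'batch_001', 'click', '2024-03-01 09:00:00'),
  ('events_a.csv', 2, 'batch_001', 'click', '2024-03-01 17:45:00'),
  ('events_b.csv', 1, 'batch_002', 'view',  '2024-03-01 12:10:00'),
  ('events_b.csv', 2, 'batch_002', 'click', '2024-03-02 08:30:00');

-- Transformation: the aggregated analytics table.
CREATE TABLE analytics.daily_events AS
SELECT date(event_ts) AS event_date, event_type, COUNT(*) AS event_count
FROM staging.raw_events
GROUP BY date(event_ts), event_type;

-- Lineage bridge (hypothetical name): one row per contributing source row,
-- keyed by the same columns as the aggregate.
CREATE TABLE analytics.daily_events_lineage AS
SELECT date(event_ts) AS event_date, event_type,
       source_file, source_row_id, ingestion_batch_id
FROM staging.raw_events;
""")

TRACE_SQL = """
SELECT d.event_date, d.event_type, d.event_count,
       r.source_file, r.source_row_id, r.ingestion_batch_id
FROM analytics.daily_events AS d
JOIN analytics.daily_events_lineage AS l
  ON l.event_date = d.event_date
 AND l.event_type = d.event_type
JOIN staging.raw_events AS r
  ON r.source_file        = l.source_file
 AND r.source_row_id      = l.source_row_id
 AND r.ingestion_batch_id = l.ingestion_batch_id
WHERE d.event_date = ? AND d.event_type = ?
ORDER BY r.source_file, r.source_row_id
"""

# Trace one suspicious analytics row back to its source rows.
traced = conn.execute(TRACE_SQL, ("2024-03-01", "click")).fetchall()
for row in traced:
    print(row)
```

A quick sanity check on this design is that the number of traced source rows equals event_count for the analytics row being investigated.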
Easy · Technical
Design a quick reconciliation query to compare row counts between a source (staging) table and a target analytics table after an incremental ETL. The staging table contains all rows for the current partition. Describe SQL to: 1) compare counts, 2) compute percentage difference, and 3) return a PASS if difference <= 0.5% else FAIL. Use schema:
staging.events(partition_date DATE, id STRING)
warehouse.events(partition_date DATE, id STRING)
Easy · Technical
You are responsible for nightly ETL validation for an orders table. List and then implement (as SQL snippets) at least five basic validation checks that should run after the load, e.g., row count reconciliation, null rate for required fields, max/min order_date, referential integrity to customers, and distinct count of order_id. Use the orders table schema listed below. Return each check as a row with columns: check_name, expected, observed, status (PASS/FAIL).
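One possible shape for the answer is a UNION ALL of one SELECT per check, each producing the same four columns. The sketch below runs in SQLite through Python. The staging_orders and customers tables, the load window of 2024-03-01 to 2024-03-02, and the sample rows are assumptions made for illustration and are not part of the question. Values are cast to text in every branch so the union also type-checks in stricter engines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id TEXT, customer_id TEXT, amount NUMERIC, order_date DATE);
CREATE TABLE staging_orders (order_id TEXT);   -- hypothetical source of the load
CREATE TABLE customers (customer_id TEXT);     -- hypothetical dimension table

-- Sample rows are made up: o3 points at a missing customer, o4 has a NULL key.
INSERT INTO staging_orders VALUES ('o1'), ('o2'), ('o3'), ('o4');
INSERT INTO customers VALUES ('c1'), ('c2');
INSERT INTO orders VALUES
  ('o1', 'c1', 10.00, '2024-03-01'),
  ('o2', 'c2', 25.50, '2024-03-01'),
  ('o3', 'c9',  5.00, '2024-03-02'),
  ('o4', NULL,  7.25, '2024-03-02');
""")

VALIDATION_SQL = """
WITH checks(check_name, expected, observed, passed) AS (
    SELECT 'row_count_reconciliation',
           CAST((SELECT COUNT(*) FROM staging_orders) AS TEXT),
           CAST((SELECT COUNT(*) FROM orders) AS TEXT),
           (SELECT COUNT(*) FROM staging_orders) = (SELECT COUNT(*) FROM orders)
    UNION ALL
    SELECT 'required_fields_not_null', '0', CAST(COUNT(*) AS TEXT), COUNT(*) = 0
    FROM orders
    WHERE order_id IS NULL OR customer_id IS NULL
       OR amount IS NULL OR order_date IS NULL
    UNION ALL
    SELECT 'order_date_in_load_window',
           '2024-03-01..2024-03-02',                -- assumed load window
           MIN(order_date) || '..' || MAX(order_date),
           MIN(order_date) >= '2024-03-01' AND MAX(order_date) <= '2024-03-02'
    FROM orders
    UNION ALL
    SELECT 'customer_fk_orphans', '0', CAST(COUNT(*) AS TEXT), COUNT(*) = 0
    FROM orders AS o
    LEFT JOIN customers AS c ON c.customer_id = o.customer_id
    WHERE o.customer_id IS NOT NULL AND c.customer_id IS NULL
    UNION ALL
    SELECT 'order_id_unique',
           CAST(COUNT(*) AS TEXT),
           CAST(COUNT(DISTINCT order_id) AS TEXT),
           COUNT(*) = COUNT(DISTINCT order_id)
    FROM orders
)
SELECT check_name, expected, observed,
       CASE WHEN passed THEN 'PASS' ELSE 'FAIL' END AS status  -- NULL counts as FAIL
FROM checks
"""

# Key the report by check name; UNION ALL row order is not guaranteed.
results = {row[0]: row[1:] for row in conn.execute(VALIDATION_SQL)}
for name, outcome in results.items():
    print(name, outcome)
```

With the sample data, the NULL customer_id on o4 and the unknown customer on o3 each cause one check to fail, while the other three pass.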
orders(order_id STRING, customer_id STRING, amount NUMERIC, order_date DATE)