Data Cleaning and Quality Validation in SQL Questions
Handle NULL values, duplicates, and data type issues within queries. Implement data validation checks (row counts, value distributions, date ranges). Practice identifying and documenting data quality issues that impact analysis reliability.
Medium | Technical
72 practiced
You need to merge approximate duplicates into a canonical customer record while preserving history via SCD Type 2. Given customers with potential duplicates, describe the SQL steps to: 1) identify the canonical record in each duplicate group (the one with the most recent verified_at), 2) insert new canonical records into a dimension table with surrogate keys, and 3) create history rows with effective_from/effective_to. Provide example SQL snippets using window functions and MERGE/INSERT statements.
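One possible shape of an answer, as a minimal sketch rather than a reference solution. It runs the SQL through Python's built-in sqlite3 so it can be executed anywhere. SQLite has no MERGE, so an UPDATE followed by an INSERT stands in for it; on PostgreSQL 15 or later a MERGE statement could do the same job. The table names customers_raw and dim_customer, the canonical view, and the use of a normalized email as the duplicate-matching key are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers_raw (source_id INTEGER, email TEXT, name TEXT, verified_at TEXT);

-- Hypothetical SCD Type 2 dimension with an auto-generated surrogate key.
CREATE TABLE dim_customer (
    customer_sk    INTEGER PRIMARY KEY AUTOINCREMENT,
    match_key      TEXT,
    name           TEXT,
    effective_from TEXT,
    effective_to   TEXT,
    is_current     INTEGER
);

-- Step 1: one canonical row per duplicate group, newest verified_at wins.
-- Assumption: duplicates are rows whose email matches after LOWER/TRIM.
CREATE VIEW canonical AS
SELECT match_key, name, verified_at
FROM (
    SELECT LOWER(TRIM(email)) AS match_key, name, verified_at,
           ROW_NUMBER() OVER (PARTITION BY LOWER(TRIM(email))
                              ORDER BY verified_at DESC) AS rn
    FROM customers_raw
)
WHERE rn = 1;
""")


def apply_scd2(conn):
    """Close out changed current rows, then insert new current versions."""
    with conn:  # one transaction for both statements
        # Step 3a: end-date the current row when the canonical name changed.
        conn.execute("""
            UPDATE dim_customer
            SET effective_to = (SELECT c.verified_at FROM canonical c
                                WHERE c.match_key = dim_customer.match_key),
                is_current = 0
            WHERE is_current = 1
              AND EXISTS (SELECT 1 FROM canonical c
                          WHERE c.match_key = dim_customer.match_key
                            AND c.name <> dim_customer.name)
        """)
        # Steps 2 and 3b: insert a current row for every key that lacks one
        # (brand-new customers and the ones just closed out above).
        conn.execute("""
            INSERT INTO dim_customer
                (match_key, name, effective_from, effective_to, is_current)
            SELECT c.match_key, c.name, c.verified_at, NULL, 1
            FROM canonical c
            LEFT JOIN dim_customer d
                   ON d.match_key = c.match_key AND d.is_current = 1
            WHERE d.customer_sk IS NULL
        """)


# First load: two spellings of the same email collapse to one canonical row.
conn.executemany("INSERT INTO customers_raw VALUES (?, ?, ?, ?)", [
    (1, "Ann@x.com ", "Ann", "2024-01-01"),
    (2, "ann@x.com", "Ann Lee", "2024-02-01"),
])
apply_scd2(conn)

# Second load: a newer verified record changes the name, creating history.
conn.execute("INSERT INTO customers_raw VALUES (3, 'ANN@x.com', 'Ann L. Lee', '2024-03-01')")
apply_scd2(conn)

history = conn.execute("""
    SELECT name, effective_from, effective_to, is_current
    FROM dim_customer ORDER BY customer_sk
""").fetchall()
```

After the second load, history holds a closed row for the earlier name and an open row (effective_to is NULL) for the latest one. A real answer would also need to discuss fuzzier matching rules than exact normalized email, which this sketch does not attempt.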
Hard | System Design
87 practiced
Design a low-latency streaming data validation architecture for event data using Kafka and stream processors (ksqlDB, Kafka Streams, or Spark Structured Streaming). Requirements: per-event schema/type validation, detect and deduplicate retries within a 5-minute window, route invalid events to a DLQ for later inspection, and emit metrics to a monitoring system. Describe how SQL-like stream processing or SQL UDFs can be used to implement real-time validations and what trade-offs exist compared to batch SQL checks.
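The per-event logic in this design can be illustrated without any streaming infrastructure. The sketch below is plain Python with in-memory lists standing in for the output topic and the DLQ; it is not Kafka Streams or ksqlDB code. The schema in REQUIRED_FIELDS and the class name EventValidator are hypothetical, and the deduplication window is measured from the first accepted occurrence of an event_id.

```python
from collections import Counter

DEDUP_WINDOW_SECONDS = 5 * 60

# Hypothetical schema: field name -> accepted Python type(s).
REQUIRED_FIELDS = {"event_id": str, "ts": (int, float), "payload": dict}


class EventValidator:
    """Validates, deduplicates, and routes events one at a time."""

    def __init__(self):
        self.seen = {}            # event_id -> timestamp of first accepted copy
        self.valid = []           # stand-in for the clean output topic
        self.dlq = []             # stand-in for the dead-letter queue
        self.metrics = Counter()  # stand-in for a metrics emitter

    def process(self, event):
        # 1) Per-event schema/type validation.
        for name, expected in REQUIRED_FIELDS.items():
            if not isinstance(event.get(name), expected):
                self.dlq.append({"event": event,
                                 "error": f"bad or missing field: {name}"})
                self.metrics["invalid"] += 1
                return

        ts = event["ts"]
        # 2) Drop dedup state older than the window. A full scan per event
        #    is fine for a sketch; a real processor would use a windowed store.
        self.seen = {k: v for k, v in self.seen.items()
                     if ts - v < DEDUP_WINDOW_SECONDS}

        # 3) Suppress retries seen inside the window.
        if event["event_id"] in self.seen:
            self.metrics["duplicate"] += 1
            return

        self.seen[event["event_id"]] = ts
        self.valid.append(event)
        self.metrics["valid"] += 1


validator = EventValidator()
for ev in [
    {"event_id": "e1", "ts": 0, "payload": {}},
    {"event_id": "e1", "ts": 100, "payload": {}},  # retry inside the window
    {"event_id": "e1", "ts": 400, "payload": {}},  # same id, window expired
    {"event_id": "e2", "ts": 410},                 # missing payload
]:
    validator.process(ev)
```

In a stream processor the seen dictionary would correspond to a windowed state store keyed by event_id, which is where the main trade-off against batch SQL checks shows up: state is bounded and per-key, so checks that need the whole dataset (row counts, distributions) still belong in batch.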
Medium | Technical
81 practiced
You discovered that numeric amounts are stored as text and include formatting like '$1,234.56' and '(1,234.56)' for negatives. Write PostgreSQL SQL to clean and cast amount_text into a numeric column 'amount', correctly handling commas, currency symbols, and negative values in parentheses. Show how you'd surface rows that still fail casting after cleaning.
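A runnable sketch of the cleaning logic, executed through Python's built-in sqlite3. The question asks for PostgreSQL, where regexp_replace and a regex match would be the natural tools; SQLite lacks those by default, so nested REPLACE calls and GLOB/LIKE checks stand in for them here. The payments table, the cleaned_payments view, and the sample values are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE payments (id INTEGER PRIMARY KEY, amount_text TEXT);
INSERT INTO payments VALUES
    (1, '$1,234.56'), (2, '(1,234.56)'), (3, '  42 '), (4, 'N/A'), (5, '12.3.4');

CREATE VIEW cleaned_payments AS
WITH stripped AS (
    -- Drop '$' and ',', turn '(' into a leading minus, drop ')'.
    -- Simplification: this does not check that parentheses are balanced.
    SELECT id, amount_text,
           REPLACE(REPLACE(REPLACE(REPLACE(TRIM(amount_text),
               '$', ''), ',', ''), '(', '-'), ')', '') AS cleaned
    FROM payments
)
SELECT id, amount_text,
       -- Cast only strings that look numeric, otherwise leave amount NULL.
       CASE WHEN cleaned GLOB '*[0-9]*'             -- at least one digit
             AND cleaned NOT GLOB '*[^0-9.-]*'      -- only digits, dot, minus
             AND cleaned NOT LIKE '%.%.%'           -- at most one decimal point
             AND SUBSTR(cleaned, 2) NOT LIKE '%-%'  -- minus only in front
            THEN CAST(cleaned AS REAL) END AS amount
FROM stripped;
""")

amounts = conn.execute(
    "SELECT id, amount FROM cleaned_payments ORDER BY id"
).fetchall()

# Rows that had text but still could not be cast after cleaning.
failed = conn.execute("""
    SELECT id, amount_text
    FROM cleaned_payments
    WHERE amount IS NULL AND amount_text IS NOT NULL
    ORDER BY id
""").fetchall()
```

Validating before casting matters in both engines, for different reasons: SQLite's CAST silently returns 0 for non-numeric text, while a PostgreSQL cast to numeric raises an error and aborts the query. For money, a PostgreSQL version would cast to numeric rather than a floating-point type.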
Easy | Technical
67 practiced
Explain how SQL represents NULL values and how NULL differs from an empty string or zero. In PostgreSQL (or standard SQL) provide concrete examples: 1) demonstrate that comparisons like col = NULL do not behave as equality and show the correct use of IS NULL / IS NOT NULL; 2) show how aggregate functions (COUNT, SUM, AVG) treat NULLs; 3) show examples using COALESCE and NULLIF to provide defaults or convert sentinel values. Include a small sample table schema and at least three SELECT examples that illustrate the behaviors.
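The three behaviors can be demonstrated with a few queries. This is a small sketch run through Python's built-in sqlite3; the NULL semantics shown are standard SQL and behave the same way in PostgreSQL. The orders table and its rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount NUMERIC);
INSERT INTO orders VALUES
    (1, 'ann', 10),
    (2, '',    0),      -- empty string and zero are real values, not NULL
    (3, NULL,  NULL),   -- NULL means the value is unknown or missing
    (4, 'bob', 20);
""")

# 1) "col = NULL" evaluates to NULL (unknown), so it never matches a row.
#    IS NULL is the correct test.
eq_null = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE customer = NULL").fetchone()[0]
is_null = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE customer IS NULL").fetchone()[0]

# 2) COUNT(*) counts rows; COUNT(col), SUM, and AVG skip NULLs.
#    AVG is 30 / 3, not 30 / 4, because the NULL row is ignored.
aggregates = conn.execute(
    "SELECT COUNT(*), COUNT(amount), SUM(amount), AVG(amount) FROM orders"
).fetchone()

# 3) NULLIF turns the sentinel '' into NULL; COALESCE then supplies a default.
labels = conn.execute("""
    SELECT id, COALESCE(NULLIF(customer, ''), 'unknown')
    FROM orders ORDER BY id
""").fetchall()
```

Here eq_null is 0 while is_null is 1, and the aggregates come back as 4 rows, 3 non-NULL amounts, a sum of 30, and an average of 10.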
Medium | Technical
92 practiced
Explain how to implement row-level data lineage using only SQL constructs: capture source_file, source_row_id, ingestion_batch_id in staging, propagate these fields through transformations, and provide a sample SQL pattern that joins a transformed analytics row back to its originating source rows for root cause analysis. Use example table names: staging.raw_events and analytics.daily_events.
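One way to keep lineage through an aggregation is a bridge table that records which source rows fed each analytics row. The sketch below runs through Python's built-in sqlite3, using ATTACH so the staging.raw_events and analytics.daily_events names from the question work as written. The bridge table analytics.daily_events_lineage, the column layout, and the trace helper are assumptions, not part of the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Attached in-memory databases play the role of the two schemas.
conn.execute("ATTACH DATABASE ':memory:' AS staging")
conn.execute("ATTACH DATABASE ':memory:' AS analytics")

conn.executescript("""
CREATE TABLE staging.raw_events (
    source_file TEXT, source_row_id INTEGER, ingestion_batch_id TEXT,
    event_ts TEXT, event_type TEXT,
    PRIMARY KEY (source_file, source_row_id)
);
CREATE TABLE analytics.daily_events (
    event_date TEXT, event_type TEXT, event_count INTEGER,
    PRIMARY KEY (event_date, event_type)
);
-- Hypothetical bridge table: one row per (analytics row, source row) pair.
CREATE TABLE analytics.daily_events_lineage (
    event_date TEXT, event_type TEXT,
    source_file TEXT, source_row_id INTEGER, ingestion_batch_id TEXT
);

INSERT INTO staging.raw_events VALUES
    ('a.csv', 1, 'b1', '2024-05-01 10:00:00', 'click'),
    ('a.csv', 2, 'b1', '2024-05-01 11:00:00', 'view'),
    ('b.csv', 1, 'b2', '2024-05-01 12:00:00', 'click'),
    ('b.csv', 2, 'b2', '2024-05-02 09:00:00', 'click');

-- The transformation and its lineage are written from the same source query.
INSERT INTO analytics.daily_events
SELECT DATE(event_ts), event_type, COUNT(*)
FROM staging.raw_events
GROUP BY 1, 2;

INSERT INTO analytics.daily_events_lineage
SELECT DATE(event_ts), event_type, source_file, source_row_id, ingestion_batch_id
FROM staging.raw_events;
""")


def trace(conn, event_date, event_type):
    """Return the staging rows behind one analytics.daily_events row."""
    return conn.execute("""
        SELECT r.source_file, r.source_row_id, r.ingestion_batch_id
        FROM analytics.daily_events d
        JOIN analytics.daily_events_lineage l
          ON l.event_date = d.event_date AND l.event_type = d.event_type
        JOIN staging.raw_events r
          ON r.source_file = l.source_file AND r.source_row_id = l.source_row_id
        WHERE d.event_date = ? AND d.event_type = ?
        ORDER BY r.source_file, r.source_row_id
    """, (event_date, event_type)).fetchall()


sources = trace(conn, "2024-05-01", "click")
```

A bridge table costs extra storage but keeps the analytics table narrow. For one-to-one transformations, simply carrying source_file, source_row_id, and ingestion_batch_id as columns on the output row is enough.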