InterviewStack.io

Data Quality and Edge Case Handling Questions

Practical skills and best practices for recognizing, preventing, and resolving real-world data quality problems and edge cases in queries, analyses, and production data pipelines. Core areas include handling missing and NULL values, empty and single-row result sets, duplicate records and deduplication strategies, outliers and distributional assumptions, data type mismatches and inconsistent formatting, canonicalization and normalization of identifiers and addresses, time zone and daylight saving time handling, NULL propagation in joins, and guarding against division by zero and other runtime anomalies.

It also covers merging partial or inconsistent records from multiple sources, attribution and aggregation edge cases, GROUP BY and window function corner cases, performance and correctness trade-offs at scale, designing robust queries and pipeline validations, implementing sanity checks and test datasets, and documenting data limitations and assumptions.

At senior levels this expands to proactively designing automated data quality checks, monitoring and alerting for anomalies, defining remediation workflows, communicating trade-offs to stakeholders, and balancing engineering effort against business risk.
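One of the simplest guards mentioned above, division by zero, can be handled in ANSI SQL with NULLIF. A minimal sketch run through Python's sqlite3; the table and column names are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily (day TEXT, clicks INTEGER, impressions INTEGER)")
conn.executemany("INSERT INTO daily VALUES (?, ?, ?)",
                 [("2024-01-01", 30, 600), ("2024-01-02", 5, 0)])

# NULLIF(impressions, 0) turns a zero denominator into NULL, so the
# division yields NULL (None in Python) instead of raising an error.
rows = conn.execute("""
    SELECT day, 1.0 * clicks / NULLIF(impressions, 0) AS ctr
    FROM daily
    ORDER BY day
""").fetchall()
print(rows)  # [('2024-01-01', 0.05), ('2024-01-02', None)]
```

The same NULLIF idiom works in PostgreSQL and most other SQL dialects; downstream aggregates such as AVG then skip the NULL rather than blowing up.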

Hard · Technical
Hard leadership: Create an ownership and remediation workflow for data quality alerts. Define roles (analyst, data engineer, product owner), SLAs for triage and remediation, rollback procedures for dashboards, and how to communicate impact and risk to executive stakeholders. Include example runbooks for a common data-quality alert (e.g., 50% drop in daily active users).
Medium · Technical
Explain why pushing restrictive filters early (pre-join) can improve performance but sometimes change results in SQL. Given sample tables and a problematic query, rewrite it to keep correctness while minimizing row shuffles on a distributed system. Discuss optimizer considerations.
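One acceptable starting point for this question is the classic LEFT JOIN pitfall: moving a predicate on the right-hand table from the ON clause into the WHERE clause silently converts the outer join into an inner join. A minimal sketch via Python's sqlite3, with invented schema and data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users(id INTEGER, name TEXT);
    CREATE TABLE orders(user_id INTEGER, status TEXT);
    INSERT INTO users VALUES (1, 'ana'), (2, 'bo');
    INSERT INTO orders VALUES (1, 'paid');
""")

# Predicate in the ON clause: 'bo' survives with a NULL order,
# preserving true LEFT JOIN semantics.
on_clause = conn.execute("""
    SELECT u.name, o.status FROM users u
    LEFT JOIN orders o ON o.user_id = u.id AND o.status = 'paid'
""").fetchall()

# Same predicate in WHERE: rows where o.status is NULL are discarded,
# so the query silently behaves like an INNER JOIN.
where_clause = conn.execute("""
    SELECT u.name, o.status FROM users u
    LEFT JOIN orders o ON o.user_id = u.id
    WHERE o.status = 'paid'
""").fetchall()

print(on_clause)     # [('ana', 'paid'), ('bo', None)]
print(where_clause)  # [('ana', 'paid')]
```

On a distributed engine the ON-clause version still lets the optimizer push the `status = 'paid'` filter into the scan of `orders` before the shuffle, so correctness does not have to cost performance.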
Medium · Technical
Two sources provide overlapping customer profiles. Source A has (customer_id, email, phone, updated_at), Source B has (external_id, email, phone, address, last_seen). Write SQL consolidation logic that prioritizes non-null fields from the most recently updated source and falls back to the other source when fields are missing. Show sample rows and the expected merged output.
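A minimal sketch of one consolidation approach, run through Python's sqlite3. It assumes, purely for illustration, that customer_id and external_id share a key space and that the timestamp strings compare lexicographically; the sample rows are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source_a(customer_id INTEGER, email TEXT, phone TEXT, updated_at TEXT);
    CREATE TABLE source_b(external_id INTEGER, email TEXT, phone TEXT, address TEXT, last_seen TEXT);
    INSERT INTO source_a VALUES (1, 'a@x.com', NULL, '2024-03-01');
    INSERT INTO source_b VALUES (1, NULL, '555-0100', '1 Main St', '2024-02-01');
""")

# Source A is newer here, so its non-null fields win and Source B
# fills the gaps via COALESCE; the CASE flips priority when B is newer.
merged = conn.execute("""
    SELECT a.customer_id,
           CASE WHEN a.updated_at >= b.last_seen
                THEN COALESCE(a.email, b.email)
                ELSE COALESCE(b.email, a.email) END AS email,
           CASE WHEN a.updated_at >= b.last_seen
                THEN COALESCE(a.phone, b.phone)
                ELSE COALESCE(b.phone, a.phone) END AS phone,
           b.address
    FROM source_a a
    JOIN source_b b ON b.external_id = a.customer_id
""").fetchall()
print(merged)  # [(1, 'a@x.com', '555-0100', '1 Main St')]
```

A fuller answer would also handle customers present in only one source (FULL OUTER JOIN or a UNION of anti-joins) and records with NULL timestamps.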
Medium · System Design
Design an automated set of daily data quality checks for a key metrics ETL job (e.g., daily active users, new signups, revenue). Include checks for schema changes, row-count anomalies, null rate thresholds, cardinality changes, and value-range checks. Describe where to store results, how to alert, and how to triage false positives.
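Two of the checks named above, row-count minimums and null-rate thresholds, can be sketched in a few lines. This is an illustrative skeleton, not a production framework: the table, column, and threshold values are invented, and a real pipeline would persist results and wire failures into alerting.

```python
import sqlite3

def run_daily_checks(conn, table, column, min_rows, max_null_rate):
    """Return (check_name, passed) pairs for two simple daily checks:
    a minimum row count and a null-rate ceiling on one column."""
    results = []
    (n,) = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
    results.append(("row_count_min", n >= min_rows))
    (nulls,) = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE {column} IS NULL").fetchone()
    null_rate = nulls / n if n else 1.0  # an empty table counts as fully null
    results.append(("null_rate", null_rate <= max_null_rate))
    return results

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE signups(user_id INTEGER, email TEXT)")
conn.executemany("INSERT INTO signups VALUES (?, ?)",
                 [(1, "a@x.com"), (2, None), (3, "c@x.com"), (4, "d@x.com")])

checks = run_daily_checks(conn, "signups", column="email",
                          min_rows=3, max_null_rate=0.5)
print(checks)  # [('row_count_min', True), ('null_rate', True)]
```

Schema-change, cardinality, and value-range checks follow the same pattern: a query producing a scalar, compared against a stored expectation or a rolling historical baseline.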
Easy · Technical
Given a table events(event_id, user_id, event_ts, event_type) with possible duplicate ingestion rows, write a SQL query (ANSI SQL / PostgreSQL) to identify duplicate events and then produce a deduplicated table using ROW_NUMBER() so you keep the earliest event based on event_ts. Include schema and sample data in your explanation.
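A minimal sketch of one acceptable answer, run through Python's sqlite3 (window functions require SQLite 3.25+); the sample data is invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events(event_id INTEGER, user_id INTEGER,
                        event_ts TEXT, event_type TEXT);
    INSERT INTO events VALUES
      (1, 10, '2024-01-01 09:00', 'click'),
      (1, 10, '2024-01-01 09:05', 'click'),  -- duplicate ingestion of event 1
      (2, 11, '2024-01-01 10:00', 'view');
""")

# Number each event's copies by event_ts within its event_id partition,
# then keep only the earliest copy (rn = 1).
deduped = conn.execute("""
    SELECT event_id, user_id, event_ts, event_type
    FROM (
        SELECT e.*,
               ROW_NUMBER() OVER (PARTITION BY event_id
                                  ORDER BY event_ts) AS rn
        FROM events e
    )
    WHERE rn = 1
    ORDER BY event_id
""").fetchall()
print(deduped)
# [(1, 10, '2024-01-01 09:00', 'click'), (2, 11, '2024-01-01 10:00', 'view')]
```

The same query runs unchanged on PostgreSQL; in production you would typically write the result into a new table (CREATE TABLE ... AS) rather than just selecting it.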
