InterviewStack.io

Data Quality and Edge Case Handling Questions

Practical skills and best practices for recognizing, preventing, and resolving real-world data quality problems and edge cases in queries, analyses, and production data pipelines. Core areas include handling missing and null values, empty and single-row result sets, duplicate records and deduplication strategies, outliers and distributional assumptions, data type mismatches and inconsistent formatting, canonicalization and normalization of identifiers and addresses, time zone and daylight saving time handling, null propagation in joins, and guarding against division by zero and other runtime anomalies. It also covers merging partial or inconsistent records from multiple sources, attribution and aggregation edge cases, GROUP BY and window function corner cases, performance and correctness trade-offs at scale, designing robust queries and pipeline validations, implementing sanity checks and test datasets, and documenting data limitations and assumptions. At senior levels this expands to proactively designing automated data quality checks, monitoring and alerting for anomalies, defining remediation workflows, communicating trade-offs to stakeholders, and balancing engineering effort against business risk.

Hard · Technical
93 practiced
You're responsible for defining SLAs and error budgets for dataset freshness and accuracy across the analytics platform. Propose measurable SLOs for freshness, completeness, and accuracy, describe how to compute an error budget, and explain how teams should act when error budgets are exhausted.
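One way to make the error-budget part concrete: an error budget is the allowed amount of SLO violation in a window, `(1 - target) × window`. The sketch below is an illustrative calculation only (the function names and the 30-day window are assumptions, not a prescribed method); teams would typically freeze risky changes and prioritize reliability work once `budget_remaining` goes negative.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of allowed SLO violation in a rolling window.

    Example: a 99.5% freshness SLO over 30 days permits
    (1 - 0.995) * 30 * 24 * 60 ~= 216 minutes of staleness.
    """
    return (1.0 - slo_target) * window_days * 24 * 60


def budget_remaining(slo_target: float, violation_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent; negative means exhausted."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - violation_minutes) / budget
```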
Medium · Technical
82 practiced
A data provider sometimes sends a single 'summary' row with totals instead of full event rows. Your downstream ETL blindly unions files and aggregates, double-counting metrics. Propose a detection and ingestion strategy to identify and exclude these summary rows automatically while preserving true data rows.
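A minimal sketch of one detection heuristic (the function, field names, and tolerance are assumptions for illustration): treat a row as a suspected provider-added total if it lacks an event-level identifier and its metric equals the sum of all other rows in the same file. A production pipeline would likely combine several such signals and quarantine, rather than silently drop, flagged rows.

```python
def split_summary_rows(rows, id_field="event_id", value_field="amount",
                       tol=1e-9):
    """Partition one file's rows into (data_rows, suspected_summary_rows).

    Heuristic: a row with a missing/blank identifier whose metric equals
    the sum of every other row's metric is treated as a summary total.
    """
    data, summary = [], []
    total = sum(r[value_field] for r in rows)
    for r in rows:
        sum_of_others = total - r[value_field]
        if not r.get(id_field) and abs(r[value_field] - sum_of_others) <= tol:
            summary.append(r)
        else:
            data.append(r)
    return data, summary
```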
Medium · Technical
72 practiced
Implement a SQL pattern to detect duplicate transactions where duplicates are defined as same customer_id, amount, and timestamp within a one-minute window. Provide a query using standard SQL that flags duplicates and keeps the earliest insertion_id as canonical.
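One possible shape of the answer, sketched in SQLite via Python's `sqlite3` (the table layout and the use of unix-second `ts` for the question's timestamp are assumptions): `LAG` over a partition keyed by `customer_id, amount`, ordered by time then `insertion_id`, flags any row arriving within 60 seconds of the previous one, so the earliest row in each cluster stays unflagged (canonical).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE transactions (insertion_id INTEGER, customer_id INTEGER,
                           amount REAL, ts INTEGER);  -- ts = unix seconds
INSERT INTO transactions VALUES
  (1, 100, 9.99, 1000),   -- earliest: canonical
  (2, 100, 9.99, 1030),   -- 30 s later, same customer/amount: duplicate
  (3, 100, 9.99, 2000),   -- outside the 1-minute window: not a duplicate
  (4, 200, 5.00, 1000);   -- different customer
""")

rows = conn.execute("""
SELECT insertion_id,
       CASE WHEN ts - LAG(ts) OVER (
                        PARTITION BY customer_id, amount
                        ORDER BY ts, insertion_id) <= 60
            THEN 1 ELSE 0 END AS is_duplicate   -- LAG is NULL on row 1 -> 0
FROM transactions
ORDER BY insertion_id
""").fetchall()
```

A follow-up worth raising in an interview: this flags each row relative to its immediate predecessor, so long chains of closely spaced rows are all flagged against the chain's first row's neighbor, which may or may not match the intended one-minute semantics.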
Easy · Technical
72 practiced
Explain how window functions behave when partitions contain zero or one row. For example, what do ROW_NUMBER(), LAG(), and AVG() windowed over a partition return in these small partitions? How would you guard analytical logic that expects at least two rows per partition?
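A small demonstration using SQLite via Python's `sqlite3` (the `scores` table is an assumed example): in a one-row partition, `ROW_NUMBER()` returns 1, `LAG()` returns NULL, and `AVG()` returns the single value itself; a zero-row partition simply produces no rows. A common guard is `COUNT(*) OVER (PARTITION BY ...)`, which lets downstream logic filter out partitions smaller than two.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE scores (grp TEXT, val REAL);
INSERT INTO scores VALUES ('a', 10.0);            -- single-row partition
INSERT INTO scores VALUES ('b', 1.0), ('b', 3.0); -- two-row partition
""")

rows = conn.execute("""
SELECT grp,
       ROW_NUMBER() OVER w AS rn,         -- 1 even in a 1-row partition
       LAG(val) OVER w     AS prev_val,   -- NULL when there is no prior row
       AVG(val) OVER (PARTITION BY grp) AS grp_avg,   -- the value itself
       COUNT(*) OVER (PARTITION BY grp) AS grp_size   -- guard: require >= 2
FROM scores
WINDOW w AS (PARTITION BY grp ORDER BY val)
ORDER BY grp, val
""").fetchall()
```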
Hard · Technical
84 practiced
Implement a SQL pattern to compute a conversion rate defined as conversions / exposures per day, guarding against division by zero, users with no exposures, and time zone boundaries. Assume tables exposures(user_id, exposure_time) and conversions(user_id, conversion_time). Use standard SQL and include behaviors for users with conversions but no prior exposures.
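One possible sketch, again in SQLite via Python's `sqlite3` (the sample data is assumed): `NULLIF` turns a zero-exposure denominator into NULL instead of a division error, and driving the query from the union of all days seen in either table keeps days that have conversions but no exposures visible (they surface with a NULL rate rather than being silently dropped). This sketch buckets by calendar day with `date()` on timestamps assumed to be stored in UTC; a real pipeline would convert to the reporting time zone before bucketing, and DST transitions make that conversion day-length-dependent.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE exposures  (user_id INTEGER, exposure_time TEXT);    -- UTC
CREATE TABLE conversions(user_id INTEGER, conversion_time TEXT);  -- UTC
INSERT INTO exposures VALUES (1, '2024-01-01 10:00:00'),
                             (2, '2024-01-01 11:00:00');
INSERT INTO conversions VALUES (1, '2024-01-01 12:00:00'),
                               (3, '2024-01-02 09:00:00');  -- no exposure
""")

rows = conn.execute("""
WITH days AS (SELECT date(exposure_time) AS day FROM exposures
              UNION
              SELECT date(conversion_time) FROM conversions),
e AS (SELECT date(exposure_time) AS day, COUNT(*) AS n
      FROM exposures GROUP BY 1),
c AS (SELECT date(conversion_time) AS day, COUNT(*) AS n
      FROM conversions GROUP BY 1)
SELECT d.day,
       -- NULLIF yields NULL on zero-exposure days instead of an error
       COALESCE(c.n, 0) * 1.0 / NULLIF(COALESCE(e.n, 0), 0) AS conv_rate
FROM days d
LEFT JOIN e ON e.day = d.day
LEFT JOIN c ON c.day = d.day
ORDER BY d.day
""").fetchall()
```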
