InterviewStack.io

Data Quality and Edge Case Handling Questions

Practical skills and best practices for recognizing, preventing, and resolving real-world data quality problems and edge cases in queries, analyses, and production data pipelines. Core areas include handling missing and null values, empty and single-row result sets, duplicate records and deduplication strategies, outliers and distributional assumptions, data type mismatches and inconsistent formatting, canonicalization and normalization of identifiers and addresses, time zone and daylight saving time handling, null propagation in joins, and guarding against division by zero and other runtime anomalies. It also covers merging partial or inconsistent records from multiple sources, attribution and aggregation edge cases, GROUP BY and window function corner cases, performance and correctness trade-offs at scale, designing robust queries and pipeline validations, implementing sanity checks and test datasets, and documenting data limitations and assumptions. At senior levels this expands to proactively designing automated data quality checks, monitoring and alerting for anomalies, defining remediation workflows, communicating trade-offs to stakeholders, and balancing engineering effort against business risk.

Hard · Technical
Explain options for computing daily 95th percentile latency over 10 billion rows per month: exact SQL sorts, approximate algorithms (t-digest, reservoir sampling), and database-specific approximate quantiles. For each, describe accuracy vs cost trade-offs and how to ensure audited reproducibility when finance or compliance depends on the result.
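One of the approximate options from this question can be sketched with reservoir sampling: keep a fixed-size uniform sample of the stream, then take the nearest-rank percentile of the sample. The fixed RNG seed below is an illustrative choice showing one way to make the approximation reproducible for audit; names and sizes are assumptions, not a definitive design.

```python
import random

def reservoir_sample(stream, k, seed=42):
    """Keep a uniform random sample of size k from a stream of unknown length.
    A fixed seed makes the retained sample, and hence the estimate, reproducible."""
    rng = random.Random(seed)
    sample = []
    for i, x in enumerate(stream):
        if i < k:
            sample.append(x)
        else:
            # Element i survives with probability k / (i + 1), replacing a random slot.
            j = rng.randint(0, i)
            if j < k:
                sample[j] = x
    return sample

def approx_percentile(sample, p):
    """Nearest-rank percentile over the retained sample."""
    s = sorted(sample)
    idx = min(len(s) - 1, int(p / 100.0 * len(s)))
    return s[idx]
```

Accuracy scales with the sample size k, not the input size, which is the core cost trade-off versus an exact sort; for auditability the seed, k, and sampling code version would all need to be recorded alongside the result.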
Hard · Technical
Design an entity resolution algorithm (pseudocode or Python) to merge records across sources with conflicting or missing identifiers. Include blocking, pairwise similarity scoring (name, email, phone), thresholding, and clustering. Discuss precision/recall trade-offs and how to evaluate the system without a complete gold set.
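A minimal sketch of the pipeline this question asks for, with illustrative choices throughout (a single name-prefix blocking key, `difflib` similarity, a 0.85 threshold, union-find clustering); a production system would use multiple redundant blocking keys and tuned per-field comparators.

```python
from collections import defaultdict
from difflib import SequenceMatcher

def block_key(rec):
    # Blocking: only records sharing this cheap key are compared pairwise.
    return (rec.get("name") or "")[:3].lower()

def similarity(a, b):
    # Weighted score over name, email, phone; fields missing on either side
    # drop out of both numerator and denominator instead of counting as 0.
    score, weight = 0.0, 0.0
    for field, w in (("name", 0.5), ("email", 0.3), ("phone", 0.2)):
        x, y = a.get(field), b.get(field)
        if x and y:
            score += w * SequenceMatcher(None, x.lower(), y.lower()).ratio()
            weight += w
    return score / weight if weight else 0.0

def resolve(records, threshold=0.85):
    """Cluster record indices via union-find over above-threshold pairs."""
    parent = list(range(len(records)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i
    blocks = defaultdict(list)
    for i, r in enumerate(records):
        blocks[block_key(r)].append(i)
    for idxs in blocks.values():
        for a in range(len(idxs)):
            for b in range(a + 1, len(idxs)):
                i, j = idxs[a], idxs[b]
                if similarity(records[i], records[j]) >= threshold:
                    parent[find(i)] = find(j)
    clusters = defaultdict(list)
    for i in range(len(records)):
        clusters[find(i)].append(i)
    return list(clusters.values())
```

Raising the threshold trades recall for precision; without a complete gold set, evaluation typically relies on labeling a stratified sample of candidate pairs near the threshold.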
Medium · Technical
You find customer IDs are inconsistent: some sources include leading zeros, some strip them, and some use numeric vs string types. Propose a normalization and migration strategy that creates canonical IDs without breaking downstream consumers, including detection methods for affected datasets and a rollback plan.
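The detection step of this question can be sketched as follows. It assumes leading zeros are not significant, which is exactly the assumption a migration would have to verify per source; the collision detector surfaces the cases where that assumption might be wrong.

```python
from collections import defaultdict

def canonical_id(raw):
    """Map an int or digit-string customer ID to one canonical string form.
    Assumes leading zeros carry no meaning (must be verified before migrating)."""
    s = str(raw).strip()
    if not s.isdigit():
        raise ValueError(f"unexpected ID format: {raw!r}")
    return s.lstrip("0") or "0"

def detect_collisions(ids):
    """Group raw IDs that map to the same canonical form. Any group with more
    than one distinct raw spelling is a candidate for manual review, since a
    leading zero may have been significant in some source."""
    groups = defaultdict(set)
    for raw in ids:
        groups[canonical_id(raw)].add(str(raw).strip())
    return {k: v for k, v in groups.items() if len(v) > 1}
```

Running the detector across all sources before migrating quantifies the blast radius; keeping the raw ID alongside the canonical one in the migrated schema is one way to preserve a rollback path.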
Medium · Technical
Design a test dataset and schema to validate address canonicalization across international formats. Include sample rows that cover edge cases (multi-line addresses, special characters, missing components, non-Latin scripts) and explain why each case is important for unit and integration tests.
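A hypothetical fixture for this question might look like the sketch below: each row names the edge case it exercises, paired with a deliberately toy canonicalizer (whitespace collapsing only) to show the test shape. All addresses and field names here are illustrative examples, not a recommended schema.

```python
# Each row documents which edge case it covers and why it matters in tests.
ADDRESS_CASES = [
    {"case": "multi-line",          "raw": "1 Main St\nApt 4\nSpringfield, IL 62704", "country": "US"},
    {"case": "special characters",  "raw": "Stra\u00dfe des 17. Juni 135, Berlin",    "country": "DE"},
    {"case": "missing postal code", "raw": "10 Downing Street, London",               "country": "GB"},
    {"case": "non-Latin script",    "raw": "\u6771\u4eac\u90fd\u5343\u4ee3\u7530\u533a\u5343\u4ee3\u7530 1-1", "country": "JP"},
    {"case": "empty string",        "raw": "",                                        "country": "US"},
]

def canonicalize(raw):
    """Toy canonicalizer for illustration: collapse newlines and runs of
    whitespace, trim the ends. A real implementation would parse components
    per-country; the tests below only check it never mangles or drops text."""
    return " ".join(raw.split())
```

The non-Latin and special-character rows guard against encoding bugs, the multi-line row against naive newline handling, and the empty row against crashes on missing input, so each case pins down a distinct failure mode.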
Easy · Technical
You have a SQL aggregation that sometimes returns an empty result set. Describe how downstream Python code that consumes the result should be defensive to avoid crashes or incorrect calculations. Provide a short Python example that consumes a query result and applies sensible defaults.
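One possible answer shape for this question, assuming the query result arrives as a list of `(day, revenue)` tuples (an illustrative schema): return an explicit zero-row summary rather than crashing on `sum()` of nothing or dividing by zero.

```python
def summarize(rows):
    """Defensively consume a query result. An empty result set yields an
    explicit zero-row summary; avg is None because the mean of no rows is
    undefined, and silently reporting 0 would be misleading downstream."""
    if not rows:
        return {"n_days": 0, "total": 0.0, "avg": None}
    total = sum(revenue for _day, revenue in rows)
    return {"n_days": len(rows), "total": total, "avg": total / len(rows)}
```

Downstream code can then branch on `n_days == 0` (or `avg is None`) instead of discovering the empty case via a `ZeroDivisionError` in production.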
