InterviewStack.io

Data Quality and Edge Case Handling Questions

Practical skills and best practices for recognizing, preventing, and resolving real-world data quality problems and edge cases in queries, analyses, and production data pipelines. Core areas include:

- handling missing and null values, null propagation in joins, and guarding against division by zero and other runtime anomalies
- empty and single-row result sets, duplicate records, and deduplication strategies
- outliers and distributional assumptions
- data type mismatches, inconsistent formatting, and canonicalization and normalization of identifiers and addresses
- time zone and daylight saving time handling
- merging partial or inconsistent records from multiple sources, attribution and aggregation edge cases, and GROUP BY and window-function corner cases
- performance and correctness trade-offs at scale, designing robust queries and pipeline validations, implementing sanity checks and test datasets, and documenting data limitations and assumptions

At senior levels this expands to proactively designing automated data quality checks, monitoring and alerting for anomalies, defining remediation workflows, communicating trade-offs to stakeholders, and balancing engineering effort against business risk.

Hard · System Design
67 practiced
Design an anomaly detection and alerting system for BI metrics using statistical and ML models. Cover model choices (seasonal decomposition, STL, Prophet, isolation forest), how to deploy and schedule models, threshold tuning, how to surface alerts to dashboards and on-call, human-in-the-loop feedback for false positives, and how to handle model drift. Provide an operational plan for tuning and lifecycle management.
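A useful warm-up for this question is the simplest possible baseline: flag a metric when it deviates sharply from its trailing window. The sketch below is a toy rolling z-score detector (not STL or Prophet, which would model seasonality on top of this idea); the window and threshold values are illustrative assumptions to be tuned against labeled incidents.

```python
from collections import deque

def rolling_zscore_alerts(series, window=7, threshold=3.0):
    """Return indices of points whose z-score against the trailing
    `window` observations exceeds `threshold`.

    Deliberately minimal: no seasonality handling, no trend removal.
    Production systems would run this on seasonally-adjusted residuals.
    """
    history = deque(maxlen=window)
    alerts = []
    for i, value in enumerate(series):
        if len(history) == window:
            mean = sum(history) / window
            var = sum((x - mean) ** 2 for x in history) / window
            std = var ** 0.5
            # std > 0 guards the degenerate flat-history case
            if std > 0 and abs(value - mean) / std > threshold:
                alerts.append(i)
        history.append(value)
    return alerts

# A stable weekly metric followed by a collapse: only the final point alerts.
print(rolling_zscore_alerts([100, 101, 99, 100, 102, 98, 100, 101, 99, 100, 40]))
```

A human-in-the-loop workflow would feed each confirmed false positive back into threshold tuning, which is where the operational plan the question asks for begins.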
Hard · System Design
83 practiced
Design a scalable deduplication pipeline for probabilistic matching across 50M customer records nightly. Include blocking strategies, feature engineering for similarity (name, email, phone, address), scoring model choice (logistic regression, learned embeddings), threshold selection, active learning/labeling process, and monitoring for precision/recall drift. Discuss latency and cost trade-offs.
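The core of any answer here is blocking plus pairwise scoring. The sketch below illustrates the shape of that pipeline with deliberately toy choices: a surname-prefix/email-domain blocking key and a string-similarity score in place of the per-field features and learned model a real 50M-record system would need.

```python
import re
from difflib import SequenceMatcher

def blocking_key(record):
    """Coarse blocking: first 3 letters of the surname plus the first
    character of the email domain. Only records sharing a key are
    compared pairwise, turning one O(n^2) problem into many small ones.
    These key choices are illustrative, not tuned."""
    surname = re.sub(r"[^a-z]", "", record["name"].split()[-1].lower())
    domain = record["email"].split("@")[-1].lower()
    return (surname[:3], domain[:1])

def pair_score(a, b):
    """Toy similarity: average of name and email edit-similarity.
    A production scorer would use per-field features (phone, address)
    and a trained model with a threshold picked from labeled pairs."""
    name_sim = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    email_sim = SequenceMatcher(None, a["email"].lower(), b["email"].lower()).ratio()
    return (name_sim + email_sim) / 2

def candidate_matches(records, threshold=0.8):
    blocks = {}
    for r in records:
        blocks.setdefault(blocking_key(r), []).append(r)
    matches = []
    for group in blocks.values():
        for i in range(len(group)):
            for j in range(i + 1, len(group)):
                if pair_score(group[i], group[j]) >= threshold:
                    matches.append((group[i]["id"], group[j]["id"]))
    return matches
```

Note the precision/recall trade-off lives in two places: blocking (recall lost when true dupes land in different blocks) and the threshold (precision lost when it is too low), which is why both need ongoing drift monitoring.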
Hard · Technical
79 practiced
A nightly dashboard shows a sudden 40% drop in daily active users. Describe a step-by-step investigation you would perform as the BI analyst: what SQL checks you'd run, how you'd inspect upstream ingestion logs, how to check data lineage and materialized views, how to test for delayed pipelines or missing partitions, and how to communicate initial findings and expected impact to stakeholders.
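The very first SQL check in this investigation is usually "did yesterday's partition land, and is its row count in line with recent days?" A minimal sketch of that check, using SQLite in-memory with a hypothetical `events` table and an illustrative 70%-of-baseline alert cutoff:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, event_date TEXT)")

# Three healthy days of 100 distinct users, then a suspicious drop to 55.
rows = [(u, d) for d in ("2024-06-01", "2024-06-02", "2024-06-03")
        for u in range(100)]
rows += [(u, "2024-06-04") for u in range(55)]
conn.executemany("INSERT INTO events VALUES (?, ?)", rows)

daily = conn.execute("""
    SELECT event_date, COUNT(DISTINCT user_id) AS dau
    FROM events
    GROUP BY event_date
    ORDER BY event_date
""").fetchall()

# Compare the latest day against the trailing average before escalating.
baseline = sum(d for _, d in daily[:-1]) / len(daily[:-1])
latest_date, latest_dau = daily[-1]
if latest_dau < 0.7 * baseline:
    print(f"{latest_date}: DAU {latest_dau} vs baseline {baseline:.0f} -- check ingestion")
```

If the row count itself is healthy, the investigation moves upstream: ingestion logs, lineage, materialized-view refresh times, and late or missing partitions, in roughly that order.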
Medium · Technical
76 practiced
Given tables:
events(user_id INT, event_type VARCHAR, occurred_at TIMESTAMP)
and users(user_id INT), write an ANSI SQL query to compute daily conversion_rate = conversions / exposures for the last 30 days, where exposures = count of users shown a promo and conversions = count of users who clicked. Ensure the query handles nulls, prevents division-by-zero, and emits 0.00% for days with zero exposures rather than NULL. Explain any assumptions.
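The division-by-zero guard this question targets is the `COALESCE(x / NULLIF(y, 0), 0)` pattern: `NULLIF` turns a zero denominator into NULL, and `COALESCE` maps the resulting NULL rate back to 0. A runnable sketch using SQLite with assumed event type names (`promo_shown`, `promo_click`); note it only emits days that appear in `events` at all, where a full answer would join against a 30-day calendar table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (user_id INTEGER, event_type TEXT, occurred_at TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", [
    (1, "promo_shown", "2024-06-01 10:00:00"),
    (2, "promo_shown", "2024-06-01 11:00:00"),
    (1, "promo_click", "2024-06-01 10:05:00"),
    (3, "promo_click", "2024-06-02 09:00:00"),  # click with zero exposures that day
])

rows = conn.execute("""
    SELECT DATE(occurred_at) AS day,
           COALESCE(
               1.0 * COUNT(DISTINCT CASE WHEN event_type = 'promo_click'
                                         THEN user_id END)
               / NULLIF(COUNT(DISTINCT CASE WHEN event_type = 'promo_shown'
                                            THEN user_id END), 0),
               0.0
           ) AS conversion_rate
    FROM events
    GROUP BY DATE(occurred_at)
    ORDER BY day
""").fetchall()
print(rows)
```

The `CASE` expressions with no `ELSE` yield NULL for non-matching rows, which `COUNT(DISTINCT ...)` ignores, so nulls never inflate either count; the `1.0 *` factor forces real (not integer) division.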
Hard · Technical
92 practiced
Design a schema and algorithm to monitor for schema drift and distributional changes in real time. Include how to store baseline distributions (histograms, quantiles), which statistical tests or metrics to compute (KL divergence, population percentiles, cardinality delta), sampling strategies for large tables, and how to set adaptive thresholds or alerts.
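One building block for the distributional-change half of this question is comparing a stored baseline histogram against today's histogram with KL divergence. A minimal sketch, with illustrative bin counts and an add-one smoothing choice so the divergence stays finite when a bin empties out:

```python
import math
from collections import Counter

def histogram(values, bins, lo, hi):
    """Fixed-bin histogram as probabilities, with add-one smoothing per
    bin so empty bins never produce a zero (and KL stays finite)."""
    counts = Counter(min(bins - 1, max(0, int((v - lo) / (hi - lo) * bins)))
                     for v in values)
    total = len(values) + bins
    return [(counts.get(b, 0) + 1) / total for b in range(bins)]

def kl_divergence(p, q):
    """D_KL(p || q): how poorly the new distribution q explains the
    stored baseline p. Zero means identical; larger means more drift."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Baseline uniform over 0..9; an identical re-sample scores ~0, while a
# column whose values collapse onto half the range scores far higher.
baseline = histogram([i % 10 for i in range(1000)], bins=10, lo=0, hi=10)
same = histogram([i % 10 for i in range(1000)], bins=10, lo=0, hi=10)
shifted = histogram([i % 5 for i in range(1000)], bins=10, lo=0, hi=10)
```

In a full design, the baseline histograms and quantiles would be persisted per column, recomputed on a sampled slice of large tables, and the alert threshold adapted from the historical distribution of the divergence itself.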
