InterviewStack.io

Data Quality and Edge Case Handling Questions

Practical skills and best practices for recognizing, preventing, and resolving real-world data quality problems and edge cases in queries, analyses, and production data pipelines. Core areas include handling missing and null values, empty and single-row result sets, duplicate records and deduplication strategies, outliers and distributional assumptions, data type mismatches and inconsistent formatting, canonicalization and normalization of identifiers and addresses, time zone and daylight saving time handling, null propagation in joins, and guarding against division by zero and other runtime anomalies. It also covers merging partial or inconsistent records from multiple sources, attribution and aggregation edge cases, GROUP BY and window-function corner cases, performance-versus-correctness trade-offs at scale, designing robust queries and pipeline validations, implementing sanity checks and test datasets, and documenting data limitations and assumptions. At senior levels this expands to proactively designing automated data quality checks, monitoring and alerting for anomalies, defining remediation workflows, communicating trade-offs to stakeholders, and balancing engineering effort against business risk.

Hard · System Design
How would you perform record linkage and deduplication while preserving privacy and complying with GDPR/PII constraints? Describe practical approaches such as hashing, salted hashing, Bloom filters, private set intersection, or secure multi-party computation, and propose a pipeline that allows analytics while minimizing exposure of raw PII. Discuss limitations and deletion (right-to-be-forgotten) handling.
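As a concrete illustration of the salted-hashing approach named in the question, here is a minimal Python sketch of keyed pseudonymization. The salt value, field normalization, and function name are illustrative assumptions; a real deployment would keep the key in a KMS and consider Bloom filters or private set intersection for cross-party linkage.

```python
import hashlib
import hmac

def pseudonymize(identifier: str, salt: bytes) -> str:
    """Keyed hash (HMAC-SHA256) of a normalized identifier.
    All parties doing the linkage must share the same salt; deleting or
    rotating the salt renders every pseudonym unlinkable at once, which
    is one coarse-grained way to support right-to-be-forgotten."""
    normalized = identifier.strip().lower()  # canonicalize before hashing
    return hmac.new(salt, normalized.encode(), hashlib.sha256).hexdigest()

salt = b"example-secret-salt"  # hypothetical; never hard-code in practice

# The same person, formatted differently in two sources, yields one token
a = pseudonymize("Alice@Example.com ", salt)
b = pseudonymize("alice@example.com", salt)
assert a == b
```

Note the limitation the question hints at: unsalted hashes of low-entropy PII (emails, phone numbers) are trivially reversible by dictionary attack, which is why the keyed variant is preferred.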
Hard · Technical
A nightly dashboard shows a sudden 40% drop in daily active users. Describe a step-by-step investigation you would perform as the BI analyst: what SQL checks you'd run, how you'd inspect upstream ingestion logs, how to check data lineage and materialized views, how to test for delayed pipelines or missing partitions, and how to communicate initial findings and expected impact to stakeholders.
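One of the SQL checks this investigation would start with, comparing daily distinct-user counts against recent history to spot missing partitions or delayed loads, can be sketched against a toy SQLite table. The schema, dates, and the 70%-of-trailing-average threshold are illustrative assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_date TEXT, user_id INTEGER)")

# Toy data: steady volume for three days, then a day with far fewer rows
rows = []
for day, n in [("2024-01-01", 100), ("2024-01-02", 98),
               ("2024-01-03", 101), ("2024-01-04", 60)]:
    rows += [(day, u) for u in range(n)]
conn.executemany("INSERT INTO events VALUES (?, ?)", rows)

# Check 1: daily distinct-user counts -- the first query to run on a DAU drop
daily = conn.execute(
    "SELECT event_date, COUNT(DISTINCT user_id) FROM events "
    "GROUP BY event_date ORDER BY event_date"
).fetchall()

# Check 2: flag any day whose volume falls below 70% of the trailing average
suspect = []
for i in range(1, len(daily)):
    trailing_avg = sum(count for _, count in daily[:i]) / i
    if daily[i][1] < 0.7 * trailing_avg:
        suspect.append(daily[i][0])
# suspect now holds ["2024-01-04"]
```

In a real investigation the same pattern extends naturally: run the count per upstream source or partition to localize whether the drop is in ingestion, transformation, or the dashboard's materialized view.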
Medium · Technical
A single product shows an extreme revenue spike on one day. Describe a reproducible investigation plan you would execute as a BI analyst: specific SQL checks, segmentation checks (by user, country, referrer), checks against raw events and ingestion logs, how to identify whether the spike is data error vs real, and how you would decide between correction and flagging. Include risk considerations.
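A reproducible first-pass spike check of the kind the question asks for can use a robust baseline, since the mean is dragged up by the very spike being tested. The median multiplier of 5 and the revenue figures below are illustrative assumptions:

```python
from statistics import median

def flag_spikes(series: dict, factor: float = 5.0) -> list:
    """Flag days whose revenue exceeds `factor` x the series median.
    The median is robust to the spike itself, unlike the mean, so the
    baseline is not contaminated by the anomaly under investigation."""
    baseline = median(series.values())
    return [day for day, value in series.items() if value > factor * baseline]

revenue = {"2024-03-01": 120.0, "2024-03-02": 110.0,
           "2024-03-03": 5000.0, "2024-03-04": 130.0}
assert flag_spikes(revenue) == ["2024-03-03"]
```

A flagged day then feeds the segmentation step: rerun the same check per user, country, and referrer to see whether one segment (e.g. a single bulk order or a duplicated ingestion batch) explains the whole spike.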
Medium · System Design
You receive customer records from CRM and e-commerce with no shared stable customer_id. Propose a practical algorithm and pipeline to merge these into a golden table: include deterministic matching keys, probabilistic matching scoring, conflict resolution rules, lineage tracking, and how to measure and surface match quality. Discuss trade-offs between precision and recall.
Easy · Technical
Explain the trade-offs between storing timestamps in UTC versus storing local timestamps with timezone offsets in your warehouse. For a global analytics platform used by a BI team, which strategy would you choose and why? Discuss implications for grouping by day/week, user-local reporting, and daylight saving time handling.
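The daylight-saving pitfall behind this question is easy to demonstrate: when clocks fall back, two distinct UTC hours map to the same local wall-clock hour, so grouping stored local timestamps by day or hour can silently double-count. A minimal sketch using Python's zoneinfo (US DST transition on 2024-11-03; requires Python 3.9+):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # stdlib since Python 3.9

tz = ZoneInfo("America/New_York")

# DST ends 2024-11-03 at 02:00 local: clocks go back from 02:00 to 01:00
utc_a = datetime(2024, 11, 3, 5, 30, tzinfo=timezone.utc)  # 01:30 EDT
utc_b = datetime(2024, 11, 3, 6, 30, tzinfo=timezone.utc)  # 01:30 EST

local_a = utc_a.astimezone(tz)
local_b = utc_b.astimezone(tz)

# Two different UTC instants, one ambiguous local wall-clock time
assert utc_a != utc_b
assert (local_a.hour, local_a.minute) == (1, 30)
assert (local_b.hour, local_b.minute) == (1, 30)
```

This is why the common pattern is to store UTC and convert to the viewer's zone at query time: UTC timestamps stay unambiguous and monotonic, while local-day reporting is derived per time zone as a presentation concern.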
