InterviewStack.io LogoInterviewStack.io

Data Quality and Edge Case Handling Questions

Practical skills and best practices for recognizing, preventing, and resolving real world data quality problems and edge cases in queries, analyses, and production data pipelines. Core areas include handling missing and null values, empty and single row result sets, duplicate records and deduplication strategies, outliers and distributional assumptions, data type mismatches and inconsistent formatting, canonicalization and normalization of identifiers and addresses, time zone and daylight saving time handling, null propagation in joins, and guarding against division by zero and other runtime anomalies. It also covers merging partial or inconsistent records from multiple sources, attribution and aggregation edge cases, group by and window function corner cases, performance and correctness trade offs at scale, designing robust queries and pipeline validations, implementing sanity checks and test datasets, and documenting data limitations and assumptions. At senior levels this expands to proactively designing automated data quality checks, monitoring and alerting for anomalies, defining remediation workflows, communicating trade offs to stakeholders, and balancing engineering effort against business risk.

HardTechnical
63 practiced
A production aggregation intermittently returns NULL for a computed column, but the same query returns values in staging. Describe a systematic debugging approach: how to sample problematic rows, check schema and engine version differences, instrument failing rows to capture context, and implement a long-term fix to avoid recurrence.
MediumSystem Design
85 practiced
Design a deduplication strategy for a high-throughput Kafka event stream where duplicates arrive due to producer retries within a 5-minute window. Explain how to detect duplicates, ensure idempotency at the consumer level, manage state (TTL, storage), and the trade-offs between memory usage, correctness, and throughput.
EasyTechnical
76 practiced
Describe a robust canonicalization strategy for identifiers such as email addresses and phone numbers before joining datasets. Give concrete normalization rules, examples (e.g., Gmail dot handling), and explain common pitfalls that can cause incorrect joins or duplicate identities.
MediumTechnical
75 practiced
You receive mixed country representations like 'US', 'usa', 'United States', and '840'. Propose a normalization pipeline using authoritative datasets (ISO 3166), fuzzy matching for ambiguous inputs, and a fallback policy for missing or invalid country codes. Include a plan for maintaining the mapping table over time.
HardTechnical
82 practiced
Design a validation suite for a mission-critical aggregated metric that pulls from multiple sources (events, logs, third-party attribution). Include unit tests, integration tests, statistical sanity checks, reconciliation jobs, and production monitors that ensure both correctness and timely alerts.

Unlock Full Question Bank

Get access to hundreds of Data Quality and Edge Case Handling interview questions and detailed answers.

Sign in to Continue

Join thousands of developers preparing for their dream job.