InterviewStack.io

Data Quality and Edge Case Handling Questions

Practical skills and best practices for recognizing, preventing, and resolving real-world data quality problems and edge cases in queries, analyses, and production data pipelines.

Core areas include handling missing and null values, empty and single-row result sets, duplicate records and deduplication strategies, outliers and distributional assumptions, data type mismatches and inconsistent formatting, canonicalization and normalization of identifiers and addresses, time zone and daylight saving time handling, null propagation in joins, and guarding against division by zero and other runtime anomalies.

The topic also covers merging partial or inconsistent records from multiple sources, attribution and aggregation edge cases, GROUP BY and window function corner cases, performance-versus-correctness trade-offs at scale, designing robust queries and pipeline validations, implementing sanity checks and test datasets, and documenting data limitations and assumptions.

At senior levels this expands to proactively designing automated data quality checks, monitoring and alerting for anomalies, defining remediation workflows, communicating trade-offs to stakeholders, and balancing engineering effort against business risk.

Hard · Technical
You detect a long-running float rounding bug that caused cumulative revenue to drift over months and impacted billing. Create a remediation plan: quantify total customer impact, design deterministic fixes (store amounts as integers or decimals), reprocess historical data safely, notify stakeholders, and propose monitoring to prevent recurrence.
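A strong answer to the deterministic-fix part usually contrasts binary floats with exact integer-cent arithmetic. A minimal sketch (the `to_cents` helper is a hypothetical name, not from any specific codebase):

```python
from decimal import Decimal, ROUND_HALF_EVEN

def to_cents(amount_str: str) -> int:
    """Parse a decimal string into integer cents -- exact, no binary float."""
    cents = Decimal(amount_str).quantize(Decimal("0.01"), rounding=ROUND_HALF_EVEN)
    return int(cents * 100)

# Integer cents sum exactly; repeated binary-float addition accumulates
# tiny rounding error that can drift over months of aggregation.
cents_total = sum(to_cents("0.10") for _ in range(10_000))  # exactly 100_000
float_total = sum(0.1 for _ in range(10_000))               # drifts from 1000.0
```

Storing amounts as integer cents (or `DECIMAL` columns in the database) makes reprocessing deterministic, which is what lets you re-derive historical totals and quantify the drift per customer.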
Hard · Technical
You need to perform a three-month historical backfill after a schema change that affects production aggregates. Create a step-by-step plan to execute the backfill with minimal downtime, including staging strategies, idempotent writes, validation queries, throttling to limit load, and a rollback plan if the backfill introduces regressions.
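The stage-validate-promote loop at the heart of such a plan can be sketched generically. This is a hypothetical driver with in-memory stores standing in for real tables; the function and parameter names are illustrative only:

```python
import time

def run_backfill(dates, compute_partition, staging, production, validate, pause_s=0.0):
    """Replay one date partition at a time: write to staging, validate,
    then promote.  Writes are keyed by date, so re-running a failed date
    simply overwrites the same partition (idempotent)."""
    for d in dates:
        staging[d] = compute_partition(d)       # idempotent write, keyed by date
        if not validate(d, staging[d]):
            raise RuntimeError(f"validation failed for {d}; stopping before promote")
        production[d] = staging[d]              # promote only validated partitions
        time.sleep(pause_s)                     # throttle between partitions
```

Because each partition is promoted only after validation, a failure leaves production untouched from that date onward, which is the basis of the rollback story: re-run the failed dates, or repoint readers at the pre-backfill snapshot.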
Hard · Technical
Design an entity resolution algorithm (pseudocode or Python) to merge records across sources with conflicting or missing identifiers. Include blocking, pairwise similarity scoring (name, email, phone), thresholding, and clustering. Discuss precision/recall trade-offs and how to evaluate the system without a complete gold set.
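One minimal shape for such an answer, assuming a simple name-prefix blocking key and `difflib` string similarity (a real system would use stronger similarity measures and learned thresholds):

```python
from itertools import combinations
from difflib import SequenceMatcher

def block_key(rec):
    # Compare only records sharing the first 3 letters of the name,
    # cutting the O(n^2) pairwise comparison count.
    return (rec.get("name") or "")[:3].lower()

def similarity(a, b):
    # Weighted score over name/email/phone; missing fields contribute no weight.
    score, weight = 0.0, 0.0
    for field, w in (("name", 0.5), ("email", 0.3), ("phone", 0.2)):
        if a.get(field) and b.get(field):
            score += w * SequenceMatcher(None, a[field].lower(), b[field].lower()).ratio()
            weight += w
    return score / weight if weight else 0.0

def cluster(records, threshold=0.85):
    # Union-find over above-threshold pairs within each block.
    parent = list(range(len(records)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    blocks = {}
    for i, r in enumerate(records):
        blocks.setdefault(block_key(r), []).append(i)
    for idxs in blocks.values():
        for i, j in combinations(idxs, 2):
            if similarity(records[i], records[j]) >= threshold:
                parent[find(i)] = find(j)
    groups = {}
    for i in range(len(records)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())
```

Raising `threshold` trades recall for precision; without a gold set, evaluation typically relies on labeling a sample of matched and unmatched pairs.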
Medium · System Design
Design a deduplication strategy for a high-throughput Kafka event stream where duplicates arrive due to producer retries within a 5-minute window. Explain how to detect duplicates, ensure idempotency at the consumer level, manage state (TTL, storage), and the trade-offs between memory usage, correctness, and throughput.
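The consumer-side state can be sketched as a TTL-bounded seen-set. This is an in-memory illustration (class and method names are hypothetical); a production consumer would typically back this with RocksDB or Redis so the window survives restarts:

```python
import time
from collections import OrderedDict

class TtlDeduplicator:
    """Remembers event IDs for `ttl` seconds and rejects repeats inside
    that window, covering producer-retry duplicates.  Insertion order of
    the OrderedDict doubles as eviction order, so memory stays bounded
    by the number of distinct IDs seen within one TTL window."""

    def __init__(self, ttl=300.0, clock=time.monotonic):
        self.ttl, self.clock = ttl, clock
        self.seen = OrderedDict()  # event_id -> first-seen timestamp

    def _evict(self, now):
        while self.seen:
            eid, ts = next(iter(self.seen.items()))
            if now - ts < self.ttl:
                break
            self.seen.popitem(last=False)  # oldest entry expired

    def is_new(self, event_id):
        now = self.clock()
        self._evict(now)
        if event_id in self.seen:
            return False               # duplicate within the TTL window
        self.seen[event_id] = now
        return True
```

Injecting `clock` makes the TTL logic testable; the trade-off discussion then centers on exact-set state like this versus probabilistic structures (Bloom filters) that cut memory at the cost of rare false drops.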
Hard · System Design
Design a versioned canonicalization microservice responsible for normalizing emails, phone numbers, and addresses used across many products. Describe API design, versioning strategy, backward compatibility guarantees, caching, latency SLA targets, rate limiting, and a migration plan for clients.
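The normalization rules themselves are the part worth sketching concretely. A naive illustration of two such rules (the Gmail dot/plus folding is a provider-specific assumption, and the phone logic assumes a default country rather than using a full parser like the `phonenumbers` library, which a real service would prefer):

```python
import re

def canonicalize_email(raw: str) -> str:
    """Lowercase and trim; for Gmail only, drop dots and +tags in the
    local part (an assumed provider-specific rule, not universal)."""
    local, _, domain = raw.strip().lower().partition("@")
    if domain in ("gmail.com", "googlemail.com"):
        local = local.split("+", 1)[0].replace(".", "")
        domain = "gmail.com"
    return f"{local}@{domain}"

def canonicalize_phone(raw: str, default_country: str = "1") -> str:
    """Strip formatting and emit an E.164-style +<country><number> string.
    Naive sketch: 10-digit inputs are assumed national numbers for the
    default country; anything starting with '+' is taken as already
    country-qualified."""
    digits = re.sub(r"\D", "", raw)
    if raw.strip().startswith("+"):
        return "+" + digits
    if len(digits) == 10:
        digits = default_country + digits
    return "+" + digits
```

Versioning matters precisely because rules like these change: clients that stored `v1` canonical forms must be able to re-canonicalize (or map) to `v2`, which is why the service should return the rule version alongside every normalized value.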
