InterviewStack.io

Data Quality and Edge Case Handling Questions

Practical skills and best practices for recognizing, preventing, and resolving real-world data quality problems and edge cases in queries, analyses, and production data pipelines. Core areas include handling missing and null values, empty and single-row result sets, duplicate records and deduplication strategies, outliers and distributional assumptions, data type mismatches and inconsistent formatting, canonicalization and normalization of identifiers and addresses, time zone and daylight saving time handling, null propagation in joins, and guarding against division by zero and other runtime anomalies. It also covers merging partial or inconsistent records from multiple sources, attribution and aggregation edge cases, group-by and window-function corner cases, performance and correctness trade-offs at scale, designing robust queries and pipeline validations, implementing sanity checks and test datasets, and documenting data limitations and assumptions. At senior levels this expands to proactively designing automated data quality checks, monitoring and alerting for anomalies, defining remediation workflows, communicating trade-offs to stakeholders, and balancing engineering effort against business risk.
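One concrete instance of the runtime anomalies mentioned above is guarding a rate calculation against division by zero and null inputs. A minimal sketch (the function name and None-for-undefined convention are illustrative assumptions, not from any particular library):

```python
def safe_rate(numerator, denominator):
    """Return numerator/denominator, or None when the ratio is undefined.

    Treats a None operand or a zero denominator as "undefined" rather than
    raising, so downstream aggregations can filter or impute explicitly.
    """
    if numerator is None or denominator is None or denominator == 0:
        return None
    return numerator / denominator
```

Returning None instead of raising keeps the choice of how to handle undefined ratios (drop, impute, alert) explicit at the call site.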

Hard · System Design
83 practiced
Design an end-to-end test harness that can validate pipeline changes against production-like data without writing to production datasets. Include strategies for data masking, selecting representative subsets that preserve distribution, running parallel shadow/canary pipelines, diffing outputs while tolerating nondeterminism, and infrastructure and safety measures to prevent accidental write-through.
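The "diffing outputs while tolerating nondeterminism" part of this question can be sketched in a few lines: canonicalize row order before comparing, and allow floating-point noise in numeric fields. The key column and tolerance below are illustrative assumptions:

```python
import math

def canonicalize(rows, key):
    """Sort rows by a stable key so ordering differences don't count as diffs."""
    return sorted(rows, key=lambda r: r[key])

def rows_match(a, b, float_tol=1e-9):
    """Field-by-field comparison, tolerant of float rounding noise."""
    if a.keys() != b.keys():
        return False
    for k in a:
        x, y = a[k], b[k]
        if isinstance(x, float) and isinstance(y, float):
            if not math.isclose(x, y, rel_tol=float_tol, abs_tol=float_tol):
                return False
        elif x != y:
            return False
    return True

def outputs_equivalent(prod, shadow, key):
    """True if the shadow pipeline's output matches production's, up to
    row ordering and float tolerance."""
    prod, shadow = canonicalize(prod, key), canonicalize(shadow, key)
    return len(prod) == len(shadow) and all(
        rows_match(p, s) for p, s in zip(prod, shadow)
    )
```

A real harness would extend this to tolerate other nondeterminism sources (generated IDs, timestamps) by excluding or normalizing those fields before the diff.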
Medium · System Design
80 practiced
Design metrics and alerting for a nightly ETL job to detect data quality regressions. Specify concrete metrics (row counts, null-rate per column, schema hash, distribution divergence), suggested alert thresholds or adaptive alerting logic, and an escalation policy. Explain how to set SLOs and error budgets for freshness and completeness of nightly data.
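Two of the concrete metrics named here, row count and null-rate per column, can be sketched with fixed thresholds. Rows are assumed to be dicts, and the threshold values are illustrative, not recommendations:

```python
def compute_metrics(rows, columns):
    """Return row count and null-rate per column for a batch of dict rows."""
    n = len(rows)
    null_rates = {
        col: (sum(1 for r in rows if r.get(col) is None) / n) if n else 1.0
        for col in columns
    }
    return {"row_count": n, "null_rates": null_rates}

def check_thresholds(metrics, min_rows, max_null_rate):
    """Return a list of human-readable alerts; an empty list means healthy."""
    alerts = []
    if metrics["row_count"] < min_rows:
        alerts.append(f"row_count {metrics['row_count']} below {min_rows}")
    for col, rate in metrics["null_rates"].items():
        if rate > max_null_rate:
            alerts.append(f"null rate {rate:.2%} in '{col}' exceeds {max_null_rate:.0%}")
    return alerts
```

Adaptive alerting would replace the fixed `min_rows`/`max_null_rate` with values derived from the metric's own history (e.g. trailing percentiles), which avoids hand-tuning per column.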
Medium · System Design
87 practiced
You must backfill a derived column for the last 3 years in a petabyte-scale data warehouse without impacting production jobs. As SRE, create a backfill plan covering partitioning, batching, throttling, idempotency, resumability, validation checks, rollback strategies, and an estimate of resource usage and time. Explain how to communicate and coordinate with stakeholders.
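The batching, idempotency, and resumability pieces of a backfill plan reduce to a small loop shape: process one partition at a time, checkpoint completed partitions, and skip anything already done on restart. `process_partition` and the checkpoint store are hypothetical stand-ins; real partition granularity and throttling would be tuned against production load:

```python
import time

def backfill(partitions, process_partition, checkpoint, throttle_seconds=0.0):
    """Process partitions in order, skipping any already checkpointed,
    so the job can be killed and safely re-run from where it left off."""
    for p in partitions:
        if p in checkpoint:              # resumability: never redo finished work
            continue
        process_partition(p)             # must itself be idempotent
                                         # (e.g. overwrite-by-partition, not append)
        checkpoint.add(p)                # persist progress before moving on
        if throttle_seconds:
            time.sleep(throttle_seconds)  # crude throttle to protect prod jobs
    return checkpoint
```

In practice the checkpoint would live in durable storage (a control table), and throttling would react to warehouse load rather than sleep a fixed interval.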
Easy · Technical
77 practiced
Design a unit test (pseudocode or using a testing framework you know) that verifies a pipeline transformation correctly handles single-row and zero-row inputs. Include assertions for correctness of output, emitted metrics (counts, null-rates), and that no runtime exceptions are thrown. Describe test data, expected results, and why these cases are important.
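A hedged answer sketch for this question, using plain assert-style tests. The `transform` under test is a hypothetical stand-in (uppercases a name field and reports simple metrics); the point is that the zero-row case must not raise and must emit well-defined metrics:

```python
def transform(rows):
    """Hypothetical pipeline transform: uppercase names, report metrics."""
    out = [{**r, "name": (r["name"] or "").upper()} for r in rows]
    metrics = {
        "input_count": len(rows),
        "output_count": len(out),
        "null_name_rate": (sum(1 for r in rows if r["name"] is None) / len(rows))
                          if rows else 0.0,
    }
    return out, metrics

def test_zero_row_input():
    out, metrics = transform([])          # must not raise on empty input
    assert out == []
    assert metrics == {"input_count": 0, "output_count": 0, "null_name_rate": 0.0}

def test_single_row_input():
    out, metrics = transform([{"id": 1, "name": "ada"}])
    assert out == [{"id": 1, "name": "ADA"}]
    assert metrics["input_count"] == metrics["output_count"] == 1
    assert metrics["null_name_rate"] == 0.0
```

These cases matter because empty and single-row inputs commonly trigger division-by-zero in rate metrics, bad default aggregates, or code paths that were only exercised with bulk data.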
Medium · Technical
85 practiced
Describe how to automate anomaly detection for key pipeline metrics (ingest rate, schema changes, null-rate) using statistical methods or lightweight ML. Cover feature extraction, method selection (statistical tests vs. simple models), runtime constraints, false-positive management, and the runbook actions SREs should take when anomalies are detected.
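On the statistical-methods end of this question, a trailing-window z-score check is a common baseline. The window size, sigma threshold, and warm-up guard below are illustrative; production systems would also account for seasonality:

```python
import statistics

def is_anomalous(history, current, k=3.0, min_history=5):
    """Flag `current` if it deviates more than k sigma from `history`.

    Returns False during warm-up (too little history) to manage
    false positives when a metric is new.
    """
    if len(history) < min_history:
        return False
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:                 # constant history: any change is anomalous
        return current != mean
    return abs(current - mean) / stdev > k
```

A runbook entry per metric would then map a True result to concrete actions (pause downstream consumers, page the on-call, open a data-quality ticket).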
