InterviewStack.io

Data Quality and Edge Case Handling Questions

Practical skills and best practices for recognizing, preventing, and resolving real-world data quality problems and edge cases in queries, analyses, and production data pipelines. Core areas include handling missing and null values, empty and single-row result sets, duplicate records and deduplication strategies, outliers and distributional assumptions, data type mismatches and inconsistent formatting, canonicalization and normalization of identifiers and addresses, time zone and daylight saving time handling, null propagation in joins, and guarding against division by zero and other runtime anomalies. The topic also covers merging partial or inconsistent records from multiple sources, attribution and aggregation edge cases, GROUP BY and window function corner cases, performance and correctness trade-offs at scale, designing robust queries and pipeline validations, implementing sanity checks and test datasets, and documenting data limitations and assumptions. At senior levels this expands to proactively designing automated data quality checks, monitoring and alerting for anomalies, defining remediation workflows, communicating trade-offs to stakeholders, and balancing engineering effort against business risk.

Medium · Technical
126 practiced
Compare three deduplication approaches at scale: (1) exact dedupe using an external store such as Redis or a database, (2) windowed stateful dedupe inside a stream processor, and (3) probabilistic dedupe using Bloom filters. For each approach, discuss space and time costs, false-positive and false-negative behavior, restart and replay behavior, and recommended use cases.
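As a starting point for approach (3), here is a minimal Bloom filter sketch in Python. It is illustrative only, not a production design: the class name `BloomFilter`, the helper `bloom_dedupe`, and the choice of SHA-256 with double hashing are assumptions made for this example. The sizing uses the standard formulas for bit count and hash count given an expected item count and a target false-positive rate.

```python
import hashlib
import math
from typing import Iterable, Iterator


class BloomFilter:
    """Hypothetical in-memory Bloom filter for string keys."""

    def __init__(self, expected_items: int, false_positive_rate: float) -> None:
        # Standard sizing: m = -n * ln(p) / (ln 2)^2, k = (m / n) * ln 2.
        self.num_bits = max(
            1,
            math.ceil(-expected_items * math.log(false_positive_rate) / math.log(2) ** 2),
        )
        self.num_hashes = max(1, round(self.num_bits / expected_items * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, key: str) -> Iterator[int]:
        # Double hashing: derive k bit positions from two 64-bit halves of one digest.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd so the step is never zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key: str) -> bool:
        """Insert key. Return True if the key was possibly seen before."""
        possibly_seen = True
        for pos in self._positions(key):
            byte_index, mask = pos // 8, 1 << (pos % 8)
            if not self.bits[byte_index] & mask:
                possibly_seen = False
                self.bits[byte_index] |= mask
        return possibly_seen


def bloom_dedupe(keys: Iterable[str], bloom: BloomFilter) -> Iterator[str]:
    """Yield keys the filter has not seen. A false positive drops a unique key."""
    for key in keys:
        if not bloom.add(key):
            yield key
```

The behavior worth calling out in an answer: a Bloom filter never reports a previously added key as new, but it can report a new key as already seen, so in a dedupe pipeline a false positive means a legitimate record is silently dropped. The bit array also lives in process memory here, so after a restart the filter is empty unless it is persisted or rebuilt by replay.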
Hard · Technical
81 practiced
As a principal SRE, draft a high-level rollout plan to adopt data quality SLOs company-wide. Include policy definition, required tooling, incentives and KPIs for teams, exceptions and governance, training, and metrics to measure adoption and ROI. Discuss how you would balance engineering effort against business value and propose a phased timeline.
Medium · Technical
85 practiced
Describe how to automate anomaly detection for key pipeline metrics (ingest rate, schema changes, null rate) using statistical methods or lightweight ML. Cover feature extraction, method selection (statistical tests versus simple models), runtime constraints, false-positive management, and the runbook actions SREs should take when anomalies are detected.
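One of the simplest statistical methods that fits this question is a rolling z-score over a recent window of a metric. The sketch below is one possible approach, not a recommended production detector: the class name `RollingZScoreDetector` and the default `window`, `threshold`, and `min_samples` values are assumptions chosen for illustration.

```python
import statistics
from collections import deque


class RollingZScoreDetector:
    """Hypothetical detector: flags values far from the recent rolling mean."""

    def __init__(self, window: int = 30, threshold: float = 3.0, min_samples: int = 10) -> None:
        self.history: deque = deque(maxlen=window)  # oldest values fall out automatically
        self.threshold = threshold
        self.min_samples = min_samples

    def observe(self, value: float) -> bool:
        """Return True if value is anomalous relative to the current baseline."""
        is_anomaly = False
        if len(self.history) >= self.min_samples:
            mean = statistics.fmean(self.history)
            std = statistics.pstdev(self.history)
            if std == 0:
                # Perfectly flat baseline: any deviation counts as an anomaly.
                is_anomaly = value != mean
            else:
                is_anomaly = abs(value - mean) / std > self.threshold
        if not is_anomaly:
            # Keep anomalies out of the baseline so an incident does not
            # shift the mean and hide later anomalies.
            self.history.append(value)
        return is_anomaly
```

Excluding flagged values from the baseline is a design choice with a cost: after a genuine level shift (for example, a permanent traffic increase) the detector keeps alerting until the baseline is reset. A fuller answer would discuss that trade-off along with seasonality and robust alternatives such as median and MAD.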
Medium · System Design
80 practiced
Design metrics and alerting for a nightly ETL job to detect data quality regressions. Specify concrete metrics (row counts, null-rate per column, schema hash, distribution divergence), suggested alert thresholds or adaptive alerting logic, and an escalation policy. Explain how to set SLOs and error budgets for freshness and completeness of nightly data.
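The metrics named in this question (row count, per-column null rate, schema hash) can be sketched in a few lines of Python. Everything here is a hypothetical illustration: the function names `schema_hash`, `profile_batch`, and `find_regressions`, the assumption that the schema is supplied as a column-to-type mapping, and the fixed thresholds `max_row_drop` and `max_null_increase` are not from the source. Distribution divergence and adaptive thresholds are left out.

```python
import hashlib
import json
from typing import Any, Dict, List, Mapping, Optional, Sequence


def schema_hash(schema: Mapping[str, str]) -> str:
    """Stable hash of a column-name -> declared-type mapping."""
    canonical = json.dumps(dict(schema), sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def profile_batch(rows: Sequence[Mapping[str, Any]], schema: Mapping[str, str]) -> Dict[str, Any]:
    """Compute row count, per-column null rate, and schema hash for one batch."""
    total = len(rows)
    null_rate: Dict[str, Optional[float]] = {}
    for col in schema:
        nulls = sum(1 for row in rows if row.get(col) is None)
        # Guard against division by zero: the rate is undefined for an empty batch.
        null_rate[col] = nulls / total if total else None
    return {"row_count": total, "null_rate": null_rate, "schema_hash": schema_hash(schema)}


def find_regressions(
    profile: Mapping[str, Any],
    baseline: Mapping[str, Any],
    max_row_drop: float = 0.5,        # assumed threshold: alert on a >50% row drop
    max_null_increase: float = 0.05,  # assumed threshold: alert on +5 points of nulls
) -> List[str]:
    """Compare tonight's profile with a baseline and return alert names."""
    alerts: List[str] = []
    if profile["row_count"] < baseline["row_count"] * (1 - max_row_drop):
        alerts.append("row_count_drop")
    if profile["schema_hash"] != baseline["schema_hash"]:
        alerts.append("schema_change")
    for col, rate in profile["null_rate"].items():
        base = baseline["null_rate"].get(col)
        if rate is None or base is None:
            continue  # undefined rate (empty batch or new column); covered by other alerts
        if rate - base > max_null_increase:
            alerts.append(f"null_rate:{col}")
    return alerts
```

In practice the baseline would more likely be a rolling aggregate over recent nights than a single prior run, and the alert names would map to severities in the escalation policy.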
Medium · Technical
66 practiced
Implement in Python a generator function `dedupe(records, key_fn, capacity)` that consumes an iterator of dict-like records and yields records that are unique by `key_fn`. The generator must use bounded memory and evict the least-recently-seen keys when capacity is exceeded (LRU eviction). Provide code and explain the time and space complexity and the trade-offs when used on a high-throughput stream.
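A minimal sketch of one possible answer, using `collections.OrderedDict` as the LRU structure. The signature follows the question; the type hints and the `capacity` validation are additions for this example.

```python
from collections import OrderedDict
from typing import Any, Callable, Hashable, Iterable, Iterator, Mapping


def dedupe(
    records: Iterable[Mapping[str, Any]],
    key_fn: Callable[[Mapping[str, Any]], Hashable],
    capacity: int,
) -> Iterator[Mapping[str, Any]]:
    """Yield records whose key is not among the `capacity` most recently seen keys."""
    # Because this is a generator, the check runs on first iteration, not at call time.
    if capacity < 1:
        raise ValueError("capacity must be at least 1")
    seen: "OrderedDict[Hashable, None]" = OrderedDict()
    for record in records:
        key = key_fn(record)
        if key in seen:
            # Duplicate: refresh its recency and drop the record.
            seen.move_to_end(key)
            continue
        seen[key] = None
        if len(seen) > capacity:
            # Evict the least-recently-seen key (front of the ordered dict).
            seen.popitem(last=False)
        yield record
```

Each record costs O(1) on average (hash lookup, `move_to_end`, `popitem`), and memory is O(capacity) keys. The main trade-off is correctness after eviction: once a key has been evicted, a later duplicate of it is yielded again, so the guarantee is "unique within the last `capacity` distinct keys", not globally unique. For example, with `capacity=1` the key sequence 1, 2, 1 yields all three records. The state is also held in a single process, so it is lost on restart and is not shared across parallel consumers.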
