Data Quality and Edge Case Handling Questions

Practical skills and best practices for recognizing, preventing, and resolving real-world data quality problems and edge cases in queries, analyses, and production data pipelines. Core areas include handling missing and null values, empty and single-row result sets, duplicate records and deduplication strategies, outliers and distributional assumptions, data type mismatches and inconsistent formatting, canonicalization and normalization of identifiers and addresses, time zone and daylight saving time handling, null propagation in joins, and guarding against division by zero and other runtime anomalies. It also covers merging partial or inconsistent records from multiple sources, attribution and aggregation edge cases, GROUP BY and window function corner cases, performance and correctness trade-offs at scale, designing robust queries and pipeline validations, implementing sanity checks and test datasets, and documenting data limitations and assumptions. At senior levels this expands to proactively designing automated data quality checks, monitoring and alerting for anomalies, defining remediation workflows, communicating trade-offs to stakeholders, and balancing engineering effort against business risk.

Medium · Technical

Define an automated remediation workflow for a common data quality failure: a sudden spike in nulls for a critical column. Include detection, automatic mitigation steps (such as switching to a previous snapshot or using default values), alerting and human-in-the-loop escalation, replay/backfill strategy, and validation checks to confirm recovery.
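One possible starting point for the detection and mitigation steps, as a minimal sketch: compare a batch's null rate with a historical baseline and choose between publishing the batch and serving the previous snapshot. The function name `check_null_spike`, the thresholds, and the action labels are hypothetical, and alerting, escalation, and backfill are left out.

```python
from dataclasses import dataclass


@dataclass
class NullCheckResult:
    null_rate: float
    baseline_rate: float
    action: str  # "publish" or "serve_previous_snapshot" (hypothetical labels)


def check_null_spike(rows, column, baseline_rate,
                     max_ratio=3.0, min_abs_increase=0.05):
    """Decide whether a batch is safe to publish based on its null rate.

    A spike needs both a relative jump (max_ratio) and an absolute jump
    (min_abs_increase), so a tiny baseline does not trigger on noise.
    The threshold values are illustrative assumptions.
    """
    if not rows:
        # Assumption: an empty batch is treated as fully missing data.
        return NullCheckResult(1.0, baseline_rate, "serve_previous_snapshot")
    nulls = sum(1 for row in rows if row.get(column) is None)
    rate = nulls / len(rows)
    spiked = (rate - baseline_rate > min_abs_increase
              and rate > baseline_rate * max_ratio)
    action = "serve_previous_snapshot" if spiked else "publish"
    return NullCheckResult(rate, baseline_rate, action)


# Example batches: 1% nulls (normal) versus 40% nulls (spike).
healthy = [{"user_id": None if i == 0 else i} for i in range(100)]
broken = [{"user_id": None if i < 40 else i} for i in range(100)]
```

The same check can serve as the recovery validation: after a backfill, the replayed batch should come back with `action == "publish"` before the pipeline resumes normal publishing.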

Hard · System Design

Design a remediation orchestration system that can safely replay failed message subsets from a dead-letter queue into an idempotent pipeline, supports throttled replay, tracks progress per batch, and provides visibility for operators. Include data models, APIs, operational controls, safety checks (idempotency tokens, max retries), and permissioning to avoid accidental duplicate side effects.
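The core replay loop could be sketched as below, assuming each message carries an `idempotency_token` field. The names `replay_batch` and `ReplayProgress` are hypothetical, and the in-memory `seen_tokens` set stands in for a durable store that a real system would update atomically with the side effect.

```python
import time
from dataclasses import dataclass, field


@dataclass
class ReplayProgress:
    replayed: int = 0
    skipped_duplicate: int = 0
    exhausted: list = field(default_factory=list)  # tokens that hit max_retries


def replay_batch(messages, handler, seen_tokens,
                 max_per_second=50.0, max_retries=3, sleep=time.sleep):
    """Replay dead-lettered messages with throttling and duplicate protection."""
    progress = ReplayProgress()
    interval = 1.0 / max_per_second
    for msg in messages:
        token = msg["idempotency_token"]
        if token in seen_tokens:
            # Already applied once: skip to avoid duplicate side effects.
            progress.skipped_duplicate += 1
            continue
        for attempt in range(1, max_retries + 1):
            try:
                handler(msg)
            except Exception:
                if attempt == max_retries:
                    progress.exhausted.append(token)
                continue
            seen_tokens.add(token)
            progress.replayed += 1
            break
        sleep(interval)  # simple fixed-interval throttle between messages
    return progress
```

Passing `sleep` as a parameter keeps the throttle testable without real delays. The returned `ReplayProgress` is the kind of per-batch record an operator dashboard could display.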

Hard · System Design

Design an end-to-end test harness that can validate pipeline changes against production-like data without writing to production datasets. Include strategies for data masking, selecting representative subsets that preserve distribution, running parallel shadow/canary pipelines, diffing outputs while tolerating nondeterminism, and infrastructure and safety measures to prevent accidental write-through.
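For the output-diffing part, one approach is to key rows by a stable identifier, ignore fields known to vary between runs, and compare floats with a tolerance. The sketch below assumes rows are dictionaries; `diff_outputs`, the `id` key, and the `processed_at` field are hypothetical names chosen for illustration.

```python
import math


def diff_outputs(baseline, candidate, key="id",
                 ignore=("processed_at",), rel_tol=1e-9):
    """Compare two pipeline outputs while tolerating harmless nondeterminism.

    Row order is ignored (rows are matched by `key`), fields listed in
    `ignore` are skipped, and floats are compared with a relative tolerance.
    A field absent from one side is treated as None in this sketch.
    """
    base = {row[key]: row for row in baseline}
    cand = {row[key]: row for row in candidate}
    report = {
        "missing": sorted(base.keys() - cand.keys()),
        "unexpected": sorted(cand.keys() - base.keys()),
        "changed": [],
    }
    for k in sorted(base.keys() & cand.keys()):
        fields = (base[k].keys() | cand[k].keys()) - set(ignore)
        for name in sorted(fields):
            a, b = base[k].get(name), cand[k].get(name)
            if isinstance(a, float) and isinstance(b, float):
                same = math.isclose(a, b, rel_tol=rel_tol)
            else:
                same = a == b
            if not same:
                report["changed"].append((k, name, a, b))
    return report
```

A shadow run would pass when all three lists in the report are empty, or when every entry matches an explicitly approved expected change.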

Easy · Technical

Give three lightweight statistical heuristics an SRE can use to detect outliers in numeric ingestion metrics at runtime (for example: rolling z-score, interquartile range fence, median absolute deviation). For each heuristic describe when it's preferable, its computational cost, and how it behaves with skewed or heavy-tailed data.
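The three heuristics named in the question can be sketched with the standard library alone. Each function scores a new value `x` against a window of recent values; the function names and default thresholds (3.0, 1.5, 3.5) are common conventions used here as assumptions, not prescribed values.

```python
import statistics


def zscore_flag(window, x, threshold=3.0):
    """Rolling z-score: distance from the window mean in standard deviations."""
    mean = statistics.fmean(window)
    stdev = statistics.pstdev(window)
    if stdev == 0:
        return x != mean  # constant window: any different value stands out
    return abs(x - mean) / stdev > threshold


def iqr_flag(window, x, k=1.5):
    """Interquartile range fence: outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(window, n=4)
    iqr = q3 - q1
    return x < q1 - k * iqr or x > q3 + k * iqr


def mad_flag(window, x, threshold=3.5):
    """Median absolute deviation, scaled by 0.6745 (the modified z-score)."""
    med = statistics.median(window)
    mad = statistics.median([abs(v - med) for v in window])
    if mad == 0:
        return x != med
    return 0.6745 * abs(x - med) / mad > threshold


# Illustrative window of recent per-minute ingestion counts.
recent = [10, 11, 9, 10, 12, 10, 11, 9, 10, 11,
          10, 9, 12, 10, 11, 10, 9, 11, 10]
```

With this window all three flag a new value of 500 and accept 12. The practical difference appears once outliers enter the window itself: the mean and standard deviation shift with every extreme value, while the median-based statistics change far less, which is why the IQR and MAD variants tend to hold up better on skewed or heavy-tailed metrics.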

Easy · Technical

Explain common pitfalls when handling timestamps across distributed services in multiple timezones. As an SRE, state the concrete timestamp rules you would enforce (UTC storage, timezone-aware APIs, validation), how to handle daylight saving transitions, and concrete tests you would add to catch off-by-one-day aggregation errors.
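A few of these rules can be made concrete with Python's standard `zoneinfo` module, which assumes IANA time zone data is available on the host. The helper names below are hypothetical; the sketch rejects naive timestamps, buckets a UTC instant by its local calendar day, and measures the length of a local day across a DST transition.

```python
from datetime import date, datetime, timedelta, timezone
from zoneinfo import ZoneInfo


def require_utc(ts):
    """Reject naive timestamps and normalize aware ones to UTC."""
    if ts.tzinfo is None or ts.utcoffset() is None:
        raise ValueError("naive timestamp rejected; attach an explicit offset")
    return ts.astimezone(timezone.utc)


def local_day(ts_utc, tz_name):
    """Calendar date of a UTC instant in the reporting time zone."""
    return ts_utc.astimezone(ZoneInfo(tz_name)).date()


def local_day_length(day, tz_name):
    """Elapsed time between consecutive local midnights."""
    tz = ZoneInfo(tz_name)
    start = datetime(day.year, day.month, day.day, tzinfo=tz)
    nxt = day + timedelta(days=1)
    end = datetime(nxt.year, nxt.month, nxt.day, tzinfo=tz)
    # Convert to UTC first: subtracting two datetimes that share a tzinfo
    # compares wall-clock values and would hide the DST shift.
    return end.astimezone(timezone.utc) - start.astimezone(timezone.utc)


# An event at 03:30 UTC falls on the previous calendar day in New York.
event = datetime(2024, 3, 10, 3, 30, tzinfo=timezone.utc)
print(event.date(), local_day(event, "America/New_York"))
print(local_day_length(date(2024, 3, 10), "America/New_York"))
```

Assertions on exactly these cases (a late-evening local event that crosses the UTC date boundary, and days that last 23 or 25 hours) make useful regression tests for daily aggregations.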
