InterviewStack.io

Data Quality and Edge Case Handling Questions

Practical skills and best practices for recognizing, preventing, and resolving real-world data quality problems and edge cases in queries, analyses, and production data pipelines. Core areas include handling missing and NULL values, empty and single-row result sets, duplicate records and deduplication strategies, outliers and distributional assumptions, data type mismatches and inconsistent formatting, canonicalization and normalization of identifiers and addresses, time zone and daylight saving time handling, NULL propagation in joins, and guarding against division by zero and other runtime anomalies.

It also covers merging partial or inconsistent records from multiple sources, attribution and aggregation edge cases, GROUP BY and window function corner cases, performance and correctness trade-offs at scale, designing robust queries and pipeline validations, implementing sanity checks and test datasets, and documenting data limitations and assumptions.

At senior levels this expands to proactively designing automated data quality checks, monitoring and alerting for anomalies, defining remediation workflows, communicating trade-offs to stakeholders, and balancing engineering effort against business risk.
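One of the simplest guards mentioned above, division by zero, can be handled in ANSI SQL with NULLIF. A minimal sketch run through Python's sqlite3; the table and column names are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily (day TEXT, clicks INTEGER, impressions INTEGER)")
conn.executemany("INSERT INTO daily VALUES (?, ?, ?)",
                 [("2024-01-01", 30, 600), ("2024-01-02", 5, 0)])

# NULLIF(impressions, 0) turns a zero denominator into NULL, so the
# division yields NULL (None in Python) instead of raising an error.
rows = conn.execute("""
    SELECT day, 1.0 * clicks / NULLIF(impressions, 0) AS ctr
    FROM daily
    ORDER BY day
""").fetchall()
print(rows)  # [('2024-01-01', 0.05), ('2024-01-02', None)]
```

The same NULLIF idiom works in PostgreSQL and most other SQL dialects; downstream aggregates such as AVG then skip the NULL rather than blowing up.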

Hard · Technical
Hard leadership: Create an ownership and remediation workflow for data quality alerts. Define roles (analyst, data engineer, product owner), SLAs for triage and remediation, rollback procedures for dashboards, and how to communicate impact and risk to executive stakeholders. Include example runbooks for a common data-quality alert (e.g., 50% drop in daily active users).
Medium · Technical
Explain why pushing restrictive filters early (pre-join) can improve performance but sometimes change results in SQL. Given sample tables and a problematic query, rewrite it to keep correctness while minimizing row shuffles on a distributed system. Discuss optimizer considerations.
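One acceptable starting point for this question is the classic LEFT JOIN pitfall: moving a predicate on the right-hand table from the ON clause into the WHERE clause silently converts the outer join into an inner join. A minimal sketch via Python's sqlite3, with invented schema and data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users(id INTEGER, name TEXT);
    CREATE TABLE orders(user_id INTEGER, status TEXT);
    INSERT INTO users VALUES (1, 'ana'), (2, 'bo');
    INSERT INTO orders VALUES (1, 'paid');
""")

# Predicate in the ON clause: 'bo' survives with a NULL order,
# preserving true LEFT JOIN semantics.
on_clause = conn.execute("""
    SELECT u.name, o.status FROM users u
    LEFT JOIN orders o ON o.user_id = u.id AND o.status = 'paid'
""").fetchall()

# Same predicate in WHERE: rows where o.status is NULL are discarded,
# so the query silently behaves like an INNER JOIN.
where_clause = conn.execute("""
    SELECT u.name, o.status FROM users u
    LEFT JOIN orders o ON o.user_id = u.id
    WHERE o.status = 'paid'
""").fetchall()

print(on_clause)     # [('ana', 'paid'), ('bo', None)]
print(where_clause)  # [('ana', 'paid')]
```

On a distributed engine the ON-clause version still lets the optimizer push the `status = 'paid'` filter into the scan of `orders` before the shuffle, so correctness does not have to cost performance.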
Medium · Technical
Two sources provide overlapping customer profiles. Source A has (customer_id, email, phone, updated_at), Source B has (external_id, email, phone, address, last_seen). Write SQL consolidation logic that prioritizes non-null fields from the most recently updated source and falls back to the other source when fields are missing. Show sample rows and the expected merged output.
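A minimal sketch of one consolidation approach, run through Python's sqlite3. It assumes, purely for illustration, that customer_id and external_id share a key space and that the timestamp strings compare lexicographically; the sample rows are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source_a(customer_id INTEGER, email TEXT, phone TEXT, updated_at TEXT);
    CREATE TABLE source_b(external_id INTEGER, email TEXT, phone TEXT, address TEXT, last_seen TEXT);
    INSERT INTO source_a VALUES (1, 'a@x.com', NULL, '2024-03-01');
    INSERT INTO source_b VALUES (1, NULL, '555-0100', '1 Main St', '2024-02-01');
""")

# Source A is newer here, so its non-null fields win and Source B
# fills the gaps via COALESCE; the CASE flips priority when B is newer.
merged = conn.execute("""
    SELECT a.customer_id,
           CASE WHEN a.updated_at >= b.last_seen
                THEN COALESCE(a.email, b.email)
                ELSE COALESCE(b.email, a.email) END AS email,
           CASE WHEN a.updated_at >= b.last_seen
                THEN COALESCE(a.phone, b.phone)
                ELSE COALESCE(b.phone, a.phone) END AS phone,
           b.address
    FROM source_a a
    JOIN source_b b ON b.external_id = a.customer_id
""").fetchall()
print(merged)  # [(1, 'a@x.com', '555-0100', '1 Main St')]
```

A fuller answer would also handle customers present in only one source (FULL OUTER JOIN or a UNION of anti-joins) and records with NULL timestamps.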
Medium · System Design
Design an automated set of daily data quality checks for a key metrics ETL job (e.g., daily active users, new signups, revenue). Include checks for schema changes, row-count anomalies, null rate thresholds, cardinality changes, and value-range checks. Describe where to store results, how to alert, and how to triage false positives.
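Two of the checks named above, row-count minimums and null-rate thresholds, can be sketched in a few lines. This is an illustrative skeleton, not a production framework: the table, column, and threshold values are invented, and a real pipeline would persist results and wire failures into alerting.

```python
import sqlite3

def run_daily_checks(conn, table, column, min_rows, max_null_rate):
    """Return (check_name, passed) pairs for two simple daily checks:
    a minimum row count and a null-rate ceiling on one column."""
    results = []
    (n,) = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
    results.append(("row_count_min", n >= min_rows))
    (nulls,) = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE {column} IS NULL").fetchone()
    null_rate = nulls / n if n else 1.0  # an empty table counts as fully null
    results.append(("null_rate", null_rate <= max_null_rate))
    return results

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE signups(user_id INTEGER, email TEXT)")
conn.executemany("INSERT INTO signups VALUES (?, ?)",
                 [(1, "a@x.com"), (2, None), (3, "c@x.com"), (4, "d@x.com")])

checks = run_daily_checks(conn, "signups", column="email",
                          min_rows=3, max_null_rate=0.5)
print(checks)  # [('row_count_min', True), ('null_rate', True)]
```

Schema-change, cardinality, and value-range checks follow the same pattern: a query producing a scalar, compared against a stored expectation or a rolling historical baseline.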
Easy · Technical
Given a table events(event_id, user_id, event_ts, event_type) with possible duplicate ingestion rows, write a SQL query (ANSI SQL / PostgreSQL) to identify duplicate events and then produce a deduplicated table using ROW_NUMBER() so you keep the earliest event based on event_ts. Include schema and sample data in your explanation.
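A minimal sketch of one acceptable answer, run through Python's sqlite3 (window functions require SQLite 3.25+); the sample data is invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events(event_id INTEGER, user_id INTEGER,
                        event_ts TEXT, event_type TEXT);
    INSERT INTO events VALUES
      (1, 10, '2024-01-01 09:00', 'click'),
      (1, 10, '2024-01-01 09:05', 'click'),  -- duplicate ingestion of event 1
      (2, 11, '2024-01-01 10:00', 'view');
""")

# Number each event's copies by event_ts within its event_id partition,
# then keep only the earliest copy (rn = 1).
deduped = conn.execute("""
    SELECT event_id, user_id, event_ts, event_type
    FROM (
        SELECT e.*,
               ROW_NUMBER() OVER (PARTITION BY event_id
                                  ORDER BY event_ts) AS rn
        FROM events e
    )
    WHERE rn = 1
    ORDER BY event_id
""").fetchall()
print(deduped)
# [(1, 10, '2024-01-01 09:00', 'click'), (2, 11, '2024-01-01 10:00', 'view')]
```

The same query runs unchanged on PostgreSQL; in production you would typically write the result into a new table (CREATE TABLE ... AS) rather than just selecting it.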
