InterviewStack.io

Data Quality and Validation Questions

Covers the core concepts and hands-on techniques for detecting, diagnosing, and preventing data quality problems. Topics include common data issues such as missing values, duplicates, outliers, incorrect labels, inconsistent formats, schema mismatches, referential integrity violations, and distribution or temporal drift. Candidates should be able to design and implement validation checks and data profiling queries, including schema validation, column-level constraints, aggregate checks, distinct counts, null and outlier detection, and business logic tests. This topic also covers the mindset of data validation and exploration: how to approach unfamiliar datasets, validate calculations against sources, document quality rules, decide on remediation strategies such as imputation, quarantine, or alerting, and communicate data limitations to stakeholders.
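
As a rough illustration of the kinds of checks named above, here is a minimal Python sketch of three of them: a null count, a duplicate-key check, and an interquartile-range (IQR) outlier test. The function names, the sample rows, and the `k=1.5` fence multiplier are assumptions made for this example, not part of the topic description.

```python
from collections import Counter
from statistics import quantiles


def null_count(rows, column):
    """Count rows where the column is absent or None."""
    return sum(1 for r in rows if r.get(column) is None)


def duplicate_keys(rows, key):
    """Return the key values that occur more than once."""
    counts = Counter(r[key] for r in rows)
    return sorted(k for k, n in counts.items() if n > 1)


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (k is an assumed default)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lo or v > hi]


# Hypothetical sample data for illustration only.
rows = [
    {"id": 1, "amount": 10.0},
    {"id": 2, "amount": 12.0},
    {"id": 2, "amount": 11.0},   # duplicate id
    {"id": 3, "amount": None},   # missing amount
    {"id": 4, "amount": 9.0},
    {"id": 5, "amount": 500.0},  # suspiciously large
]
amounts = [r["amount"] for r in rows if r["amount"] is not None]
```

Nulls are filtered out before the outlier test because a quantile over missing values is undefined. Whether a flagged value is then imputed, quarantined, or only alerted on is a separate remediation decision.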

Medium | System Design
34 practiced
How would you manage schema evolution in a data warehouse used for retraining models when new fields are added, renamed, or removed? Describe versioning strategy, backward/forward compatibility rules, migration approach, and communication plan for downstream teams.
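
One possible starting point for the compatibility rules is a schema diff that classifies each change. The sketch below assumes a schema is represented as a plain mapping of column name to type string, and it adopts one common convention: additive-only changes are treated as safe for existing readers. Both the representation and the rule are assumptions for illustration.

```python
def diff_schemas(old, new):
    """Classify differences between two {column: type} mappings."""
    old_cols, new_cols = set(old), set(new)
    return {
        "added": sorted(new_cols - old_cols),
        "removed": sorted(old_cols - new_cols),
        "retyped": sorted(c for c in old_cols & new_cols if old[c] != new[c]),
    }


def is_additive_only(old, new):
    """True when nothing was removed or retyped (assumed 'safe' rule)."""
    d = diff_schemas(old, new)
    return not d["removed"] and not d["retyped"]


# Hypothetical schema versions for illustration only.
v1 = {"user_id": "INT", "amount": "DECIMAL"}
v2 = {"user_id": "INT", "amount": "DECIMAL", "currency": "TEXT"}
v3 = {"user_id": "INT", "total": "DECIMAL"}
```

A rename cannot be told apart from a removal plus an addition when only the two schemas are compared: `diff_schemas(v1, v3)` reports `amount` as removed and `total` as added. Detecting renames would need extra metadata, such as a stable field ID or an explicit migration record.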
Medium | System Design
43 practiced
Design a deduplication strategy for a user profile system that receives both batch imports and streaming updates. Requirements: support eventual consistency, avoid lost updates, and be efficient for 200M users. Sketch data model, conflict resolution rules, and an approach to merging attributes from both sources.
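
One way to reason about the merge step is a per-attribute last-write-wins rule. The sketch below assumes each attribute carries its own `(value, updated_at, source)` tuple and that ties on timestamp are broken by comparing source names; both are hypothetical choices for this example, and a real design might prefer version vectors or explicit source priorities.

```python
def merge_profiles(a, b):
    """Merge two profiles attribute by attribute.

    Each profile maps attribute -> (value, updated_at, source).
    The entry with the greater (updated_at, source) pair wins.
    """
    merged = dict(a)
    for attr, candidate in b.items():
        current = merged.get(attr)
        if current is None or (candidate[1], candidate[2]) > (current[1], current[2]):
            merged[attr] = candidate
    return merged


# Hypothetical records for illustration only.
batch = {
    "email": ("a@example.com", 100, "batch"),
    "name": ("Ann", 100, "batch"),
}
stream = {
    "email": ("ann@example.com", 200, "stream"),
}
```

Because the winner depends only on the `(updated_at, source)` pair, the merge gives the same result regardless of argument order and can be reapplied safely. That is the property that lets batch and streaming writers converge without coordination. It does rely on timestamps from different sources being comparable, which would need to be validated in practice.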
Hard | System Design
37 practiced
Design a CI/CD pipeline for data validation that gates schema changes and transformation code before deployment. Include test types (unit tests, integration tests with sample data, contract checks), canary ingestion for new versions, rollback strategy, and how to surface validation failures to developers.
Hard | System Design
37 practiced
Design a reconciliation and canonicalization process for master data where different sources provide conflicting values for customer attributes (name, address, vip_status). Requirements: explain deduplication, trust scoring for sources, conflict resolution rules, audit trail, and how to roll back incorrect canonicalization.
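
The trust-scoring and audit-trail requirements can be sketched in a few lines. The example below assumes a static trust score per source and picks the value from the most trusted one; the source names, scores, and audit record shape are all hypothetical.

```python
def canonicalize(candidates, trust):
    """Pick a canonical value from conflicting source values.

    candidates: list of (source, value) pairs for one attribute.
    trust: mapping of source -> numeric trust score (assumed static here).
    Returns the chosen value plus an audit record of what was rejected.
    """
    # Unknown sources default to 0.0; the source name breaks ties deterministically.
    best = max(candidates, key=lambda c: (trust.get(c[0], 0.0), c[0]))
    audit = {
        "chosen": best,
        "rejected": [c for c in candidates if c != best],
    }
    return best[1], audit


# Hypothetical sources and scores for illustration only.
trust_scores = {"crm": 0.9, "web_form": 0.4}
vip_candidates = [("web_form", False), ("crm", True)]
value, audit = canonicalize(vip_candidates, trust_scores)
```

Keeping the rejected candidates in the audit record is what makes rollback feasible: if a trust score later turns out to be wrong, the decision can be recomputed from the stored inputs rather than reconstructed from the sources.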
Easy | Technical
63 practiced
Given the SQL table transactions(transaction_id PK, user_id INT, amount DECIMAL, occurred_at TIMESTAMP), write a SQL query that returns, for each column: total rows, null_count, distinct_count, and percent_null. Explain assumptions about large data volumes and sampling.
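
A minimal sketch of one possible answer, run against an in-memory SQLite database through Python's `sqlite3` module so it can be tried directly. The `profile_query` helper and the sample rows are assumptions for this example; the table and column names come from the question.

```python
import sqlite3

COLUMNS = ["transaction_id", "user_id", "amount", "occurred_at"]


def profile_query(table, columns):
    """Build one profiling SELECT per column and join them with UNION ALL."""
    parts = [
        f"SELECT '{c}' AS column_name, "
        f"COUNT(*) AS total_rows, "
        f"COUNT(*) - COUNT({c}) AS null_count, "  # COUNT(col) skips NULLs
        f"COUNT(DISTINCT {c}) AS distinct_count, "  # NULLs are not counted
        f"100.0 * (COUNT(*) - COUNT({c})) / NULLIF(COUNT(*), 0) AS percent_null "
        f"FROM {table}"
        for c in columns
    ]
    return "\nUNION ALL\n".join(parts)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions ("
    "transaction_id INTEGER PRIMARY KEY, user_id INT, "
    "amount DECIMAL, occurred_at TIMESTAMP)"
)
# Hypothetical sample rows for illustration only.
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?, ?)",
    [
        (1, 10, 5.0, "2024-01-01 00:00:00"),
        (2, 10, None, "2024-01-01 00:05:00"),
        (3, 11, 7.5, None),
        (4, None, 7.5, "2024-01-02 00:00:00"),
    ],
)
profile = {
    row[0]: row[1:]
    for row in conn.execute(profile_query("transactions", COLUMNS))
}
```

`NULLIF(COUNT(*), 0)` guards against division by zero on an empty table. This form scans the table once per column; on large tables a single wide `SELECT` computing all aggregates in one pass, run over a sample or with an approximate distinct count where the engine offers one, is likely to be cheaper, at the cost of exactness.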
