InterviewStack.io

Data Quality and Validation Questions

Covers the core concepts and hands-on techniques for detecting, diagnosing, and preventing data quality problems. Topics include common data issues such as missing values, duplicates, outliers, incorrect labels, inconsistent formats, schema mismatches, referential integrity violations, and distribution or temporal drift. Candidates should be able to design and implement validation checks and data profiling queries, including schema validation, column-level constraints, aggregate checks, distinct counts, null and outlier detection, and business logic tests. This topic also covers the mindset of data validation and exploration: how to approach unfamiliar datasets, validate calculations against sources, document quality rules, decide on remediation strategies such as imputation, quarantine, or alerting, and communicate data limitations to stakeholders.

Hard · Technical
Implement a scalable process to compute Population Stability Index (PSI) per feature across daily production vs baseline using PySpark. Requirements: support continuous and categorical features, handle nulls, and compute incremental updates from daily partitions without full recompute.
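A minimal sketch of the core PSI math such a job would parallelize. In plain Python for illustration: in PySpark, `bucketize` would become a per-partition aggregation (e.g. `groupBy` on a bucket column) whose daily bucket counts are stored and merged incrementally, so PSI can be recomputed from counts without rescanning history. Nulls get their own bucket, and an `eps` floor keeps the log ratio finite for empty buckets; all names here are illustrative.

```python
import math
from collections import Counter

def bucketize(values, edges):
    """Map continuous values to bucket indices given sorted cut points.

    None values are counted in a dedicated null bucket (key None),
    so null-rate shifts contribute to PSI like any other bucket.
    For categorical features, the raw category is the bucket key.
    """
    out = Counter()
    for v in values:
        if v is None:
            out[None] += 1
        else:
            out[sum(v > e for e in edges)] += 1
    return out

def psi(baseline_counts, current_counts, eps=1e-6):
    """Population Stability Index between two bucket-count dicts.

    PSI = sum over buckets of (q - p) * ln(q / p), where p and q are
    the baseline and current bucket proportions. Buckets missing from
    either side are floored at eps to avoid division by zero.
    """
    buckets = set(baseline_counts) | set(current_counts)
    b_total = sum(baseline_counts.values()) or 1
    c_total = sum(current_counts.values()) or 1
    total = 0.0
    for b in buckets:
        p = max(baseline_counts.get(b, 0) / b_total, eps)
        q = max(current_counts.get(b, 0) / c_total, eps)
        total += (q - p) * math.log(q / p)
    return total
```

Because bucket counts are additive, daily partitions only need to emit their own counts; merging yesterday's counts with today's is a dict sum, which is what makes the incremental-update requirement cheap.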
Hard · Technical
Propose an algorithmic approach (and practical implementation sketch) to detect label noise at scale by modeling annotator reliability (e.g., Dawid-Skene), building an agreement graph, and using this to re-weight or relabel training examples. Discuss compute costs and incremental updates.
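A compact sketch of the Dawid-Skene EM loop an answer could build on, assuming integer item/annotator/label ids. It alternates estimating per-annotator confusion matrices (M-step) and per-item label posteriors (E-step); the posteriors can then be used to re-weight or relabel examples. At scale, both steps are sums over (item, annotator, label) triples, so they shard naturally and support incremental updates when new annotations arrive.

```python
import numpy as np

def dawid_skene(labels, n_classes, n_iter=20):
    """Minimal Dawid-Skene EM.

    labels: iterable of (item_id, annotator_id, observed_label) triples.
    Returns (item_posteriors, annotator_confusion) where
    confusion[a, t, o] = P(annotator a reports o | true class t).
    """
    labels = list(labels)
    n_items = 1 + max(i for i, _, _ in labels)
    n_annos = 1 + max(a for _, a, _ in labels)

    # Initialize posteriors from per-item vote fractions (majority-vote prior).
    post = np.full((n_items, n_classes), 1e-9)
    for i, a, l in labels:
        post[i, l] += 1.0
    post /= post.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # M-step: confusion matrices and class priors, smoothed to stay nonzero.
        conf = np.full((n_annos, n_classes, n_classes), 1e-6)
        for i, a, l in labels:
            conf[a, :, l] += post[i]
        conf /= conf.sum(axis=2, keepdims=True)
        prior = post.mean(axis=0)

        # E-step: recompute item posteriors in log space for stability.
        logp = np.tile(np.log(prior), (n_items, 1))
        for i, a, l in labels:
            logp[i] += np.log(conf[a, :, l])
        logp -= logp.max(axis=1, keepdims=True)
        post = np.exp(logp)
        post /= post.sum(axis=1, keepdims=True)
    return post, conf
```

An unreliable annotator shows up as off-diagonal mass in their confusion matrix, which automatically down-weights their votes in the posterior without any explicit agreement threshold.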
Medium · Technical
Explain how data lineage helps debug a sudden model regression. Describe how you would instrument and query lineage metadata to trace a problematic training example from origin (raw source) through transformations to model input, and identify potential breakages.
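The core debugging primitive is an upstream walk over lineage records. A toy sketch, with a hypothetical in-memory lineage store (real systems would query a metadata service such as OpenLineage-style events; all names below are invented for illustration):

```python
from collections import deque

# Hypothetical lineage store: each node maps an output artifact id to the
# transformation that produced it and its immediate upstream inputs.
LINEAGE = {
    "model_input:42": {"op": "feature_join", "inputs": ["features:42", "labels:42"]},
    "features:42": {"op": "normalize", "inputs": ["raw_events:42"]},
    "labels:42": {"op": "label_export", "inputs": ["annotations:42"]},
    "raw_events:42": {"op": "ingest", "inputs": []},
    "annotations:42": {"op": "ingest", "inputs": []},
}

def trace_upstream(node, lineage):
    """Breadth-first walk from a model input back toward raw sources.

    Returns the (artifact, op) pairs visited, in traversal order, so a
    debugger can inspect each transformation between source and model.
    """
    seen, path, queue = set(), [], deque([node])
    while queue:
        cur = queue.popleft()
        if cur in seen or cur not in lineage:
            continue
        seen.add(cur)
        record = lineage[cur]
        path.append((cur, record["op"]))
        queue.extend(record["inputs"])
    return path
```

With per-node metadata (job version, run timestamp, input checksums) attached to each record, diffing the traced path between a good run and the regressed run localizes the breakage to a single transformation.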
Medium · Technical
Write a PySpark job skeleton that enforces a given schema on a Parquet dataset of size ~1TB. Requirements: fail-fast on missing required columns, log rows that fail type coercion to a quarantine store, and produce a summary report (counts of errors per column). Show code structure and key APIs you would use.
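A plain-Python sketch of the row-level logic such a job applies, assuming an illustrative schema; in the actual PySpark skeleton the schema would be a `StructType` passed to `spark.read.schema(...)`, the fail-fast check would compare `df.columns` before any full scan, and the quarantine/good split would be two filtered writes. Column names and types here are invented for illustration.

```python
from collections import Counter

# Illustrative required schema: column name -> target type.
REQUIRED = {"user_id": int, "amount": float}

def validate(rows, schema=REQUIRED):
    """Split rows into (good, quarantined) and count coercion errors per column.

    Raises ValueError immediately if a required column is absent from the
    data entirely (fail-fast), mirroring an up-front check against the
    dataset's column list before scanning 1TB of Parquet.
    """
    good, quarantine, errors = [], [], Counter()
    if rows:
        missing = [c for c in schema if c not in rows[0]]
        if missing:
            raise ValueError(f"missing required columns: {missing}")
    for row in rows:
        coerced, bad = {}, False
        for col, typ in schema.items():
            try:
                coerced[col] = typ(row[col])
            except (KeyError, TypeError, ValueError):
                errors[col] += 1  # feeds the per-column summary report
                bad = True
        # Keep the original row in quarantine so it can be inspected/replayed.
        (quarantine if bad else good).append(row if bad else coerced)
    return good, quarantine, errors
```

The returned `errors` counter is the summary report; in Spark it would be computed as an aggregation over the quarantine output rather than accumulated in the driver.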
Hard · System Design
Design monitoring to detect concept drift concentrated in specific subpopulations (e.g., by region or device). Include metric definitions, partitioning strategy, efficient computation for high cardinality, and automated triggers for targeted retraining.
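A small sketch of the per-segment drift check at the heart of such a design, using PSI as the drift metric and bucket-count dicts keyed by segment (region, device, etc.). The names and the 0.2 alert threshold are illustrative; in production the counts would come from a partitioned aggregation so high-cardinality segments are computed in one pass, and the returned alerts would feed the retraining trigger.

```python
import math

def psi(p_counts, q_counts, eps=1e-6):
    """PSI between two bucket-count dicts, flooring empty buckets at eps."""
    keys = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values()) or 1
    q_total = sum(q_counts.values()) or 1
    score = 0.0
    for k in keys:
        p = max(p_counts.get(k, 0) / p_total, eps)
        q = max(q_counts.get(k, 0) / q_total, eps)
        score += (q - p) * math.log(q / p)
    return score

def segment_drift(baseline, current, threshold=0.2):
    """baseline/current: {segment: {bucket: count}}.

    Returns {segment: psi_score} for segments whose drift exceeds the
    threshold, i.e. candidates for targeted retraining. Segments absent
    from the baseline compare against empty counts and so alert loudly.
    """
    alerts = {}
    for segment, cur_counts in current.items():
        score = psi(baseline.get(segment, {}), cur_counts)
        if score > threshold:
            alerts[segment] = score
    return alerts
```

Scoring per segment rather than globally is what catches drift concentrated in one subpopulation: a large shift in a small region can be invisible in the aggregate distribution.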
