InterviewStack.io

Data Quality and Anomaly Detection Questions

Focuses on identifying, diagnosing, and preventing data issues that produce misleading or incorrect metrics. Topics include spotting duplicates, missing values, schema drift, logical inconsistencies, extreme outliers caused by instrumentation bugs, data latency and pipeline failures, and reconciliation differences between sources. Covers validation strategies such as data tests, checksums, row counts, data contracts, invariants, and automated alerting for quality metrics like completeness, accuracy, and timeliness. Also addresses investigation workflows for determining whether an anomaly is a data problem or a true business signal, documenting remediation steps, and collaborating with engineering and product teams to fix upstream causes.

Medium · Technical
Business users want up-to-date numbers but your source frequently provides only partial-day loads. Propose dashboard design and labeling practices (Power BI or Looker) to prevent misinterpretation and communicate data freshness, confidence, and partiality to end users while keeping dashboards useful.
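One concrete labeling tactic this question invites is a computed freshness/partiality badge per data source, rendered next to each dashboard card. A minimal Python sketch of the badge logic (the SLA threshold, the 90% completeness cutoff, and the function name are illustrative assumptions, not Power BI or Looker features):

```python
from datetime import datetime, timezone

def freshness_label(last_load_utc: datetime, rows_loaded: int,
                    expected_rows: int, sla_hours: float = 6.0) -> str:
    """Return a badge string a dashboard card could display.

    Thresholds here are illustrative; in practice they come from the
    source's load schedule and historical row-count baselines.
    """
    age_h = (datetime.now(timezone.utc) - last_load_utc).total_seconds() / 3600
    completeness = rows_loaded / expected_rows if expected_rows else 0.0
    if age_h > sla_hours:
        return f"STALE: last loaded {age_h:.1f}h ago"
    if completeness < 0.9:
        return f"PARTIAL: {completeness:.0%} of expected rows"
    return f"FRESH: updated {age_h:.1f}h ago"
```

The same computed status can drive conditional formatting (e.g. gray out partial-day tiles) so users never mistake a partial load for a real drop.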
Hard · System Design
Design an algorithm and pipeline to create a canonical customer profile from multiple sources (CRM, orders, support) with conflicting attributes, ensuring idempotency, auditable decision rules, and the ability to reprocess historical data. Describe deduplication, conflict resolution policies, and the storage format for the canonical profile including provenance metadata.
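One common family of answers uses a survivorship policy: rank records by source precedence and recency, then let the highest-ranked record win each attribute while recording provenance. A minimal Python sketch (the precedence table and record shape are assumptions for illustration):

```python
# Hypothetical survivorship merge: pick each attribute from the
# highest-precedence, most recent source and record where it came from.
SOURCE_PRECEDENCE = {"crm": 3, "orders": 2, "support": 1}  # assumed policy

def merge_profile(records):
    """records: list of dicts with 'source', 'updated_at', and attributes.

    Returns (profile, provenance). The merge is a pure function of its
    inputs, so re-running it over historical data is idempotent.
    """
    canonical, provenance = {}, {}
    ranked = sorted(records,
                    key=lambda r: (SOURCE_PRECEDENCE.get(r["source"], 0),
                                   r["updated_at"]))
    for rec in ranked:  # later (higher-ranked) records overwrite earlier ones
        for attr, value in rec.items():
            if attr in ("source", "updated_at") or value is None:
                continue
            canonical[attr] = value
            provenance[attr] = {"source": rec["source"],
                                "updated_at": rec["updated_at"]}
    return canonical, provenance
```

Keeping per-attribute provenance alongside the canonical value is what makes the decision rules auditable and lets reprocessing explain why any field changed.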
Medium · Technical
You have two systems: payments(payment_id, order_id, amount_cents, status, created_at) and orders(order_id, user_id, amount_cents, created_at). Write an ANSI SQL query that finds orders in the last 30 days with missing payments or mismatched amounts, and produce a reconciliation status column (ok/mismatch/missing). Explain performance considerations for large tables.
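A sketch of one possible answer, run against an in-memory SQLite database for reproducibility. The sample rows and the literal date cutoff are illustrative; in ANSI SQL the filter would be written against `CURRENT_DATE - INTERVAL '30' DAY`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders(order_id INT, user_id INT, amount_cents INT, created_at TEXT);
CREATE TABLE payments(payment_id INT, order_id INT, amount_cents INT,
                      status TEXT, created_at TEXT);
INSERT INTO orders VALUES (1, 10, 500, '2024-06-01'),
                          (2, 11, 700, '2024-06-02'),
                          (3, 12, 900, '2024-06-03');
INSERT INTO payments VALUES (101, 1, 500, 'settled', '2024-06-01'),
                            (102, 2, 650, 'settled', '2024-06-02');
""")
rows = conn.execute("""
SELECT o.order_id,
       CASE
         WHEN p.payment_id IS NULL THEN 'missing'
         WHEN p.amount_cents <> o.amount_cents THEN 'mismatch'
         ELSE 'ok'
       END AS reconciliation_status
FROM orders o
LEFT JOIN payments p ON p.order_id = o.order_id
WHERE o.created_at >= '2024-05-10'   -- stand-in for CURRENT_DATE - 30 days
ORDER BY o.order_id
""").fetchall()
# rows -> [(1, 'ok'), (2, 'mismatch'), (3, 'missing')]
```

For large tables the usual performance points apply: an index on `payments(order_id)`, partition pruning on `created_at` so only 30 days are scanned, and pre-aggregating payments per order if an order can have multiple payment rows.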
Hard · Technical
You must reconcile two 1TB nightly tables but full exact comparisons are too slow. Propose approximate algorithms and approaches (hash sampling, bloom filters, locality-sensitive hashing) to detect mismatches with quantifiable error bounds, and explain how and when to escalate approximate mismatches to exact verification.
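The hash-based approaches this question points at share one idea: summarize each table into small per-bucket digests, compare only the digests, and escalate mismatched buckets to exact row-level diffing. A toy Python sketch (bucket count and row format are assumptions; real systems compute digests inside the warehouse):

```python
import hashlib

def bucket_digests(rows, n_buckets=16):
    """Hash each row's key into a bucket and combine row digests per bucket.

    XOR is order-independent, so digests match regardless of row order.
    """
    digests = [0] * n_buckets
    for key, payload in rows:
        h = hashlib.sha256(f"{key}|{payload}".encode()).digest()
        bucket = int.from_bytes(h[:4], "big") % n_buckets
        digests[bucket] ^= int.from_bytes(h, "big")
    return digests

def mismatched_buckets(a, b):
    """Bucket indices whose digests disagree -> candidates for exact diff."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
```

With N buckets, a single differing row localizes the exact comparison to roughly 1/N of each table; sampling rows within mismatched buckets then gives quantifiable bounds on how many differences remain undetected before full verification.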
Hard · System Design
Design a system that correlates pipeline logs, schema-change records, deployment events, and data quality metrics to automatically surface likely root causes for metric breaks. Describe the data model for correlation, indexing strategies, heuristics to rank candidates, and a UI that helps on-call engineers quickly validate suggestions.
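One simple heuristic such a system can start with is scoring candidate events (deploys, schema changes, pipeline errors) by how closely they precede the metric break, weighted by event type. A toy Python sketch (the weights, decay function, and window are assumptions to illustrate the ranking idea):

```python
from datetime import datetime

# Assumed per-type prior weights; tuned from historical incident data
# in a real system.
TYPE_WEIGHT = {"schema_change": 3.0, "deploy": 2.0, "pipeline_error": 2.5}

def rank_candidates(break_time, events, window_hours=24):
    """Rank events that precede the break within the window.

    Score = type weight / (1 + hours before the break), so recent,
    high-risk events surface first for the on-call engineer to validate.
    """
    scored = []
    for ev in events:
        lag_h = (break_time - ev["time"]).total_seconds() / 3600
        if 0 <= lag_h <= window_hours:  # only events before the break
            score = TYPE_WEIGHT.get(ev["type"], 1.0) / (1.0 + lag_h)
            scored.append((score, ev))
    return [ev for _, ev in sorted(scored, key=lambda s: -s[0])]
```

A ranked list like this is only the surfacing layer; the interview answer still needs the correlation data model and a UI that lets the on-call engineer confirm or dismiss each suggestion.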
