
Data Quality and Anomaly Detection Questions

Focuses on identifying, diagnosing, and preventing data issues that produce misleading or incorrect metrics. Topics include spotting duplicates, missing values, schema drift, logical inconsistencies, extreme outliers caused by instrumentation bugs, data latency and pipeline failures, and reconciliation differences between sources. Covers validation strategies such as data tests, checksums, row counts, data contracts, invariants, and automated alerting for quality metrics like completeness, accuracy, and timeliness. Also addresses investigation workflows to determine whether anomalies are data problems versus true business signals, documenting remediation steps, and collaborating with engineering and product teams to fix upstream causes.

Hard · System Design
91 practiced
Design an enterprise-grade data quality platform to monitor 1000+ tables across both batch and streaming pipelines. Your design should cover metadata ingestion, rule catalog and execution, lineage, time-series metrics store, alerting/escalation, integration with ticketing, and multi-tenant concerns. Assume streaming throughput up to 10M events/sec and batch data ~2 TB/day.
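One building block of such a platform is the rule catalog and executor. A minimal sketch (all names here are hypothetical, and a real executor would push computation down to the warehouse rather than pull rows into memory):

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical minimal rule catalog: each rule names a table, a metric
# to compute over its rows, and a predicate that decides pass/fail.
@dataclass
class Rule:
    table: str
    name: str
    metric: Callable[[list[dict]], float]   # computes a quality metric
    passes: Callable[[float], bool]         # threshold check on that metric

def run_rules(catalog: list[Rule], tables: dict[str, list[dict]]):
    """Execute every rule against its table; return (table, rule, value, ok) tuples."""
    results = []
    for rule in catalog:
        rows = tables.get(rule.table, [])
        value = rule.metric(rows)
        results.append((rule.table, rule.name, value, rule.passes(value)))
    return results

# Example rule: null rate on orders.customer_id must stay under 1%.
catalog = [
    Rule(
        table="orders",
        name="customer_id_null_rate",
        metric=lambda rows: sum(r["customer_id"] is None for r in rows) / max(len(rows), 1),
        passes=lambda v: v < 0.01,
    )
]
tables = {"orders": [{"customer_id": 1}, {"customer_id": None}, {"customer_id": 2}]}
print(run_rules(catalog, tables))
```

In a production design the rule results would be written to the time-series metrics store and failures routed to the alerting/escalation layer.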
Easy · Technical
80 practiced
What metrics and indicators would you use to detect data latency for daily batch pipelines? Propose a simple alerting rule for: 'daily orders table not updated within 2 hours of expected arrival'. Explain measurement methods (latest ingest timestamp, row counts, max event timestamp) and brief pros/cons.
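One possible answer, sketched as code: compare the table's latest ingest timestamp against the expected arrival time plus a 2-hour grace window. The function and parameter names below are illustrative, not from any particular scheduler:

```python
from datetime import datetime, timedelta, timezone

def orders_table_is_late(latest_ingest: datetime,
                         expected_arrival: datetime,
                         now: datetime,
                         grace: timedelta = timedelta(hours=2)) -> bool:
    """Alert condition: the expected daily load has not landed
    and the grace window after expected arrival has elapsed."""
    return latest_ingest < expected_arrival and now > expected_arrival + grace

# The 10:00 UTC load hasn't landed by 12:30 -> alert fires.
expected = datetime(2024, 5, 1, 10, 0, tzinfo=timezone.utc)
last_seen = datetime(2024, 4, 30, 10, 5, tzinfo=timezone.utc)
print(orders_table_is_late(last_seen, expected, datetime(2024, 5, 1, 12, 30, tzinfo=timezone.utc)))
```

Passing `now` explicitly (rather than calling `datetime.now()` inside) keeps the rule deterministic and easy to test.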
Easy · Technical
92 practiced
Given a table `orders(order_id, customer_id, created_at, total_amount)`, write a SQL query (ANSI SQL) to identify likely duplicate orders where the same customer_id and total_amount appear within 1 minute of each other. Explain how you would reduce false positives and scale this check for a large historical table.
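One common answer is a self-join on customer and amount with a 60-second window. The sketch below runs it against an in-memory SQLite table; SQLite's `julianday()` stands in for ANSI interval arithmetic, so the date math would differ on other engines:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE orders (
    order_id INTEGER, customer_id INTEGER, created_at TEXT, total_amount REAL)""")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, 100, "2024-05-01 10:00:00", 49.99),
        (2, 100, "2024-05-01 10:00:30", 49.99),  # likely duplicate of order 1
        (3, 100, "2024-05-01 12:00:00", 49.99),  # same amount, far apart in time
        (4, 200, "2024-05-01 10:00:10", 19.50),
    ],
)

# Self-join: same customer, same amount, distinct orders, within 60 seconds.
# a.order_id < b.order_id reports each candidate pair only once.
dupes = conn.execute("""
    SELECT a.order_id, b.order_id
    FROM orders a
    JOIN orders b
      ON a.customer_id = b.customer_id
     AND a.total_amount = b.total_amount
     AND a.order_id < b.order_id
     AND ABS(julianday(a.created_at) - julianday(b.created_at)) * 86400.0 <= 60
""").fetchall()
print(dupes)  # -> [(1, 2)]
```

To scale this on a large table, you would typically partition by date and customer so the join only compares nearby rows, and add attributes (e.g. item list hash) to cut false positives.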
Hard · Technical
73 practiced
Propose an automated remediation framework for common data-quality issues: missing partitions, negative amounts, and duplicate events. For each issue describe automatic fix logic (if safe), backfill strategy, risk assessment, audit logging, and criteria for when to require human review instead of auto-fixing.
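The triage decision at the heart of such a framework can be sketched as a small policy function. Everything here (issue names, the row-count threshold, the action labels) is a hypothetical illustration of the pattern, not a prescribed implementation:

```python
# Auto-fix only when the fix is idempotent/reversible and the blast
# radius is small; otherwise route the issue to human review.
SAFE_AUTO_FIXES = {
    "duplicate_events":  "dedupe_by_event_id",  # idempotent; reversible via audit log
    "missing_partition": "trigger_backfill",    # re-runs an idempotent load
}

def remediation_action(issue_type: str, affected_rows: int,
                       auto_fix_row_limit: int = 10_000) -> str:
    """Return the action to take for a detected data-quality issue."""
    if issue_type in SAFE_AUTO_FIXES and affected_rows <= auto_fix_row_limit:
        return SAFE_AUTO_FIXES[issue_type]
    # Negative amounts land here: the correct fix depends on business
    # semantics (refund? bug?), so a human must decide.
    return "open_ticket_for_human_review"

print(remediation_action("duplicate_events", 500))   # -> dedupe_by_event_id
print(remediation_action("negative_amounts", 3))     # -> open_ticket_for_human_review
```

In a full answer, every branch (including auto-fixes) would also emit an audit-log entry recording what was changed and why.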
Easy · Technical
75 practiced
Explain what a data contract is in plain language. Show a compact example of a data contract for a `product_catalog` topic that specifies field names, types, required vs optional, and one invariant (e.g., price >= 0). Explain how this contract helps downstream analysts.
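A compact answer might express the contract as a schema plus invariants, with a validator that downstream consumers can run. The field set and validator below are an assumed example shape, not a standard format:

```python
# Hypothetical data contract for a `product_catalog` topic:
# field names, types, required vs optional, and one invariant.
PRODUCT_CATALOG_CONTRACT = {
    "fields": {
        "product_id": {"type": str,   "required": True},
        "name":       {"type": str,   "required": True},
        "price":      {"type": float, "required": True},
        "category":   {"type": str,   "required": False},
    },
    "invariants": [("price_non_negative", lambda rec: rec["price"] >= 0)],
}

def validate(record: dict, contract: dict) -> list[str]:
    """Return a list of contract violations (empty means the record conforms)."""
    errors = []
    for field, spec in contract["fields"].items():
        if field not in record:
            if spec["required"]:
                errors.append(f"missing required field: {field}")
            continue
        if not isinstance(record[field], spec["type"]):
            errors.append(f"wrong type for {field}")
    for name, check in contract["invariants"]:
        try:
            if not check(record):
                errors.append(f"invariant violated: {name}")
        except KeyError:
            pass  # a missing field was already reported above
    return errors

print(validate({"product_id": "p1", "name": "Mug", "price": -2.0},
               PRODUCT_CATALOG_CONTRACT))
```

For downstream analysts, the value is that violations surface at the producer boundary, before bad records reach dashboards.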
