InterviewStack.io

Data Quality and Anomaly Detection Questions

Focuses on identifying, diagnosing, and preventing data issues that produce misleading or incorrect metrics. Topics include spotting duplicates, missing values, schema drift, logical inconsistencies, extreme outliers caused by instrumentation bugs, data latency and pipeline failures, and reconciliation differences between sources. Covers validation strategies such as data tests, checksums, row counts, data contracts, invariants, and automated alerting for quality metrics like completeness, accuracy, and timeliness. Also addresses investigation workflows for determining whether an anomaly is a data problem or a true business signal, documenting remediation steps, and collaborating with engineering and product teams to fix upstream causes.

Medium · Technical
Business users want up-to-date numbers but your source frequently provides only partial-day loads. Propose dashboard design and labeling practices (Power BI or Looker) to prevent misinterpretation and communicate data freshness, confidence, and partiality to end users while keeping dashboards useful.
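One concrete labeling tactic this question invites is a computed freshness/partiality badge per data source, rendered next to each dashboard card. A minimal Python sketch of the badge logic (the SLA threshold, the 90% completeness cutoff, and the function name are illustrative assumptions, not Power BI or Looker features):

```python
from datetime import datetime, timezone

def freshness_label(last_load_utc: datetime, rows_loaded: int,
                    expected_rows: int, sla_hours: float = 6.0) -> str:
    """Return a badge string a dashboard card could display.

    Thresholds here are illustrative; in practice they come from the
    source's load schedule and historical row-count baselines.
    """
    age_h = (datetime.now(timezone.utc) - last_load_utc).total_seconds() / 3600
    completeness = rows_loaded / expected_rows if expected_rows else 0.0
    if age_h > sla_hours:
        return f"STALE: last loaded {age_h:.1f}h ago"
    if completeness < 0.9:
        return f"PARTIAL: {completeness:.0%} of expected rows"
    return f"FRESH: updated {age_h:.1f}h ago"
```

The same computed status can drive conditional formatting (e.g. gray out partial-day tiles) so users never mistake a partial load for a real drop.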
Hard · System Design
Design an algorithm and pipeline to create a canonical customer profile from multiple sources (CRM, orders, support) with conflicting attributes, ensuring idempotency, auditable decision rules, and the ability to reprocess historical data. Describe deduplication, conflict resolution policies, and the storage format for the canonical profile including provenance metadata.
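One common family of answers uses a survivorship policy: rank records by source precedence and recency, then let the highest-ranked record win each attribute while recording provenance. A minimal Python sketch (the precedence table and record shape are assumptions for illustration):

```python
# Hypothetical survivorship merge: pick each attribute from the
# highest-precedence, most recent source and record where it came from.
SOURCE_PRECEDENCE = {"crm": 3, "orders": 2, "support": 1}  # assumed policy

def merge_profile(records):
    """records: list of dicts with 'source', 'updated_at', and attributes.

    Returns (profile, provenance). The merge is a pure function of its
    inputs, so re-running it over historical data is idempotent.
    """
    canonical, provenance = {}, {}
    ranked = sorted(records,
                    key=lambda r: (SOURCE_PRECEDENCE.get(r["source"], 0),
                                   r["updated_at"]))
    for rec in ranked:  # later (higher-ranked) records overwrite earlier ones
        for attr, value in rec.items():
            if attr in ("source", "updated_at") or value is None:
                continue
            canonical[attr] = value
            provenance[attr] = {"source": rec["source"],
                                "updated_at": rec["updated_at"]}
    return canonical, provenance
```

Keeping per-attribute provenance alongside the canonical value is what makes the decision rules auditable and lets reprocessing explain why any field changed.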
Medium · Technical
You have two systems: payments(payment_id, order_id, amount_cents, status, created_at) and orders(order_id, user_id, amount_cents, created_at). Write an ANSI SQL query that finds orders in the last 30 days with missing payments or mismatched amounts, and produce a reconciliation status column (ok/mismatch/missing). Explain performance considerations for large tables.
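A sketch of one possible answer, run against an in-memory SQLite database for reproducibility. The sample rows and the literal date cutoff are illustrative; in ANSI SQL the filter would be written against `CURRENT_DATE - INTERVAL '30' DAY`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders(order_id INT, user_id INT, amount_cents INT, created_at TEXT);
CREATE TABLE payments(payment_id INT, order_id INT, amount_cents INT,
                      status TEXT, created_at TEXT);
INSERT INTO orders VALUES (1, 10, 500, '2024-06-01'),
                          (2, 11, 700, '2024-06-02'),
                          (3, 12, 900, '2024-06-03');
INSERT INTO payments VALUES (101, 1, 500, 'settled', '2024-06-01'),
                            (102, 2, 650, 'settled', '2024-06-02');
""")
rows = conn.execute("""
SELECT o.order_id,
       CASE
         WHEN p.payment_id IS NULL THEN 'missing'
         WHEN p.amount_cents <> o.amount_cents THEN 'mismatch'
         ELSE 'ok'
       END AS reconciliation_status
FROM orders o
LEFT JOIN payments p ON p.order_id = o.order_id
WHERE o.created_at >= '2024-05-10'   -- stand-in for CURRENT_DATE - 30 days
ORDER BY o.order_id
""").fetchall()
# rows -> [(1, 'ok'), (2, 'mismatch'), (3, 'missing')]
```

For large tables the usual performance points apply: an index on `payments(order_id)`, partition pruning on `created_at` so only 30 days are scanned, and pre-aggregating payments per order if an order can have multiple payment rows.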
Hard · Technical
You must reconcile two 1TB nightly tables but full exact comparisons are too slow. Propose approximate algorithms and approaches (hash sampling, bloom filters, locality-sensitive hashing) to detect mismatches with quantifiable error bounds, and explain how and when to escalate approximate mismatches to exact verification.
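The hash-based approaches this question points at share one idea: summarize each table into small per-bucket digests, compare only the digests, and escalate mismatched buckets to exact row-level diffing. A toy Python sketch (bucket count and row format are assumptions; real systems compute digests inside the warehouse):

```python
import hashlib

def bucket_digests(rows, n_buckets=16):
    """Hash each row's key into a bucket and combine row digests per bucket.

    XOR is order-independent, so digests match regardless of row order.
    """
    digests = [0] * n_buckets
    for key, payload in rows:
        h = hashlib.sha256(f"{key}|{payload}".encode()).digest()
        bucket = int.from_bytes(h[:4], "big") % n_buckets
        digests[bucket] ^= int.from_bytes(h, "big")
    return digests

def mismatched_buckets(a, b):
    """Bucket indices whose digests disagree -> candidates for exact diff."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
```

With N buckets, a single differing row localizes the exact comparison to roughly 1/N of each table; sampling rows within mismatched buckets then gives quantifiable bounds on how many differences remain undetected before full verification.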
Hard · System Design
Design a system that correlates pipeline logs, schema-change records, deployment events, and data quality metrics to automatically surface likely root causes for metric breaks. Describe the data model for correlation, indexing strategies, heuristics to rank candidates, and a UI that helps on-call engineers quickly validate suggestions.
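One simple heuristic such a system can start with is scoring candidate events (deploys, schema changes, pipeline errors) by how closely they precede the metric break, weighted by event type. A toy Python sketch (the weights, decay function, and window are assumptions to illustrate the ranking idea):

```python
from datetime import datetime

# Assumed per-type prior weights; tuned from historical incident data
# in a real system.
TYPE_WEIGHT = {"schema_change": 3.0, "deploy": 2.0, "pipeline_error": 2.5}

def rank_candidates(break_time, events, window_hours=24):
    """Rank events that precede the break within the window.

    Score = type weight / (1 + hours before the break), so recent,
    high-risk events surface first for the on-call engineer to validate.
    """
    scored = []
    for ev in events:
        lag_h = (break_time - ev["time"]).total_seconds() / 3600
        if 0 <= lag_h <= window_hours:  # only events before the break
            score = TYPE_WEIGHT.get(ev["type"], 1.0) / (1.0 + lag_h)
            scored.append((score, ev))
    return [ev for _, ev in sorted(scored, key=lambda s: -s[0])]
```

A ranked list like this is only the surfacing layer; the interview answer still needs the correlation data model and a UI that lets the on-call engineer confirm or dismiss each suggestion.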
