Covers the principles, frameworks, practices, and tooling used to ensure data is accurate, complete, timely, and trustworthy across systems and pipelines. Key areas include data quality checks and monitoring (nullness and type checks, freshness and timeliness validation, referential integrity, deduplication, outlier detection, reconciliation, and automated alerting); the design of service-level agreements for data freshness and accuracy; data lineage and impact analysis; metadata and catalog management; and data classification, access controls, and compliance policies. It also encompasses the operational reliability of data systems, including failure handling, recovery time objectives, backup and disaster-recovery strategies, and observability and incident response for data anomalies. Domain- and system-specific considerations, such as CRM and sales systems, cover common causes of data problems; prevention strategies like input validation rules, canonicalization, deduplication, and training; and the business impact on forecasting and operations. Candidates may be evaluated on designing end-to-end data quality programs, selecting metrics and tooling, defining roles and stewardship, and implementing automated pipelines and governance controls.
Medium · System Design
40 practiced
Design an observability and monitoring stack for data quality targeting revenue datasets. List the key dataset-level and pipeline-level metrics you would collect (e.g., row-count delta, null rate, uniqueness, value-distribution summaries, schema changes), describe how you'd detect distributional drift or silent failures, and outline sample dashboards and alerting rules for engineers vs. business stakeholders.
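A minimal sketch of the per-column profiling such a stack might compute; the metric set and function names (`profile_column`, `row_count_delta`) are illustrative, not a reference to any particular monitoring tool.

```python
from collections import Counter

def profile_column(values):
    """Dataset-level quality metrics for one column: row count,
    null rate, uniqueness ratio, and the top values by frequency."""
    n = len(values)
    nulls = sum(1 for v in values if v is None)
    non_null = [v for v in values if v is not None]
    distinct = len(set(non_null))
    return {
        "row_count": n,
        "null_rate": nulls / n if n else 0.0,
        "uniqueness": distinct / len(non_null) if non_null else 0.0,
        "top_values": Counter(non_null).most_common(3),
    }

def row_count_delta(today, yesterday):
    """Relative day-over-day row-count change; a large swing in either
    direction is a common alerting signal for silent pipeline failures."""
    if yesterday == 0:
        return float("inf") if today else 0.0
    return (today - yesterday) / yesterday
```

In practice these summaries would be emitted per partition per run and compared against a rolling baseline, with distributional drift detected by comparing successive `top_values` histograms.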
Medium · Technical
37 practiced
Design an approach to perform deduplication in a streaming architecture where lead events arrive in Kafka at 5k-10k events/sec. The dedupe window is 24 hours, and end-to-end latency must remain under 1 second. Outline the architecture (Kafka topics, stream processor, state store), explain how you will manage state (TTL, compaction), handle late-arriving events, and ensure correctness and performance.
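The state-management core of such a design can be sketched as a keyed store with a TTL. This in-memory version stands in for a stream processor's state store (e.g. RocksDB behind Kafka Streams or Flink); the class and method names are illustrative.

```python
import time

class TtlDedupeStore:
    """Keyed first-seen timestamps with a TTL equal to the dedupe window.
    An event is a duplicate if its key was seen within the window."""

    def __init__(self, ttl_seconds=24 * 3600):
        self.ttl = ttl_seconds
        self.seen = {}  # event_key -> first-seen timestamp

    def is_duplicate(self, key, now=None):
        now = time.time() if now is None else now
        first_seen = self.seen.get(key)
        if first_seen is not None and now - first_seen < self.ttl:
            return True
        # New key, or old entry expired: record and let the event through.
        self.seen[key] = now
        return False

    def expire(self, now=None):
        """Periodic cleanup; real state stores do this via TTL/compaction."""
        now = time.time() if now is None else now
        self.seen = {k: t for k, t in self.seen.items() if now - t < self.ttl}
```

Late-arriving events still hit the same window check as long as their key's entry has not expired, which is one reason the TTL should be padded beyond the nominal 24 hours.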
Easy · Technical
65 practiced
Describe deduplication strategies for leads and contacts in a CRM. Cover deterministic approaches (unique identifiers), rule-based matching (email/phone), probabilistic/fuzzy matching, blocking strategies, and the golden-record/master-data approach. For each strategy, explain typical false-positive/false-negative trade-offs and recommend which approach fits an early-stage startup versus an enterprise with multiple source systems.
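A minimal sketch of the canonicalization step that rule-based matching depends on; the normalization rules shown (lowercasing, stripping `+tag` local parts, digits-only phones) are common examples, not an exhaustive or authoritative set.

```python
import re

def canonical_email(email):
    """Normalize an email for deterministic matching: trim, lowercase,
    and drop a '+tag' suffix in the local part."""
    local, _, domain = email.strip().lower().partition("@")
    local = local.split("+", 1)[0]
    return f"{local}@{domain}"

def canonical_phone(phone):
    """Keep digits only; a production system would normalize to E.164."""
    return re.sub(r"\D", "", phone)

def match_key(record):
    """Rule-based match key: prefer canonical email, fall back to phone."""
    if record.get("email"):
        return ("email", canonical_email(record["email"]))
    if record.get("phone"):
        return ("phone", canonical_phone(record["phone"]))
    return ("unmatched", None)
```

Records sharing a `match_key` collapse into one candidate group; everything that falls through to `("unmatched", None)` is where fuzzy/probabilistic matching takes over.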
Hard · Technical
47 practiced
Design an algorithm and provide readable pseudo-code (Python-style) to probabilistically deduplicate contact records when unique identifiers are missing. Use attributes like name, email, phone, company, and address. Include blocking strategy, feature weighting, similarity metrics (e.g., Levenshtein, Jaro-Winkler), scoring, and thresholds, and explain how you'd measure and tune precision/recall in production.
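A compact sketch of the blocking-plus-weighted-scoring shape the question asks for. The weights are illustrative placeholders to be tuned against labeled pairs, and the stdlib Ratcliff/Obershelp ratio (`difflib.SequenceMatcher`) stands in for Jaro-Winkler or Levenshtein similarity.

```python
from collections import defaultdict
from difflib import SequenceMatcher

def sim(a, b):
    """String similarity in [0, 1]; stand-in for Jaro-Winkler/Levenshtein."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Illustrative weights: email carries the most evidence, then phone.
WEIGHTS = {"email": 0.4, "phone": 0.25, "name": 0.2, "company": 0.15}

def pair_score(r1, r2):
    """Weighted attribute similarity; compare against tuned thresholds
    (e.g. auto-merge above one cutoff, human review in a gray band)."""
    return sum(w * sim(r1.get(f, ""), r2.get(f, "")) for f, w in WEIGHTS.items())

def candidate_pairs(records, block_key):
    """Blocking: only score pairs sharing a cheap key (e.g. the first
    three letters of the surname) to avoid O(n^2) comparisons."""
    blocks = defaultdict(list)
    for i, r in enumerate(records):
        blocks[block_key(r)].append(i)
    for members in blocks.values():
        for a in range(len(members)):
            for b in range(a + 1, len(members)):
                yield members[a], members[b]
```

Precision/recall tuning then reduces to sweeping the score threshold over a labeled sample of candidate pairs and monitoring merge-reversal rates in production.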
Easy · Technical
37 practiced
Marketing automation reports 12,000 leads last month while CRM shows 9,500 new leads. Outline a step-by-step reconciliation runbook you would follow to find the root cause. Include specific checks/queries you would run (e.g., match on external_id, compare UTM parameters, check timezones, dedupe), likely causes you would prioritize, and short-term fixes to ensure accurate pipeline metrics for the next month.
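The first concrete step of such a runbook, matching extracts from both systems on `external_id`, can be sketched as a set diff; the function name and output buckets are illustrative.

```python
def reconcile(marketing, crm, key="external_id"):
    """Match two lead extracts on a shared key and bucket the gap:
    ids present in only one system, plus duplicate keys inflating counts."""
    def keyed(rows):
        out, dupes = {}, 0
        for row in rows:
            k = row[key]
            if k in out:
                dupes += 1  # repeated key: a likely source of overcounting
            out[k] = row
        return out, dupes

    m, m_dupes = keyed(marketing)
    c, c_dupes = keyed(crm)
    return {
        "only_in_marketing": sorted(set(m) - set(c)),
        "only_in_crm": sorted(set(c) - set(m)),
        "marketing_dupes": m_dupes,
        "crm_dupes": c_dupes,
    }
```

The size of each bucket tells you where to dig next: a large `only_in_marketing` bucket points at sync filters or failed API pushes, while high dupe counts point at the deduplication and canonicalization checks.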