InterviewStack.io

Data Quality and Governance Questions

Covers the principles, frameworks, practices, and tooling used to ensure data is accurate, complete, timely, and trustworthy across systems and pipelines. Key areas include:

- Data quality checks and monitoring: nullness and type checks, freshness and timeliness validation, referential integrity, deduplication, outlier detection, reconciliation, and automated alerting (a minimal check is sketched below).
- Governance: design of service-level agreements for data freshness and accuracy, data lineage and impact analysis, metadata and catalog management, data classification, access controls, and compliance policies.
- Operational reliability: failure handling, recovery time objectives, backup and disaster recovery strategies, and observability and incident response for data anomalies.
- Domain- and system-specific considerations, such as customer relationship management (CRM) and sales systems: common causes of data problems; prevention strategies such as input validation rules, canonicalization, deduplication, and training; and the business impact on forecasting and operations.

Candidates may be evaluated on designing end-to-end data quality programs, selecting metrics and tooling, defining roles and stewardship, and implementing automated pipelines and governance controls.
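For instance, a minimal freshness-and-nullness check of the kind listed above might look like the following, assuming a hypothetical orders(order_id, customer_id, created_at) table and Postgres-style SQL:

-- Hypothetical table: orders(order_id, customer_id, created_at).
-- Freshness: breach if no row has arrived in the last hour;
-- nullness: breach if more than 1% of today's rows lack a customer_id.
-- Note: both expressions return NULL on a day with no rows; a production
-- check would COALESCE that into an explicit breach.
SELECT
    MAX(created_at) < NOW() - INTERVAL '1 hour' AS freshness_breach,
    AVG(CASE WHEN customer_id IS NULL THEN 1.0 ELSE 0.0 END) > 0.01 AS nullness_breach
FROM orders
WHERE created_at >= CURRENT_DATE;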

Medium · Technical
You detect ~5% duplicate records in a 'customer_master' table that was built by merging multiple source systems. As a data scientist, outline a pragmatic, step-by-step plan to deduplicate it and implement master data management (MDM): discovery, blocking/matching strategy, rules for merging conflicting attributes, an audit trail and rollback plan, and continuous prevention measures.
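
A minimal sketch of the blocking-and-survivorship step, assuming a hypothetical customer_master(customer_id, email, full_name, updated_at) layout and standard window functions; real matching would add fuzzy keys (phone, name) and source-priority rules:

-- Block on canonicalized email; within each block, keep the most recently
-- updated record as the survivor (last-write-wins) and map every other
-- record to it so the merge can be audited and rolled back.
WITH blocked AS (
    SELECT
        customer_id,
        ROW_NUMBER() OVER (
            PARTITION BY LOWER(TRIM(email))
            ORDER BY updated_at DESC
        ) AS rn,
        FIRST_VALUE(customer_id) OVER (
            PARTITION BY LOWER(TRIM(email))
            ORDER BY updated_at DESC
        ) AS survivor_id
    FROM customer_master
    WHERE email IS NOT NULL
)
SELECT customer_id AS duplicate_id, survivor_id
FROM blocked
WHERE rn > 1;   -- write these pairs to a merge-audit table before collapsing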
Hard · System Design
Architect a scalable reconciliation system that can detect discrepancies across 50 heterogeneous sources (databases, event streams, APIs), compute an impact score per discrepancy, and trigger prioritized corrective actions. Describe ingestion, normalization to canonical metrics, matching algorithms, source reliability scoring, and operator UI components.
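
One way to sketch the scoring step, assuming hypothetical canonical_metrics(source_id, metric_name, metric_day, metric_value) and source_reliability(source_id, reliability_weight) tables produced by the normalization stage:

-- Compare each source against a reliability-weighted consensus value and
-- rank discrepancies by relative gap, which feeds the impact score and
-- the prioritized corrective-action queue.
WITH consensus AS (
    SELECT
        m.metric_name,
        m.metric_day,
        SUM(m.metric_value * r.reliability_weight)
            / SUM(r.reliability_weight) AS weighted_value
    FROM canonical_metrics m
    JOIN source_reliability r ON r.source_id = m.source_id
    GROUP BY m.metric_name, m.metric_day
)
SELECT
    m.source_id,
    m.metric_name,
    m.metric_day,
    ABS(m.metric_value - c.weighted_value)
        / NULLIF(ABS(c.weighted_value), 0) AS relative_gap
FROM canonical_metrics m
JOIN consensus c
  ON c.metric_name = m.metric_name AND c.metric_day = m.metric_day
WHERE ABS(m.metric_value - c.weighted_value)
        / NULLIF(ABS(c.weighted_value), 0) > 0.01   -- per-metric tolerance
ORDER BY relative_gap DESC;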
Medium · Technical
Write an SQL query to reconcile total events per day between a streaming_events source and a daily_aggregates table and flag days where totals differ by >1%. Schemas:
streaming_events(event_id, user_id, event_type, occurred_at timestamp)
daily_aggregates(event_day date, event_type, event_count bigint)
Assume streaming_events may contain duplicates and late arrivals. Explain how you'd choose tolerances and handle late-arriving events.
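
One possible answer sketch, using the schemas above and Postgres-style date arithmetic; it assumes duplicates in streaming_events are exact replays of the same event_id:

-- Deduplicate by event_id, roll up to day/type, then compare against
-- daily_aggregates with a 1% relative tolerance. Late arrivals: only
-- reconcile days older than the pipeline's watermark (here, 2 days),
-- so rows that land after the aggregate was built do not raise alerts.
WITH deduped AS (
    SELECT DISTINCT event_id, event_type,
           CAST(occurred_at AS DATE) AS event_day
    FROM streaming_events
),
stream_daily AS (
    SELECT event_day, event_type, COUNT(*) AS stream_count
    FROM deduped
    GROUP BY event_day, event_type
)
SELECT
    a.event_day,
    a.event_type,
    a.event_count AS batch_count,
    s.stream_count,
    ABS(s.stream_count - a.event_count) * 1.0
        / NULLIF(a.event_count, 0) AS relative_diff
FROM daily_aggregates a
JOIN stream_daily s
  ON s.event_day = a.event_day AND s.event_type = a.event_type
WHERE a.event_day < CURRENT_DATE - 2                 -- watermark for late data
  AND ABS(s.stream_count - a.event_count) * 1.0
        / NULLIF(a.event_count, 0) > 0.01;           -- flag >1% discrepancies

A FULL OUTER JOIN in place of the inner join would additionally surface days present on only one side.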
Hard · System Design
Propose a canonical data model and governance approach for customer data spanning CRM, billing, support, and marketing systems. Define canonical entities, reconciliation rules for conflicting attributes (last-write, source-priority, merge rules), provenance capture, data ownership, and an API contract for analytics access.
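
A minimal DDL sketch of provenance capture for one canonical entity; all table and column names here are illustrative, not a prescribed standard:

-- Canonical entity: one golden record per customer.
CREATE TABLE canonical_customer (
    canonical_id    BIGINT PRIMARY KEY,
    email           VARCHAR(320),
    full_name       VARCHAR(200),
    updated_at      TIMESTAMP NOT NULL
);

-- Provenance: one row per attribute per contributing source, so conflicts
-- can be resolved by source priority or last write and audited afterwards.
CREATE TABLE canonical_customer_provenance (
    canonical_id    BIGINT NOT NULL REFERENCES canonical_customer (canonical_id),
    attribute_name  VARCHAR(100) NOT NULL,
    source_system   VARCHAR(50)  NOT NULL,    -- e.g. 'crm', 'billing', 'support'
    source_value    VARCHAR(1000),
    source_priority INT NOT NULL,             -- lower number wins under source-priority
    observed_at     TIMESTAMP NOT NULL,       -- enables last-write-wins
    PRIMARY KEY (canonical_id, attribute_name, source_system)
);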
Medium · System Design
Design an automated pipeline that detects degraded feature quality (distribution shift, rising nulls) for production features and triggers model rollback. Describe monitoring signals, gating logic, integration with CI/CD, canary rollout approach, and safe rollback procedure with auditability.
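
A minimal sketch of one monitoring signal (a rising null rate per feature), assuming a hypothetical feature_values(feature_name, feature_day, value) table and Postgres-style date arithmetic; distribution-shift signals such as PSI would sit alongside it in the gating logic:

-- Compare today's null rate per feature against a 28-day baseline and
-- emit rows that should trip the gate for canary hold or model rollback.
WITH daily_nulls AS (
    SELECT
        feature_name,
        feature_day,
        AVG(CASE WHEN value IS NULL THEN 1.0 ELSE 0.0 END) AS null_rate
    FROM feature_values
    GROUP BY feature_name, feature_day
),
baseline AS (
    SELECT feature_name, AVG(null_rate) AS baseline_null_rate
    FROM daily_nulls
    WHERE feature_day BETWEEN CURRENT_DATE - 28 AND CURRENT_DATE - 1
    GROUP BY feature_name
)
SELECT d.feature_name, d.null_rate, b.baseline_null_rate
FROM daily_nulls d
JOIN baseline b ON b.feature_name = d.feature_name
WHERE d.feature_day = CURRENT_DATE
  AND d.null_rate > b.baseline_null_rate + 0.05;   -- alert threshold: +5 points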
