InterviewStack.io

Data Quality and Governance Questions

Covers the principles, frameworks, practices, and tooling used to ensure data is accurate, complete, timely, and trustworthy across systems and pipelines. Key areas include:

- Data quality checks and monitoring: nullness and type checks, freshness and timeliness validation, referential integrity, deduplication, outlier detection, reconciliation, and automated alerting (a minimal check is sketched below).
- Governance: design of service-level agreements for data freshness and accuracy, data lineage and impact analysis, metadata and catalog management, data classification, access controls, and compliance policies.
- Operational reliability: failure handling, recovery time objectives, backup and disaster recovery strategies, and observability and incident response for data anomalies.
- Domain- and system-specific considerations, such as customer relationship management (CRM) and sales systems: common causes of data problems; prevention strategies such as input validation rules, canonicalization, deduplication, and training; and the business impact on forecasting and operations.

Candidates may be evaluated on designing end-to-end data quality programs, selecting metrics and tooling, defining roles and stewardship, and implementing automated pipelines and governance controls.
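For instance, a minimal freshness-and-nullness check of the kind listed above might look like the following, assuming a hypothetical orders(order_id, customer_id, created_at) table and Postgres-style SQL:

-- Hypothetical table: orders(order_id, customer_id, created_at).
-- Freshness: breach if no row has arrived in the last hour;
-- nullness: breach if more than 1% of today's rows lack a customer_id.
-- Note: both expressions return NULL on a day with no rows; a production
-- check would COALESCE that into an explicit breach.
SELECT
    MAX(created_at) < NOW() - INTERVAL '1 hour' AS freshness_breach,
    AVG(CASE WHEN customer_id IS NULL THEN 1.0 ELSE 0.0 END) > 0.01 AS nullness_breach
FROM orders
WHERE created_at >= CURRENT_DATE;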

Medium · Technical
You detect ~5% duplicate records in a 'customer_master' table that was built by merging multiple source systems. As a data scientist, outline a pragmatic, step-by-step plan to deduplicate it and implement master data management (MDM): discovery, blocking/matching strategy, rules for merging conflicting attributes, an audit trail and rollback plan, and continuous prevention measures.
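
A minimal sketch of the blocking-and-survivorship step, assuming a hypothetical customer_master(customer_id, email, full_name, updated_at) layout and standard window functions; real matching would add fuzzy keys (phone, name) and source-priority rules:

-- Block on canonicalized email; within each block, keep the most recently
-- updated record as the survivor (last-write-wins) and map every other
-- record to it so the merge can be audited and rolled back.
WITH blocked AS (
    SELECT
        customer_id,
        ROW_NUMBER() OVER (
            PARTITION BY LOWER(TRIM(email))
            ORDER BY updated_at DESC
        ) AS rn,
        FIRST_VALUE(customer_id) OVER (
            PARTITION BY LOWER(TRIM(email))
            ORDER BY updated_at DESC
        ) AS survivor_id
    FROM customer_master
    WHERE email IS NOT NULL
)
SELECT customer_id AS duplicate_id, survivor_id
FROM blocked
WHERE rn > 1;   -- write these pairs to a merge-audit table before collapsing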
Hard · System Design
Architect a scalable reconciliation system that can detect discrepancies across 50 heterogeneous sources (databases, event streams, APIs), compute an impact score per discrepancy, and trigger prioritized corrective actions. Describe ingestion, normalization to canonical metrics, matching algorithms, source reliability scoring, and operator UI components.
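
One way to sketch the scoring step, assuming hypothetical canonical_metrics(source_id, metric_name, metric_day, metric_value) and source_reliability(source_id, reliability_weight) tables produced by the normalization stage:

-- Compare each source against a reliability-weighted consensus value and
-- rank discrepancies by relative gap, which feeds the impact score and
-- the prioritized corrective-action queue.
WITH consensus AS (
    SELECT
        m.metric_name,
        m.metric_day,
        SUM(m.metric_value * r.reliability_weight)
            / SUM(r.reliability_weight) AS weighted_value
    FROM canonical_metrics m
    JOIN source_reliability r ON r.source_id = m.source_id
    GROUP BY m.metric_name, m.metric_day
)
SELECT
    m.source_id,
    m.metric_name,
    m.metric_day,
    ABS(m.metric_value - c.weighted_value)
        / NULLIF(ABS(c.weighted_value), 0) AS relative_gap
FROM canonical_metrics m
JOIN consensus c
  ON c.metric_name = m.metric_name AND c.metric_day = m.metric_day
WHERE ABS(m.metric_value - c.weighted_value)
        / NULLIF(ABS(c.weighted_value), 0) > 0.01   -- per-metric tolerance
ORDER BY relative_gap DESC;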
Medium · Technical
Write an SQL query to reconcile total events per day between a streaming_events source and a daily_aggregates table and flag days where totals differ by >1%. Schemas:
streaming_events(event_id, user_id, event_type, occurred_at timestamp)
daily_aggregates(event_day date, event_type, event_count bigint)
Assume streaming_events may contain duplicates and late arrivals. Explain how you'd choose tolerances and handle late-arriving events.
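
One possible answer sketch, using the schemas above and Postgres-style date arithmetic; it assumes duplicates in streaming_events are exact replays of the same event_id:

-- Deduplicate by event_id, roll up to day/type, then compare against
-- daily_aggregates with a 1% relative tolerance. Late arrivals: only
-- reconcile days older than the pipeline's watermark (here, 2 days),
-- so rows that land after the aggregate was built do not raise alerts.
WITH deduped AS (
    SELECT DISTINCT event_id, event_type,
           CAST(occurred_at AS DATE) AS event_day
    FROM streaming_events
),
stream_daily AS (
    SELECT event_day, event_type, COUNT(*) AS stream_count
    FROM deduped
    GROUP BY event_day, event_type
)
SELECT
    a.event_day,
    a.event_type,
    a.event_count AS batch_count,
    s.stream_count,
    ABS(s.stream_count - a.event_count) * 1.0
        / NULLIF(a.event_count, 0) AS relative_diff
FROM daily_aggregates a
JOIN stream_daily s
  ON s.event_day = a.event_day AND s.event_type = a.event_type
WHERE a.event_day < CURRENT_DATE - 2                 -- watermark for late data
  AND ABS(s.stream_count - a.event_count) * 1.0
        / NULLIF(a.event_count, 0) > 0.01;           -- flag >1% discrepancies

A FULL OUTER JOIN in place of the inner join would additionally surface days present on only one side.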
Hard · System Design
Propose a canonical data model and governance approach for customer data spanning CRM, billing, support, and marketing systems. Define canonical entities, reconciliation rules for conflicting attributes (last-write, source-priority, merge rules), provenance capture, data ownership, and an API contract for analytics access.
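
A minimal DDL sketch of provenance capture for one canonical entity; all table and column names here are illustrative, not a prescribed standard:

-- Canonical entity: one golden record per customer.
CREATE TABLE canonical_customer (
    canonical_id    BIGINT PRIMARY KEY,
    email           VARCHAR(320),
    full_name       VARCHAR(200),
    updated_at      TIMESTAMP NOT NULL
);

-- Provenance: one row per attribute per contributing source, so conflicts
-- can be resolved by source priority or last write and audited afterwards.
CREATE TABLE canonical_customer_provenance (
    canonical_id    BIGINT NOT NULL REFERENCES canonical_customer (canonical_id),
    attribute_name  VARCHAR(100) NOT NULL,
    source_system   VARCHAR(50)  NOT NULL,    -- e.g. 'crm', 'billing', 'support'
    source_value    VARCHAR(1000),
    source_priority INT NOT NULL,             -- lower number wins under source-priority
    observed_at     TIMESTAMP NOT NULL,       -- enables last-write-wins
    PRIMARY KEY (canonical_id, attribute_name, source_system)
);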
Medium · System Design
Design an automated pipeline that detects degraded feature quality (distribution shift, rising nulls) for production features and triggers model rollback. Describe monitoring signals, gating logic, integration with CI/CD, canary rollout approach, and safe rollback procedure with auditability.
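
A minimal sketch of one monitoring signal (a rising null rate per feature), assuming a hypothetical feature_values(feature_name, feature_day, value) table and Postgres-style date arithmetic; distribution-shift signals such as PSI would sit alongside it in the gating logic:

-- Compare today's null rate per feature against a 28-day baseline and
-- emit rows that should trip the gate for canary hold or model rollback.
WITH daily_nulls AS (
    SELECT
        feature_name,
        feature_day,
        AVG(CASE WHEN value IS NULL THEN 1.0 ELSE 0.0 END) AS null_rate
    FROM feature_values
    GROUP BY feature_name, feature_day
),
baseline AS (
    SELECT feature_name, AVG(null_rate) AS baseline_null_rate
    FROM daily_nulls
    WHERE feature_day BETWEEN CURRENT_DATE - 28 AND CURRENT_DATE - 1
    GROUP BY feature_name
)
SELECT d.feature_name, d.null_rate, b.baseline_null_rate
FROM daily_nulls d
JOIN baseline b ON b.feature_name = d.feature_name
WHERE d.feature_day = CURRENT_DATE
  AND d.null_rate > b.baseline_null_rate + 0.05;   -- alert threshold: +5 points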
