Covers the principles, frameworks, practices, and tooling used to keep data accurate, complete, timely, and trustworthy across systems and pipelines. Key areas include data quality checks and monitoring: null and type checks, freshness and timeliness validation, referential integrity, deduplication, outlier detection, reconciliation, and automated alerting. Also includes the design of service level agreements for data freshness and accuracy, data lineage and impact analysis, metadata and catalog management, data classification, access controls, and compliance policies. Encompasses the operational reliability of data systems, including failure handling, recovery time objectives, backup and disaster recovery strategies, observability, and incident response for data anomalies. Also covers domain- and system-specific considerations, for example in customer relationship management and sales systems: common causes of data problems; prevention strategies such as input validation rules, canonicalization, deduplication, and user training; and the business impact on forecasting and operations. Candidates may be evaluated on designing end-to-end data quality programs, selecting metrics and tooling, defining roles and stewardship, and implementing automated pipelines and governance controls.
Easy · Technical
44 practiced
What is a schema registry and why is it important for data quality in ML systems? Explain how a schema registry integrates with Kafka/CDC producers and consumers, prevents schema drift, enforces compatibility (backward/forward), and supports feature engineering and reproducibility.
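One way to make the compatibility part of an answer concrete is a minimal sketch of the check a registry could run before accepting a new schema version. The `Field` model and `backward_compat_problems` function are hypothetical simplifications, not the API of any real registry. The sketch assumes "backward compatible" means a consumer on the new schema can still read records written with the old one, and it treats every type change as breaking, which is stricter than formats that allow type promotion.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    """Hypothetical, simplified schema field (not a real registry type)."""
    name: str
    type: str
    has_default: bool = False


def backward_compat_problems(old, new):
    """Return reasons why `new` cannot read data written with `old`.

    An empty list means the change is backward compatible under this
    simplified model: removed fields are fine, added fields need a
    default, and any type change is rejected.
    """
    old_by_name = {f.name: f for f in old}
    problems = []
    for field in new:
        previous = old_by_name.get(field.name)
        if previous is None:
            # Old records lack this field, so the reader needs a default.
            if not field.has_default:
                problems.append(f"added field '{field.name}' has no default")
        elif previous.type != field.type:
            problems.append(
                f"field '{field.name}' changed type "
                f"{previous.type} -> {field.type}"
            )
    return problems


# Illustrative schemas; the field names are made up for this example.
v1 = [Field("user_id", "string"), Field("amount", "double")]
v2_ok = v1 + [Field("currency", "string", has_default=True)]
v2_bad = v1 + [Field("currency", "string")]
```

A producer-side gate would reject `v2_bad` at registration time, which is how drift is stopped before it reaches consumers or feature pipelines.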
Medium · Technical
37 practiced
Define measurable SLAs and SLOs for data freshness for features used by a real-time recommendation model. Propose concrete SLOs (targets, windows, error budgets), alert thresholds, and describe how you would enforce and communicate breaches to stakeholders and consumers.
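The arithmetic behind a freshness SLO and its error budget can be sketched as follows. The function name, the 60-second staleness threshold, the 99% target, and the 50% budget alert level are all illustrative assumptions, not values taken from the question. Here `lags` stands for per-request measurements of feature age observed over one SLO window.

```python
from datetime import timedelta


def freshness_slo_report(lags, threshold, target, alert_at=0.5):
    """Summarize freshness SLO compliance for one evaluation window.

    lags      : list of timedelta, observed age of the feature at read time
    threshold : timedelta, maximum age that still counts as fresh
    target    : fraction of reads that must be fresh, e.g. 0.99
    alert_at  : fraction of the error budget that triggers an alert
                (hypothetical policy knob)
    """
    if not lags:
        raise ValueError("need at least one lag observation")
    if not 0 < target < 1:
        raise ValueError("target must be strictly between 0 and 1")

    total = len(lags)
    stale = sum(1 for lag in lags if lag > threshold)
    # Error budget: how many stale reads the target tolerates in this window.
    allowed_stale = (1 - target) * total
    budget_used = stale / allowed_stale

    return {
        "stale": stale,
        "compliance": 1 - stale / total,
        "budget_used": budget_used,
        "alert": budget_used >= alert_at,
        "breached": budget_used > 1.0,
    }
```

Reporting the fraction of budget consumed, rather than only pass or fail, gives consumers an early warning before the SLO is actually breached.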
Medium · Technical
49 practiced
You maintain two tables: source_payments(order_id, amount, currency, created_at) in the OLTP source and warehouse_payments(order_id, amount_usd, processed_at) in the analytics warehouse. Write a SQL query to identify orders present in source in the last 30 days but either missing in the warehouse or with mismatched amount_usd after applying exchange rates from fx_rates(currency, rate_to_usd, effective_date). Include a reason column with values 'missing' or 'amount_mismatch' and account for timezone differences.
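One possible shape for the reconciliation query is sketched below using SQLite through Python so that it runs self-contained. Several things here are assumptions rather than part of the question: timestamps are stored as UTC text (`YYYY-MM-DD HH:MM:SS`), so timezone handling reduces to converting to UTC at ingestion; the FX rate is matched on the UTC date of `created_at`; a 0.01 USD tolerance absorbs rounding; and rows with no matching rate are not flagged as mismatches. The sample rows in `demo_connection` are made up.

```python
import sqlite3

# SQLite dialect; date functions differ in other warehouses.
RECONCILE_SQL = """
SELECT s.order_id,
       CASE WHEN w.order_id IS NULL THEN 'missing'
            ELSE 'amount_mismatch' END AS reason
FROM source_payments AS s
LEFT JOIN warehouse_payments AS w
       ON w.order_id = s.order_id
LEFT JOIN fx_rates AS fx
       ON fx.currency = s.currency
      AND fx.effective_date = date(s.created_at)   -- UTC date of the payment
WHERE s.created_at >= datetime(:now_utc, '-30 days')
  AND (w.order_id IS NULL
       OR ABS(w.amount_usd - s.amount * fx.rate_to_usd) > 0.01)
ORDER BY s.order_id
"""


def find_discrepancies(conn, now_utc):
    """Return (order_id, reason) rows for the 30 days before now_utc."""
    return conn.execute(RECONCILE_SQL, {"now_utc": now_utc}).fetchall()


def demo_connection():
    """Build an in-memory database with made-up sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE source_payments(order_id, amount, currency, created_at);
        CREATE TABLE warehouse_payments(order_id, amount_usd, processed_at);
        CREATE TABLE fx_rates(currency, rate_to_usd, effective_date);
    """)
    conn.executemany("INSERT INTO source_payments VALUES (?, ?, ?, ?)", [
        (1, 100.0, "EUR", "2024-06-20 10:00:00"),  # matches warehouse
        (2, 50.0, "USD", "2024-06-21 09:30:00"),   # absent from warehouse
        (3, 200.0, "EUR", "2024-06-22 23:59:00"),  # wrong converted amount
        (4, 75.0, "USD", "2024-01-01 00:00:00"),   # outside the 30-day window
    ])
    conn.executemany("INSERT INTO warehouse_payments VALUES (?, ?, ?)", [
        (1, 110.0, "2024-06-20 11:00:00"),
        (3, 200.0, "2024-06-23 00:30:00"),
    ])
    conn.executemany("INSERT INTO fx_rates VALUES (?, ?, ?)", [
        ("EUR", 1.1, "2024-06-20"),
        ("EUR", 1.1, "2024-06-22"),
        ("USD", 1.0, "2024-06-21"),
        ("USD", 1.0, "2024-01-01"),
    ])
    return conn
```

Calling `find_discrepancies(demo_connection(), "2024-06-30 00:00:00")` exercises both reason values. In a real warehouse, missing FX rates would likely deserve their own reason code instead of being silently skipped.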
Medium · System Design
41 practiced
Design an end-to-end data quality monitoring system for ML features that must support batch and streaming ingestion at 10M events/min. Requirements: real-time anomaly detection (latency < 5s), historical baselines and drift detection, lineage to raw tables, alerting with runbooks, and multi-tenant isolation. Outline architecture, components, data stores, and approaches to minimize false positives.
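For the false-positive requirement, one common building block is a detector that compares each new metric value with a rolling baseline and alerts only after several consecutive breaches. The sketch below is a single-process illustration of that idea, not a design for 10M events/min. The class name, window size, z-score threshold, and consecutive-breach count are all hypothetical defaults.

```python
from collections import deque
from statistics import mean, pstdev


class RollingZScoreDetector:
    """Flag a metric only after `consecutive` out-of-range observations."""

    def __init__(self, window=60, z_threshold=4.0, consecutive=3, min_points=10):
        self.history = deque(maxlen=window)  # rolling baseline of normal values
        self.z_threshold = z_threshold
        self.consecutive = consecutive
        self.min_points = min_points
        self.breaches = 0

    def observe(self, value):
        """Record one observation and return True if an alert should fire."""
        if len(self.history) < self.min_points:
            # Still warming up: build the baseline, never alert.
            self.history.append(value)
            return False

        mu = mean(self.history)
        sigma = pstdev(self.history)
        if sigma > 0:
            z = abs(value - mu) / sigma
        else:
            # Flat baseline: any deviation at all is treated as extreme.
            z = 0.0 if value == mu else float("inf")

        if z > self.z_threshold:
            # Keep anomalous points out of the baseline so it stays clean.
            self.breaches += 1
            return self.breaches >= self.consecutive

        self.breaches = 0
        self.history.append(value)
        return False
```

Requiring consecutive breaches trades a little detection latency for fewer pages on one-off spikes. In a streaming deployment, the same logic would typically run per tenant and per feature inside the stream processor.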
Hard · Technical
49 practiced
As an ML Engineer, define roles, workflows, and runbooks for handling data anomalies: who to page at each severity level, triage steps, likely immediate remediation and rollback actions, communication templates for stakeholders, and how to tie incident priority to SLOs and business impact.