Design how systems exchange, synchronize, and manage data across a technology stack. Candidates should be able to map data flows from collection through activation, choose between unidirectional and bidirectional integrations, and select real-time versus batch synchronization strategies. Coverage includes master data management and source-of-truth strategies; conflict resolution and reconciliation; integration patterns and technologies such as application programming interfaces (APIs), webhooks, native connectors, and extract-transform-load (ETL) processes; schema and field mapping; deduplication approaches; idempotency and retry strategies; and how to handle error modes. Operational topics include monitoring and observability for integrations, audit trails and logging for traceability, scaling and latency trade-offs, and approaches to reduce integration complexity across multiple systems. Interview focus is on integration patterns, connector trade-offs, data consistency and lineage, and operational practices for reliable cross-system data flow.
Hard · Technical
Design a governance workflow and system to manage mapping rule changes for integrations, such that every change is traceable (who changed what and when), approved, and tested before deployment. Include versioning of mapping rules, automated tests (golden datasets), approval gates, audit logs, rollback mechanisms, and how you would expose traceability to auditors.
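One workable backbone for an answer is an immutable, versioned change record per rule revision, with approval recorded separately from authorship and deployment gated on golden-dataset replays. A minimal sketch, assuming hypothetical names (MappingRuleChange, apply_mapping) rather than any specific tool:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class MappingRuleChange:
    """Immutable audit record for one mapping-rule revision (illustrative schema)."""
    rule_id: str
    version: int          # monotonically increasing per rule_id
    rule_body: dict       # the field mapping itself
    author: str
    approved_by: str | None = None   # set by the approval gate, never by the author
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

def apply_mapping(rule_body: dict, record: dict) -> dict:
    """Toy mapper: rule_body maps source field names to target field names."""
    return {target: record[source] for source, target in rule_body.items()}

def passes_golden_dataset(rule_body: dict, golden_cases: list[dict]) -> bool:
    """Replay curated input/expected pairs; only passing revisions may be approved."""
    return all(
        apply_mapping(rule_body, case["input"]) == case["expected"]
        for case in golden_cases
    )
```

Because records are immutable and versioned, rollback is just redeploying the previous version, and the trail of who changed what, who approved it, and which tests it passed falls out of the stored records themselves.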
Medium · Technical
You're responsible for ensuring full data lineage for a compliance report. Which metadata fields and provenance information would you capture at each pipeline stage (ingest, transform, load), how would you store that metadata so that auditors can query and verify it, and what access controls and retention policies would you enforce?
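One hedged sketch of the per-record provenance envelope such an answer might capture at every stage; the field names are illustrative, not a standard:

```python
import hashlib
import json
import uuid
from datetime import datetime, timezone

def provenance_stamp(stage: str, record: dict, source_system: str,
                     pipeline_version: str, parent_id: str | None = None) -> dict:
    """Metadata emitted at each stage (ingest, transform, load)."""
    return {
        "lineage_id": str(uuid.uuid4()),        # unique per stage event
        "parent_lineage_id": parent_id,         # chains load -> transform -> ingest
        "stage": stage,
        "source_system": source_system,
        "pipeline_version": pipeline_version,   # code/config that produced the record
        "processed_at": datetime.now(timezone.utc).isoformat(),
        # content hash lets an auditor verify the record was not altered afterwards
        "record_sha256": hashlib.sha256(
            json.dumps(record, sort_keys=True).encode()
        ).hexdigest(),
    }
```

Stored append-only and keyed by lineage_id, the parent links let an auditor walk any reported value back to its ingest event; access controls and retention then apply to that metadata store like any other regulated dataset.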
Medium · Technical
Describe an observability and monitoring plan for a heterogeneous set of integrations that includes webhooks, scheduled ETL jobs, and streaming pipelines. For each integration type, list concrete metrics (latency, throughput, error rate, lag), logs, traces, SLOs/SLIs, alert thresholds, dashboards, and runbook actions. Explain how you'd instrument end-to-end traceability and root-cause analysis.
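For the end-to-end traceability portion, one common building block is a correlation ID minted at the edge and propagated through every hop, with one structured log line per hop. A minimal sketch using only the standard library; the event and metric names are illustrative:

```python
import json
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("integration")

def handle_webhook(payload: dict) -> dict:
    """Attach (or propagate) a correlation_id and emit a structured log line."""
    correlation_id = payload.get("correlation_id") or str(uuid.uuid4())
    started = time.monotonic()
    try:
        return {"status": "accepted", "correlation_id": correlation_id}
    finally:
        # The same correlation_id should appear in downstream ETL and stream logs,
        # so a single query reconstructs the full path during root-cause analysis.
        log.info(json.dumps({
            "event": "webhook.processed",
            "correlation_id": correlation_id,
            "latency_ms": round((time.monotonic() - started) * 1000, 2),
        }))
```

The latency and error fields in these log lines can then feed the per-type metrics, SLO burn alerts, and dashboards the question asks for.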
Medium · Technical
Implement a Python function dedupe_latest(events) that consumes an iterator/generator of JSON-like dicts where each record contains keys: 'id' (string), 'ts' (ISO8601 timestamp string), and 'payload' (dict). The function should return a list of deduplicated events containing only the latest event for each 'id' (based on 'ts'). Aim for O(N) time and O(K) memory where K is the number of unique ids, and return the resulting list ordered by timestamp ascending. Assume timestamps fit in memory.
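A reference sketch that meets the stated bounds, assuming the 'ts' strings parse with datetime.fromisoformat:

```python
from datetime import datetime

def dedupe_latest(events):
    """Return only the latest event per 'id', ordered by 'ts' ascending.

    One pass over the input (O(N) time), one dict entry per unique id
    (O(K) memory); the final sort over the K survivors is O(K log K).
    """
    latest = {}  # id -> (parsed_ts, event)
    for event in events:
        ts = datetime.fromisoformat(event["ts"])  # assumes fromisoformat-compatible strings
        kept = latest.get(event["id"])
        if kept is None or ts > kept[0]:
            latest[event["id"]] = (ts, event)
    return [event for _, event in sorted(latest.values(), key=lambda pair: pair[0])]
```

Sorting on the parsed timestamp rather than the whole tuple avoids comparing payload dicts when two timestamps tie.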
Hard · Technical
Design and implement (pseudocode acceptable) a scalable deduplication service using Redis that de-duplicates events by event_id within a TTL-based dedupe window. The service must work correctly across multiple worker instances and avoid race conditions. Describe Redis primitives you would use (SETNX, EXISTS, Lua scripting, Redis streams), TTL strategy, and how you would handle Redis failover.
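One possible core, sketched with redis-py: a single atomic SET with NX and EX both claims the event_id and starts its TTL window, which removes the EXISTS-then-SET race between worker instances. The connection details and TTL value here are assumptions:

```python
import redis

r = redis.Redis(host="localhost", port=6379)  # assumed endpoint; use Sentinel/Cluster in production
DEDUPE_TTL_SECONDS = 3600                     # dedupe window; tune to upstream redelivery behavior

def should_process(event_id: str) -> bool:
    """True only for the first worker to claim event_id within the TTL window.

    SET key NX EX is one atomic command, so concurrent workers on different
    instances cannot both claim the same id.
    """
    return r.set(f"dedupe:{event_id}", 1, nx=True, ex=DEDUPE_TTL_SECONDS) is not None
```

Since a failover to a lagging replica can drop recently written keys, the dedupe guarantee is best-effort across failover; downstream consumers should remain idempotent regardless.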