InterviewStack.io

Data Integration and Flow Design Questions

Design how systems exchange, synchronize, and manage data across a technology stack. Candidates should be able to map data flows from collection through activation, choose between unidirectional and bidirectional integrations, and select real-time versus batch synchronization strategies. Coverage includes master data management and source-of-truth strategies, conflict resolution and reconciliation, integration patterns and technologies such as application programming interfaces (APIs), webhooks, native connectors, and extract-transform-load (ETL) processes, schema and field mapping, deduplication approaches, idempotency and retry strategies, and handling of error modes. Operational topics include monitoring and observability for integrations, audit trails and logging for traceability, scaling and latency trade-offs, and approaches to reduce integration complexity across multiple systems. Interview focus is on integration patterns, connector trade-offs, data consistency and lineage, and operational practices for reliable cross-system data flow.

Hard · Technical
Design and implement (pseudocode acceptable) a scalable deduplication service using Redis that de-duplicates events by event_id within a TTL-based dedupe window. The service must work correctly across multiple worker instances and avoid race conditions. Describe Redis primitives you would use (SETNX, EXISTS, Lua scripting, Redis streams), TTL strategy, and how you would handle Redis failover.
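For reference, a minimal sketch of the core claim-then-process step, assuming the redis-py client and a hypothetical process_event handler; a full answer would also cover Lua scripting for multi-key checks and what happens to dedupe keys when a failover loses recent writes:

```python
import redis  # assumes the redis-py client is installed

r = redis.Redis(host="localhost", port=6379)  # illustrative connection details

DEDUPE_TTL_SECONDS = 24 * 60 * 60  # dedupe window; tune to your replay horizon

def handle_event(event):
    """Process an event at most once per dedupe window across all workers."""
    key = f"dedupe:{event['event_id']}"
    # SET ... NX EX is atomic: only one worker can claim the key, and the TTL
    # is set in the same command (avoids the SETNX-then-EXPIRE race).
    claimed = r.set(key, "1", nx=True, ex=DEDUPE_TTL_SECONDS)
    if not claimed:
        return  # duplicate within the window; drop it
    try:
        process_event(event)  # hypothetical downstream handler
    except Exception:
        # If processing fails, release the claim so a retry can reprocess.
        r.delete(key)
        raise
```

Using a single SET with NX and EX is the usual reason to prefer it over separate SETNX and EXPIRE calls: a crash between the two commands would otherwise leave a claim key with no TTL.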
Medium · Technical
Given the table 'events' with schema:
events(event_id uuid primary key, user_id int, event_time timestamp, data jsonb)
Write a PostgreSQL query that returns the latest event per user (user_id) within the last 30 days, deduplicating by user_id and keeping only the row with the greatest event_time. Include handling for ties and a sample of how you would index the table for performance.
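One reasonable shape for the answer uses PostgreSQL's DISTINCT ON with a deterministic tie-breaker. The sketch below wraps the query in Python purely for illustration; the psycopg2 usage, the connection DSN, and the index choice are assumptions, not the only valid answer:

```python
import psycopg2  # assumption: psycopg2 driver is available

# DISTINCT ON keeps one row per user_id, chosen by the ORDER BY:
# latest event_time first, with event_id as a deterministic tie-breaker.
LATEST_EVENT_PER_USER = """
    SELECT DISTINCT ON (user_id)
           event_id, user_id, event_time, data
    FROM   events
    WHERE  event_time >= now() - interval '30 days'
    ORDER  BY user_id, event_time DESC, event_id DESC;
"""

# A composite index lets Postgres walk users in order and pick the newest row:
#   CREATE INDEX idx_events_user_time ON events (user_id, event_time DESC);

def fetch_latest_events(dsn: str):
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(LATEST_EVENT_PER_USER)
        return cur.fetchall()
```

The ORDER BY both drives which row DISTINCT ON keeps and resolves event_time ties deterministically via event_id.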
Easy · Technical
Compare batch and streaming synchronization strategies for moving data between systems (for example syncing CRM events to a data warehouse). For each approach describe expected latency, ordering guarantees, idempotency concerns, checkpointing/watermarking behavior, tooling examples (Airflow, Spark, Flink, Kafka, Pulsar), and concrete use cases where one approach is preferable over the other.
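To make the checkpointing/watermarking and idempotency points concrete, here is a rough sketch of an incremental batch sync loop; the extract and load functions are stand-ins for real connectors, and the checkpoint callbacks are hypothetical:

```python
from datetime import datetime, timedelta, timezone

LATENESS_BUFFER = timedelta(minutes=10)   # re-read a small window for late rows

def extract_since(watermark):
    """Stand-in for a source query, e.g. CRM rows with updated_at > watermark."""
    return []

def load_idempotently(rows):
    """Stand-in for the warehouse load; upserts by primary key so reruns are safe."""
    pass

def run_batch_sync(read_checkpoint, write_checkpoint):
    # Overlap the previous watermark by LATENESS_BUFFER so late-arriving rows
    # are still captured; idempotent upserts make the overlap harmless.
    watermark = read_checkpoint()
    rows = extract_since(watermark - LATENESS_BUFFER)
    load_idempotently(rows)
    # Advance the checkpoint only after a successful load (at-least-once),
    # which is why the sink must tolerate replays.
    write_checkpoint(datetime.now(timezone.utc))
```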
Medium · Technical
You receive an alert that a downstream analytics job reports missing records after an ETL run. Walk through the incident investigation steps you would take: what logs, metrics, offsets, and systems to check first, how to determine whether the loss occurred at source, ingestion, transform, or load, and how you would safely restore missing data without introducing duplicates.
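A concrete aid during such an investigation is a stage-by-stage reconciliation of record counts. The sketch below is a hypothetical helper that buckets rows by hour; running it source vs. staging, then staging vs. warehouse, narrows the loss to a time window and a pipeline stage:

```python
from collections import Counter

def hourly_counts(rows):
    """Bucket rows by event hour so gaps can be localized in time."""
    return Counter(r["event_time"].strftime("%Y-%m-%d %H:00") for r in rows)

def find_missing_windows(upstream_rows, downstream_rows):
    """Return the hours where the downstream stage has fewer rows than upstream."""
    src, dst = hourly_counts(upstream_rows), hourly_counts(downstream_rows)
    return {hour: src[hour] - dst.get(hour, 0)
            for hour in src if src[hour] > dst.get(hour, 0)}
```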
Hard · Technical
Explain how to achieve exactly-once processing semantics end-to-end from a transactional database to an analytics store (e.g., Postgres -> Kafka -> Snowflake). Discuss practical patterns: transactional CDC, Kafka transactions, idempotent upserts at the sink, deduplication strategies, limitations, and operational considerations like replays and schema changes.
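As one illustration of the "idempotent upserts at the sink" piece, the sketch below pairs a MERGE keyed on event_id with manual offset commits, so replays after a crash converge to the same rows. The confluent-kafka configuration, topic and table names, and the upsert_into_warehouse helper are assumptions for the example:

```python
from confluent_kafka import Consumer  # assumption: confluent-kafka client

# MERGE keyed on event_id makes the sink write idempotent: replaying the same
# change after a crash or a CDC re-read converges to the same row.
UPSERT_SQL = """
    MERGE INTO analytics.orders AS t
    USING (SELECT %(event_id)s AS event_id, %(payload)s AS payload) AS s
    ON t.event_id = s.event_id
    WHEN MATCHED THEN UPDATE SET payload = s.payload
    WHEN NOT MATCHED THEN INSERT (event_id, payload) VALUES (s.event_id, s.payload);
"""

def upsert_into_warehouse(sql, msg):
    """Hypothetical sink write: execute the MERGE with the message's key/value."""
    pass

consumer = Consumer({
    "bootstrap.servers": "broker:9092",   # illustrative config
    "group.id": "orders-sink",
    "enable.auto.commit": False,          # commit only after the sink write succeeds
    "isolation.level": "read_committed",  # skip records from aborted Kafka transactions
})
consumer.subscribe(["orders.cdc"])

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    upsert_into_warehouse(UPSERT_SQL, msg)
    consumer.commit(message=msg)  # effectively-once: any replay is absorbed by the MERGE
```

The delivery guarantee here is at-least-once from Kafka; it is the idempotent sink write that makes the end-to-end result look exactly-once.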
