InterviewStack.io

Data Processing and Transformation Questions

Focuses on algorithmic and engineering approaches to transforming and cleaning data at scale. Includes deduplication strategies, parsing and normalizing unstructured or semi-structured data, handling missing or inconsistent values, incremental and chunked processing for large datasets, batch versus streaming trade-offs, state management, efficient memory and compute usage, idempotency and error handling, and techniques for scaling and parallelizing transformation pipelines. Interviewers may assess problem-solving, choice of algorithms and data structures, and pragmatic design for reliability and performance.

Medium | Technical
36 practiced
Design a schema-evolution strategy for a company where producers frequently add optional fields and occasionally rename fields. Propose rules, compatibility guarantees, registry policies, and deployment steps for producer and consumer teams to avoid downstream breakage. As SRE, how would you enforce and automate these policies?
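One way to start on the automation part of this question is a compatibility gate that a registry or CI job could run before a new schema version is accepted. The sketch below is only an illustration: the schema shape (a dict of field name to `type`, `required`, and `aliases`) and the function name `compatibility_violations` are assumptions made here, not the format of any particular schema registry.

```python
from typing import Dict, List

# Hypothetical schema shape, assumed for this sketch:
# field name -> {"type": str, "required": bool, "aliases": [old names]}
Schema = Dict[str, dict]


def compatibility_violations(old: Schema, new: Schema) -> List[str]:
    """Return reasons why `new` could break consumers that still expect `old`."""
    problems = []

    # Every name the new schema answers to, including declared aliases.
    known = {}
    for name, spec in new.items():
        known[name] = spec
        for alias in spec.get("aliases", []):
            known[alias] = spec

    # Rule 1: existing fields must survive (possibly under an alias) with the same type.
    for name, spec in old.items():
        if name not in known:
            problems.append(f"field '{name}' removed or renamed without alias")
        elif known[name]["type"] != spec["type"]:
            problems.append(f"field '{name}' changed type")

    # Rule 2: fields that are new in this version must be optional.
    for name, spec in new.items():
        names = {name, *spec.get("aliases", [])}
        is_new = names.isdisjoint(old)
        if is_new and spec.get("required", False):
            problems.append(f"new field '{name}' must be optional")

    return problems
```

A CI step could reject a producer change whenever the returned list is non-empty, which turns the "add optional fields, rename only with an alias" rules into an enforced policy rather than a convention.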
Medium | Technical
59 practiced
You have a skewed key distribution causing a small fraction of workers to be overloaded during parallel transformation. Describe partitioning strategies (hash partitioning, key-salting, range partitioning, consistent hashing, dynamic re-sharding) and pick one to mitigate heavy hitters. Include trade-offs and implementation steps for an SRE-run pipeline.
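Of the strategies listed, key-salting is probably the easiest to show in a few lines. The sketch below assumes the hot keys are already known (for example from a sampling pass) and that the transformation is an associative aggregation, so partial results can be merged in a second stage. `NUM_PARTITIONS`, `SALT_BUCKETS`, and both function names are hypothetical values chosen for the illustration.

```python
import random
import zlib
from collections import Counter

NUM_PARTITIONS = 8  # assumed worker count
SALT_BUCKETS = 4    # assumed fan-out applied to each hot key


def partition_for(key: str, hot_keys: set, rng: random.Random) -> int:
    """Hash-partition a key, appending a random salt when the key is a known heavy hitter."""
    if key in hot_keys:
        key = f"{key}#{rng.randrange(SALT_BUCKETS)}"
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def salted_count(keys, hot_keys, seed=0):
    """Two-stage count: per-partition partials first, then a merge by original key."""
    rng = random.Random(seed)
    partials = [Counter() for _ in range(NUM_PARTITIONS)]

    # Stage 1: each partition counts the records routed to it.
    for key in keys:
        partials[partition_for(key, hot_keys, rng)][key] += 1

    # Stage 2: merge the partial counts; the salt never leaves stage 1.
    total = Counter()
    for partial in partials:
        total.update(partial)

    loads = [sum(p.values()) for p in partials]
    return total, loads
```

The trade-off is the extra merge stage and the need to maintain the hot-key list. A hot key can land on at most `SALT_BUCKETS` partitions, so the fan-out bounds how much relief salting can give.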
Hard | Technical
30 practiced
Design a privacy and compliance-aware transformation pipeline that strips PII from events and records lineage for audit purposes. Include techniques for deterministic masking, tokenization, access controls, reversible vs irreversible transforms, and how to maintain searchable lineage while protecting sensitive fields.
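Deterministic masking can be illustrated with a keyed hash: the same input always maps to the same token, so joins and lookups on the masked field still work, while the transform is irreversible without a separate token vault. The sketch below uses HMAC-SHA256 from the Python standard library; the function names, the lineage record shape, and the idea of passing the secret as a plain argument are assumptions made for brevity (a real pipeline would fetch the key from a secrets manager).

```python
import hashlib
import hmac


def mask_value(value: str, secret: bytes) -> str:
    """Deterministic, irreversible mask: same value and key always give the same token."""
    return hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()


def strip_pii(event: dict, pii_fields: set, secret: bytes):
    """Return a masked copy of the event plus lineage entries describing what was done."""
    masked = dict(event)
    lineage = []
    for field in sorted(f for f in pii_fields if f in event):
        masked[field] = mask_value(str(event[field]), secret)
        # Lineage records the field and the transform, never the original value.
        lineage.append({"field": field, "transform": "hmac-sha256"})
    return masked, lineage
```

Because the lineage entries carry only field names and transform identifiers, they can be indexed and searched for audits without exposing the sensitive values themselves. A reversible variant would replace `mask_value` with a lookup in an access-controlled token store.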
Easy | System Design
52 practiced
Design a cron-driven ETL job (single-node) that picks up CSV files dropped into an object store every hour, validates and transforms them, writes partitioned outputs, and supports atomic commit and easy rollback in case of failure. Describe file staging, atomic rename/manifest use, and steps to ensure partial failures do not expose incomplete data to consumers.
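The staging-plus-manifest idea can be sketched on a local filesystem, where `os.replace` gives an atomic rename. Object stores often lack an atomic rename, which is why this sketch treats the manifest swap, not the data move, as the commit point. The directory layout (`_staging/`, `data/`, `manifest.json`) and all function names are assumptions for the illustration.

```python
import json
import os


def read_manifest(output_dir: str) -> list:
    """Return the committed batch ids; consumers should read only these batches."""
    path = os.path.join(output_dir, "manifest.json")
    if not os.path.exists(path):
        return []
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)["batches"]


def _write_manifest(output_dir: str, batches: list) -> None:
    """Write the manifest to a temp file, then swap it into place atomically."""
    tmp = os.path.join(output_dir, "manifest.json.tmp")
    with open(tmp, "w", encoding="utf-8") as fh:
        json.dump({"batches": batches}, fh)
        fh.flush()
        os.fsync(fh.fileno())
    os.replace(tmp, os.path.join(output_dir, "manifest.json"))


def commit_batch(output_dir: str, batch_id: str, files: dict) -> None:
    """Stage outputs, move them into place, then publish the batch in the manifest."""
    staging = os.path.join(output_dir, "_staging", batch_id)
    final = os.path.join(output_dir, "data", batch_id)
    os.makedirs(staging, exist_ok=True)
    os.makedirs(os.path.join(output_dir, "data"), exist_ok=True)
    for name, content in files.items():
        with open(os.path.join(staging, name), "w", encoding="utf-8") as fh:
            fh.write(content)
    os.replace(staging, final)  # data is in place but not yet visible to consumers
    _write_manifest(output_dir, read_manifest(output_dir) + [batch_id])  # commit point


def rollback_batch(output_dir: str, batch_id: str) -> None:
    """Hide a batch from consumers by dropping it from the manifest."""
    remaining = [b for b in read_manifest(output_dir) if b != batch_id]
    _write_manifest(output_dir, remaining)
```

A failure before the manifest swap leaves only unreferenced files behind, so consumers never see a partial batch. Rollback is a manifest rewrite, and the data directory can be cleaned up later by a separate garbage-collection pass.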
Medium | Technical
41 practiced
Design idempotent retry logic for a transformation that enriches records and writes to an external service that does not support idempotent writes natively. Propose two strategies (e.g., dedupe on a write-side store, use of idempotency keys, transactional outbox) and explain how each affects latency, consistency, and storage needs.
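The first strategy named in the question, deduplication on a write-side store keyed by an idempotency key, might look like the sketch below. The key is derived from a canonical serialization of the record; the `send` callable, the in-memory `seen` set, and the class name `DedupingWriter` are hypothetical stand-ins for the external service client and a durable key store.

```python
import hashlib
import json


def idempotency_key(record: dict) -> str:
    """Derive a stable key from the record content, independent of dict ordering."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class DedupingWriter:
    """Wraps a non-idempotent `send` callable with retries and a dedupe store."""

    def __init__(self, send, seen=None):
        self._send = send
        # Assumption: an in-memory set stands in for a durable key store.
        self._seen = seen if seen is not None else set()

    def write(self, record: dict, max_attempts: int = 3) -> bool:
        """Return True if the record was sent, False if it was already written."""
        key = idempotency_key(record)
        if key in self._seen:
            return False
        last_error = None
        for _ in range(max_attempts):
            try:
                self._send(record)
            except ConnectionError as exc:
                last_error = exc
                continue
            self._seen.add(key)  # record success only after the send returns
            return True
        raise last_error
```

This narrows the duplicate window but does not close it: if the service applies the write and the acknowledgement is lost, or the process crashes before the key is stored, a retry can still write twice. That gap is what a transactional outbox or a service-side idempotency key is meant to address, at the cost of extra storage and an additional lookup on every write.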
