InterviewStack.io

Data Processing and Transformation Questions

Focuses on algorithmic and engineering approaches to transforming and cleaning data at scale. Topics include deduplication strategies, parsing and normalizing unstructured or semi-structured data, handling missing or inconsistent values, incremental and chunked processing for large datasets, batch versus streaming trade-offs, state management, efficient memory and compute usage, idempotency and error handling, and techniques for scaling and parallelizing transformation pipelines. Interviewers may assess problem solving, choice of algorithms and data structures, and pragmatic design for reliability and performance.

Hard · Technical
You suspect silent data corruption occurs after a compression step in a transformation pipeline. Describe a systematic debugging and prevention approach including end-to-end checksums, unit and integration tests, data sampling for forensics, and deployment safeguards to prevent corrupted data publication.
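A good answer usually starts with end-to-end checksums that travel with the data across the compression boundary. A minimal sketch of the idea, assuming gzip compression and SHA-256 digests (the function names here are illustrative, not from any particular library):

```python
import gzip
import hashlib


def sha256_of(payload: bytes) -> str:
    """Digest of the raw (uncompressed) payload."""
    return hashlib.sha256(payload).hexdigest()


def compress_with_checksum(payload: bytes) -> tuple[bytes, str]:
    # Record the digest BEFORE compression, so it can be verified
    # after decompression at the other end of the pipeline.
    return gzip.compress(payload), sha256_of(payload)


def verify_after_decompression(blob: bytes, expected_digest: str) -> bytes:
    payload = gzip.decompress(blob)
    if sha256_of(payload) != expected_digest:
        raise ValueError("checksum mismatch: silent corruption detected")
    return payload
```

Publishing is then gated on verification succeeding, which turns silent corruption into a loud, actionable failure; the digest can also be stored alongside the artifact for later forensics.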
Medium · System Design
Design monitoring and alerting for data quality in a production analytics pipeline. Include specific metrics (null-rate, distinct-key cardinality, schema change events, throughput deviations), alert thresholds, and remediation playbooks for high-severity incidents.
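One concrete building block for an answer is a per-column null-rate check with configurable alert thresholds. A minimal sketch, assuming columnar batches arrive as plain Python lists and a hypothetical default threshold of 5%:

```python
def null_rate(values: list) -> float:
    """Fraction of None values in a column (0.0 for an empty column)."""
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values)


def check_quality(batch: dict[str, list], thresholds: dict[str, float]) -> list:
    """Return (column, rate) alerts for columns whose null-rate
    exceeds their threshold; 0.05 is an assumed default."""
    alerts = []
    for column, values in batch.items():
        rate = null_rate(values)
        if rate > thresholds.get(column, 0.05):
            alerts.append((column, rate))
    return alerts
```

The same shape extends to the other metrics the question lists (distinct-key cardinality, throughput deviation): compute a scalar per batch, compare against a threshold, and emit an alert routed to the remediation playbook.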
Medium · Technical
Describe how to implement an external merge sort to sort a file of records larger than available memory. Provide pseudocode and explain the stages: creating sorted runs, merging them, the number of passes required, temporary storage needs, and how to tune for disk I/O and memory constraints.
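The two stages can be sketched compactly in Python for a newline-delimited text file. This version assumes a single merge pass (i.e. all run files can be held open at once); with more runs than file descriptors, the merge itself would be repeated in rounds:

```python
import heapq
import itertools
import os
import tempfile


def external_sort(infile: str, outfile: str, max_lines: int = 100_000) -> None:
    """Sort a newline-delimited file that may not fit in memory."""
    runs = []
    # Stage 1: read fixed-size chunks, sort each in memory,
    # and spill every sorted run to its own temporary file.
    with open(infile) as f:
        while True:
            chunk = list(itertools.islice(f, max_lines))
            if not chunk:
                break
            chunk.sort()
            run = tempfile.NamedTemporaryFile("w+", delete=False)
            run.writelines(chunk)
            run.seek(0)
            runs.append(run)
    # Stage 2: k-way merge of the sorted runs via a min-heap;
    # heapq.merge streams lines, so memory stays O(number of runs).
    with open(outfile, "w") as out:
        out.writelines(heapq.merge(*runs))
    for run in runs:
        run.close()
        os.unlink(run.name)
```

Tuning `max_lines` (or a byte budget) trades memory for run count: larger runs mean fewer files to merge and more sequential I/O per pass.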
Medium · Technical
You need to process a single 500 GB JSON file containing nested records on a machine with 1 GB memory and write a normalized CSV. Describe an algorithm and implementation approach (in Python) that processes the file in chunks, parses nested JSON safely, preserves order, and avoids OOM. Include how to handle malformed JSON lines.
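The core of a line-by-line answer looks like the sketch below, which assumes the file is line-delimited JSON (a true single-document 500 GB file would instead need an incremental parser such as ijson). Memory stays bounded because only one line is held at a time, order is preserved by streaming rows in input order, and malformed lines are logged and skipped rather than crashing the run:

```python
import csv
import json


def flatten(record: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into dotted column names, e.g. user.name."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, name + "."))
        else:
            flat[name] = value
    return flat


def jsonl_to_csv(src: str, dst: str, fieldnames: list, log=print) -> None:
    with open(src) as fin, open(dst, "w", newline="") as fout:
        writer = csv.DictWriter(fout, fieldnames=fieldnames,
                                extrasaction="ignore")
        writer.writeheader()
        for lineno, line in enumerate(fin, 1):
            try:
                record = flatten(json.loads(line))
            except json.JSONDecodeError:
                log(f"skipping malformed line {lineno}")
                continue
            writer.writerow(record)
```

Fixing `fieldnames` up front (from a schema or a first sampling pass) keeps the CSV header stable even when individual records omit fields.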
Hard · Technical
You have an expensive pandas transformation that OOMs on a 100M-row dataset. Propose and compare alternative solutions including chunked processing, Dask, Spark, PyArrow streaming, and rewriting critical parts in vectorized or native code. For each option discuss development complexity, performance, and memory characteristics.
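The lowest-complexity option, chunked processing, is worth sketching because it keeps the existing pandas code almost unchanged. A minimal example, assuming a hypothetical transformation (summing `price * qty` per `category`) whose partial results can be re-aggregated:

```python
import pandas as pd


def transform(chunk: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical per-chunk transformation: partial per-category totals.
    chunk = chunk.assign(total=chunk["price"] * chunk["qty"])
    return chunk.groupby("category", as_index=False)["total"].sum()


def process_in_chunks(path: str, chunksize: int = 1_000_000) -> pd.DataFrame:
    # Each chunk fits in memory; partial aggregates are combined at the end.
    partials = [transform(c) for c in pd.read_csv(path, chunksize=chunksize)]
    return (pd.concat(partials)
              .groupby("category", as_index=False)["total"].sum())
```

This pattern only works when the transformation is decomposable into per-chunk partials plus a final combine; operations needing global state (joins, window functions) are where Dask or Spark start to pay for their extra complexity.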
