Data Processing and Transformation Questions

Focuses on algorithmic and engineering approaches to transforming and cleaning data at scale. Includes deduplication strategies, parsing and normalizing unstructured or semi-structured data, handling missing or inconsistent values, incremental and chunked processing for large datasets, batch versus streaming trade-offs, state management, efficient memory and compute usage, idempotency and error handling, and techniques for scaling and parallelizing transformation pipelines. Interviewers may assess problem solving, choice of algorithms and data structures, and pragmatic design for reliability and performance.

Easy · Behavioral

Design a simple on-call runbook for an SRE responding to a data transformation job failure that causes missing data in dashboards. The runbook should include triage steps, safe mitigation (e.g., pause downstream consumers, re-run steps), communication guidelines, and minimal checks before declaring incident resolution.

Easy · Technical

Describe the main trade-offs between batch and streaming processing in a site reliability context: latency, cost, consistency, operational complexity, failure recovery, and reprocessing. Give three classes of workloads where batch is the better fit and three where streaming is the right choice for an SRE-run analytics pipeline.

Easy · System Design

Design a cron-driven ETL job (single-node) that picks up CSV files dropped into an object store every hour, validates and transforms them, writes partitioned outputs, and supports atomic commit and easy rollback in case of failure. Describe file staging, atomic rename/manifest use, and steps to ensure partial failures do not expose incomplete data to consumers.
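
One way to approach the commit step is sketched below, under loud assumptions: a local directory stands in for the object store, the manifest name `MANIFEST.json`, the `batches/` layout, and the `transform_row` rule are all hypothetical, and consumers are assumed to read only batches listed in the manifest. Many object stores do not offer an atomic rename, so the sketch treats the manifest swap as the single commit point rather than renaming data files.

```python
import csv
import json
import os
import tempfile
from pathlib import Path

MANIFEST = "MANIFEST.json"  # hypothetical manifest name


def read_manifest(out_root):
    """Return the list of committed batch ids (empty if nothing is committed)."""
    path = Path(out_root) / MANIFEST
    if not path.exists():
        return []
    return json.loads(path.read_text())["committed"]


def _write_manifest(out_root, committed):
    """Write a temp file, then atomically swap it in as the manifest."""
    out_root = Path(out_root)
    fd, tmp = tempfile.mkstemp(dir=out_root, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump({"committed": committed}, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, out_root / MANIFEST)  # atomic on a local filesystem


def transform_row(row):
    """Hypothetical validation and transform: trim values, require an id."""
    row = {k: (v or "").strip() for k, v in row.items()}
    if not row.get("id"):
        raise ValueError("row is missing an id")
    return row


def run_batch(input_csv, out_root, batch_id):
    """Stage one batch, then commit it by listing it in the manifest."""
    out_root = Path(out_root)
    batch_dir = out_root / "batches" / batch_id
    batch_dir.mkdir(parents=True, exist_ok=True)

    # Staging: files under batches/<batch_id>/ are invisible to consumers
    # until the manifest lists the batch, so a crash here exposes nothing.
    with open(input_csv, newline="") as src, \
            open(batch_dir / "part-0000.csv", "w", newline="") as dst:
        reader = csv.DictReader(src)
        if not reader.fieldnames:
            raise ValueError("input has no header row")
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            writer.writerow(transform_row(row))

    # Commit: idempotent, so re-running the same batch id changes nothing.
    committed = read_manifest(out_root)
    if batch_id not in committed:
        _write_manifest(out_root, committed + [batch_id])


def rollback(out_root, batch_id):
    """Hide a batch from consumers by removing it from the manifest."""
    committed = [b for b in read_manifest(out_root) if b != batch_id]
    _write_manifest(out_root, committed)
```

Because the hourly cron run can be retried with the same `batch_id`, a failed run simply overwrites its own staged file on the next attempt, and rollback never deletes data, which keeps it cheap and reversible.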

Hard · Technical

Design a stream-stream join at scale for two high-volume real-time feeds (orders and shipments) whose events arrive out of order. Discuss windowing semantics, state retention, join strategies (e.g., bloom join, hash join), how to size state, and how to plan for rebalancing. Include failure recovery semantics and operational monitoring.
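
The windowing and state-retention ideas can be illustrated with a small single-process sketch of a symmetric hash join. It is not how any particular stream processor implements joins; the class name `IntervalJoin`, the integer event times, and the watermark rule (maximum event time seen minus an allowed lateness) are assumptions chosen for illustration. Each side buffers events by join key, probes the other side on arrival, and evicts state that can no longer match.

```python
from collections import defaultdict


class IntervalJoin:
    """Symmetric hash join of 'order' and 'shipment' events on a shared key.

    Two events match when their event times differ by at most `window`.
    """

    def __init__(self, window, allowed_lateness):
        self.window = window
        self.allowed_lateness = allowed_lateness
        self.state = {"order": defaultdict(list), "shipment": defaultdict(list)}
        self.max_event_time = float("-inf")
        self.late_dropped = 0  # worth exporting as a monitoring metric

    @property
    def watermark(self):
        # Assumed heuristic: events older than this are treated as late.
        return self.max_event_time - self.allowed_lateness

    def process(self, side, key, event_time, payload):
        """Ingest one event; return (key, order, shipment) tuples it completes."""
        if event_time < self.watermark:
            self.late_dropped += 1  # partners may already be evicted
            return []
        self.max_event_time = max(self.max_event_time, event_time)

        other = "shipment" if side == "order" else "order"
        matches = []
        for other_time, other_payload in self.state[other].get(key, []):
            if abs(event_time - other_time) <= self.window:
                if side == "order":
                    matches.append((key, payload, other_payload))
                else:
                    matches.append((key, other_payload, payload))

        self.state[side][key].append((event_time, payload))
        self._evict()
        return matches

    def _evict(self):
        # Anything older than watermark - window cannot match a future
        # on-time event. A full scan per event is fine for a sketch; a real
        # system would index state by time or evict on a timer.
        horizon = self.watermark - self.window
        for buffers in self.state.values():
            for key in list(buffers):
                kept = [(t, p) for t, p in buffers[key] if t >= horizon]
                if kept:
                    buffers[key] = kept
                else:
                    del buffers[key]

    def state_size(self):
        """Number of buffered events across both sides."""
        return sum(len(v) for b in self.state.values() for v in b.values())
```

In this model the retained state is roughly the event rate multiplied by `window + allowed_lateness` per side, which gives a starting point for the sizing discussion; `late_dropped` and `state_size()` are the kinds of signals the monitoring part of an answer could build on.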

Easy · Technical

Write a short Python program that parallelizes a CPU-bound transformation across available cores using multiprocessing. The program must read a large input file in chunks, apply a provided transform function, and write outputs in order. Provide the API and explain how you maintain ordering and manage inter-process communication buffer sizes to avoid OOM.
