
Data Manipulation and Transformation Questions

Encompasses techniques and best practices for cleaning, transforming, and preparing data for analysis and production systems. Candidates should be able to handle missing values, duplicates, inconsistency resolution, normalization and denormalization, data typing and casting, and validation checks. Expect discussion of writing robust code that handles edge cases such as empty datasets and null values, defensive data validation, unit and integration testing for transformations, and strategies for performance and memory efficiency. At more senior levels, expect questions on the design of scalable, debuggable, and maintainable data pipelines and transformation architectures: idempotency, schema evolution, batch-versus-streaming trade-offs, observability and monitoring, versioning and reproducibility, and tool selection (SQL, pandas, Spark, or dedicated ETL frameworks).

Medium · Technical
Implement a thread-safe producer-consumer pipeline in Python that reads streaming records from a socket and applies CPU-bound transformations. Provide a code sketch using multiprocessing or concurrent.futures to avoid GIL-related bottlenecks, describe backpressure mechanisms, and outline graceful shutdown handling.
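One possible sketch of an answer (the `transform` and `run_pipeline` names, and the in-memory record source standing in for the socket, are illustrative assumptions): a producer thread feeds a bounded `queue.Queue`, which provides backpressure for free because `put` blocks when the queue is full; the CPU-bound transform runs in a `ProcessPoolExecutor` to sidestep the GIL; and a sentinel value signals graceful shutdown.

```python
import multiprocessing
import queue
import threading
from concurrent.futures import ProcessPoolExecutor

SENTINEL = object()  # shutdown marker; never crosses a process boundary


def transform(record):
    """CPU-bound work; runs in a worker process, outside the GIL."""
    return record * record


def run_pipeline(records, max_queue=100, workers=2):
    # Bounded queue => backpressure: the producer blocks when consumers lag.
    q = queue.Queue(maxsize=max_queue)

    def producer():
        for rec in records:   # stand-in for reading records from a socket
            q.put(rec)        # blocks when the queue is full
        q.put(SENTINEL)       # graceful-shutdown signal

    t = threading.Thread(target=producer)
    t.start()

    # "fork" keeps the sketch self-contained on POSIX; on Windows/macOS
    # use "spawn" plus an `if __name__ == "__main__":` guard.
    ctx = multiprocessing.get_context("fork")
    with ProcessPoolExecutor(max_workers=workers, mp_context=ctx) as pool:
        futures = []
        while True:
            item = q.get()
            if item is SENTINEL:
                break
            futures.append(pool.submit(transform, item))
        results = [f.result() for f in futures]  # preserves input order
    t.join()
    return results
```

Draining the queue to a sentinel before exiting the `with` block means in-flight work finishes before the pool shuts down; a production version would also handle worker crashes and socket timeouts.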
Medium · System Design
Design a fault-tolerant ETL pipeline that loads daily web traffic logs (1 TB/day) into a data warehouse. Requirements: a 3-hour daily completion SLA, safe retries with minimal duplicates, the ability to backfill, schema-drift detection, and low operational overhead. Sketch the components (ingest, buffer, transform, sink), storage formats, and key failure modes with mitigations.
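One building block worth sketching in an answer is the safe-retry requirement: load each day's output into a staging location, then swap it into place, so a retry after a partial failure overwrites rather than appends. A minimal local-filesystem sketch (the function name and `day=` path layout are illustrative; an object store or Hive-style partition swap would make the replace step atomic):

```python
import os
import shutil
import tempfile


def load_partition(day, records, warehouse_dir):
    """Idempotently load one day's partition: stage, then swap into place."""
    # Stage inside the warehouse dir so the final rename stays on one filesystem.
    staging = tempfile.mkdtemp(prefix=f"_staging_{day}_", dir=warehouse_dir)
    with open(os.path.join(staging, "part-0000.csv"), "w") as f:
        for rec in records:
            f.write(rec + "\n")
    final = os.path.join(warehouse_dir, f"day={day}")
    if os.path.exists(final):
        shutil.rmtree(final)    # replace the partition, never append to it
    os.replace(staging, final)  # rename; retries leave exactly one copy
```

Because a retry rewrites the whole partition, at-least-once delivery upstream does not create duplicates downstream, which also makes backfills a matter of re-running the same load.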
Medium · Technical
You need to backfill 6 months of corrected data after fixing a transformation bug. Describe a safe backfill strategy that avoids duplicate downstream data, minimizes impact on production pipelines, and controls cost. Cover dry-run, separate namespace for outputs, idempotent writes, throttling, and verification steps.
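A sketch of the dry-run and separate-namespace pieces of such a strategy (`read_raw` and `write` are hypothetical I/O helpers, and the `backfill/` prefix is an illustrative namespace, not a given):

```python
import hashlib


def backfill_partition(day, transform, read_raw, write, dry_run=False):
    """Recompute one day of data into a separate backfill namespace.

    `read_raw` and `write` are hypothetical I/O helpers. With `dry_run`,
    report the row count and checksum without writing anything, so the
    corrected output can be verified before any promotion step.
    """
    rows = [transform(rec) for rec in read_raw(day)]
    checksum = hashlib.sha256(repr(rows).encode()).hexdigest()
    if not dry_run:
        # Separate namespace: production paths stay untouched until a
        # later, explicit promote step swaps verified partitions in.
        write(f"backfill/day={day}", rows)
    return len(rows), checksum
```

A driver would loop this over the 6 months of affected days with a concurrency cap (the throttling step), compare checksums and row counts against expectations, and only then promote, which keeps downstream tables free of duplicates.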
Easy · Technical
Given a table events(event_id PK, user_id, event_time TIMESTAMP, payload JSON), write a standard SQL query to de-duplicate events per (user_id, payload) keeping only the row with the latest event_time. Explain how your query treats NULL event_time and how you would break ties deterministically. Provide a brief example input and expected output.
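A sketch of one possible answer, runnable here on SQLite (the sample rows are illustrative): `ROW_NUMBER()` partitioned by `(user_id, payload)`, ordered so the latest `event_time` wins, `NULL` event times sort last, and `event_id` breaks exact-timestamp ties deterministically.

```python
import sqlite3

DEDUP_SQL = """
SELECT event_id, user_id, event_time, payload
FROM (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY user_id, payload
               -- latest first; NULL event_time loses; event_id breaks ties
               ORDER BY event_time DESC NULLS LAST, event_id DESC
           ) AS rn
    FROM events
) AS ranked
WHERE rn = 1
"""

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (event_id INTEGER PRIMARY KEY,"
            " user_id TEXT, event_time TIMESTAMP, payload TEXT)")
con.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", [
    (1, "u1", "2024-01-01", '{"a": 1}'),  # older duplicate: dropped
    (2, "u1", "2024-01-02", '{"a": 1}'),  # latest in its group: kept
    (3, "u1", None,         '{"a": 1}'),  # NULL time sorts last: dropped
    (4, "u2", "2024-01-01", '{"b": 2}'),  # only row in its group: kept
])
kept = sorted(row[0] for row in con.execute(DEDUP_SQL))  # event_ids 2 and 4
```

Note that partitioning on a raw JSON column compares payloads as text, so semantically equal but differently serialized payloads would not be grouped; a real warehouse answer should call that out.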
Medium · Technical
You notice an increasing trend of late-arriving events that cause daily reporting SLAs to miss targets. Propose specific monitoring metrics (e.g., lateness distribution, percent on-time), alerting thresholds, and a runbook to triage and mitigate late data. Include short-term and long-term remediation steps.
