This topic covers techniques and best practices for cleaning, transforming, and preparing data for analysis and production systems. Candidates should be able to handle missing values, duplicates, and inconsistent records; apply normalization and denormalization; manage data types and casting; and write validation checks. Expect discussion of writing robust code that handles edge cases such as empty datasets and null values, defensive data validation, unit and integration testing for transformations, and strategies for performance and memory efficiency. At more senior levels, expect questions on designing scalable, debuggable, and maintainable data pipelines and transformation architectures, including idempotency, schema evolution, batch versus streaming trade-offs, observability and monitoring, versioning and reproducibility, and tool selection such as SQL, pandas, Spark, or dedicated ETL frameworks.
Easy · Technical
Write a SQL query using window functions to compute, for each user and each date, the rolling 7-day sum of transactions. Given table transactions(id, user_id, amount, occurred_at TIMESTAMP), the output should include dates with no transactions (return 0 for those days). Explain how you generate a calendar of dates and any timezone assumptions you make.
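One possible answer can be sketched as follows. This is a minimal sketch, not a definitive solution: it assumes the SQLite dialect (run through Python's `sqlite3` so the query is executable), that `occurred_at` is stored as ISO-8601 text already in UTC, and that the reporting range is passed in as the hypothetical parameters `:start_day` and `:end_day`. The calendar comes from a recursive CTE; in PostgreSQL, `generate_series` would typically play that role. Because the calendar is dense (one row per user per day), a frame of 6 preceding rows plus the current row covers exactly 7 days.

```python
import sqlite3

# Assumptions: SQLite dialect, occurred_at stored as UTC ISO-8601 text,
# and the set of users is taken from the transactions table itself.
QUERY = """
WITH RECURSIVE calendar(day) AS (
    SELECT :start_day
    UNION ALL
    SELECT date(day, '+1 day') FROM calendar WHERE day < :end_day
),
users AS (
    SELECT DISTINCT user_id FROM transactions
),
daily AS (
    -- Collapse to one row per user per calendar day (UTC).
    SELECT user_id, date(occurred_at) AS day, SUM(amount) AS total
    FROM transactions
    GROUP BY user_id, date(occurred_at)
)
SELECT u.user_id,
       c.day,
       SUM(COALESCE(d.total, 0)) OVER (
           PARTITION BY u.user_id
           ORDER BY c.day
           ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
       ) AS rolling_7d_sum
FROM users u
CROSS JOIN calendar c
LEFT JOIN daily d ON d.user_id = u.user_id AND d.day = c.day
ORDER BY u.user_id, c.day
"""


def rolling_7d(conn, start_day, end_day):
    """Return (user_id, day, rolling_7d_sum) rows for every user and day in range."""
    params = {"start_day": start_day, "end_day": end_day}
    return conn.execute(QUERY, params).fetchall()


def build_demo_db():
    """Create an in-memory table with a few illustrative transactions."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE transactions ("
        "id INTEGER PRIMARY KEY, user_id INTEGER, amount REAL, occurred_at TEXT)"
    )
    conn.executemany(
        "INSERT INTO transactions (user_id, amount, occurred_at) VALUES (?, ?, ?)",
        [
            (1, 10.0, "2024-01-01 10:00:00"),
            (1, 5.0, "2024-01-03 08:30:00"),
            (1, 2.0, "2024-01-08 23:59:00"),
            (2, 7.0, "2024-01-02 12:00:00"),
        ],
    )
    return conn
```

Calling `rolling_7d(build_demo_db(), "2024-01-01", "2024-01-09")` returns one row per user per day, including days with no transactions. Note that the first six days of the range only see transactions on or after `:start_day`; to get full windows at the start, the calendar would need to begin six days earlier and the extra rows be filtered out afterwards.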
Medium · Behavioral
Tell me about a time when you had to prioritize multiple reliability work items (monitoring, refactor, incident fix) with limited team bandwidth. Describe the situation, how you evaluated risk and impact, the decision process, communication with stakeholders, and the outcome.
Medium · System Design
Design a fault-tolerant ETL pipeline to load daily web traffic logs (1 TB/day) into a data warehouse. Requirements: a daily load with a 3-hour SLA, safe retries with minimal duplicates, the ability to backfill, schema drift detection, and low operational overhead. Sketch the components (ingest, buffer, transform, sink), storage formats, and key failure modes with mitigations.
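The schema drift requirement can be illustrated with a small check that runs before the transform step. This is a minimal sketch under stated assumptions: records arrive as Python dictionaries, the expected schema is a hand-maintained mapping from column name to allowed Python type names, and all names here (`infer_schema`, `detect_drift`, `check_batch`, `DriftReport`) are hypothetical. A production pipeline would more likely compare against a schema registry or the file format's own metadata.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, Mapping, Set, Tuple

# Column name -> set of Python type names observed or allowed for that column.
Schema = Dict[str, Set[str]]


def infer_schema(records: Iterable[Mapping[str, Any]]) -> Schema:
    """Collect the type names seen per column, ignoring None values."""
    schema: Schema = {}
    for record in records:
        for column, value in record.items():
            seen = schema.setdefault(column, set())
            if value is not None:
                seen.add(type(value).__name__)
    return schema


@dataclass
class DriftReport:
    added: Set[str] = field(default_factory=set)
    removed: Set[str] = field(default_factory=set)
    # column -> (expected types, observed types)
    type_changed: Dict[str, Tuple[Set[str], Set[str]]] = field(default_factory=dict)

    @property
    def has_drift(self) -> bool:
        return bool(self.added or self.removed or self.type_changed)


def detect_drift(expected: Schema, observed: Schema) -> DriftReport:
    """Compare two schemas and report added, removed, and retyped columns."""
    report = DriftReport(
        added=set(observed) - set(expected),
        removed=set(expected) - set(observed),
    )
    for column in set(expected) & set(observed):
        # An all-null column has an empty observed set and is treated as compatible.
        if observed[column] - expected[column]:
            report.type_changed[column] = (expected[column], observed[column])
    return report


def check_batch(expected: Schema, records: Iterable[Mapping[str, Any]]) -> DriftReport:
    """Check one batch; an empty batch is reported as no drift, not as all columns removed."""
    batch = list(records)
    if not batch:
        return DriftReport()
    return detect_drift(expected, infer_schema(batch))
```

Whether drift should fail the load, quarantine the batch, or only raise an alert is a design choice worth stating in the answer; additive columns are often tolerated while removed or retyped columns block the run.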
Medium · System Design
Design an observability architecture for data transformations that supports per-batch and per-record metrics, lineage tracing, schema-change alerts, and easy drill-down from dashboards to raw records for debugging. Discuss storage choices for metrics, traces, lineage metadata, and retention/cost trade-offs.
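The per-batch metrics and drill-down parts of this question can be made concrete with a small wrapper around a transformation. This is only a sketch, assuming records are dictionaries and the transform is a plain Python callable; `BatchMetrics`, `run_with_metrics`, and the `_batch_id` column are hypothetical names. Each output row is tagged with the batch identifier so that a dashboard anomaly can be traced back to the rows and rejected samples of one specific run.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Iterable, List, Tuple


@dataclass
class BatchMetrics:
    batch_id: str
    source: str
    rows_in: int = 0
    rows_out: int = 0
    rows_rejected: int = 0
    duration_s: float = 0.0
    # A bounded sample of failing records, kept for debugging drill-down.
    rejected_samples: List[Dict[str, Any]] = field(default_factory=list)


def run_with_metrics(
    source: str,
    records: Iterable[Dict[str, Any]],
    transform: Callable[[Dict[str, Any]], Dict[str, Any]],
    max_samples: int = 5,
) -> Tuple[List[Dict[str, Any]], BatchMetrics]:
    """Apply transform to each record, counting successes and rejections."""
    metrics = BatchMetrics(batch_id=str(uuid.uuid4()), source=source)
    output: List[Dict[str, Any]] = []
    started = time.perf_counter()
    for record in records:
        metrics.rows_in += 1
        try:
            result = transform(record)
        except (KeyError, ValueError, TypeError) as exc:
            metrics.rows_rejected += 1
            if len(metrics.rejected_samples) < max_samples:
                metrics.rejected_samples.append(
                    {"record": record, "error": repr(exc)}
                )
            continue
        # Tag the row so it can be joined back to this batch's metrics.
        output.append({**result, "_batch_id": metrics.batch_id})
        metrics.rows_out += 1
    metrics.duration_s = time.perf_counter() - started
    return output, metrics
```

In a fuller design the `BatchMetrics` record would be shipped to a metrics store and the rejected samples to cheaper object storage with a shorter retention period, which is where the retention and cost trade-offs in the question come in.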
Hard · Technical
Design an idempotent write strategy for streaming sinks such as S3 or a data warehouse so that repeated retries or producer restarts do not produce duplicate rows. Consider idempotency tokens, deterministic partitioning and file naming, transactional warehouse writes, and deduplication strategies. Explain trade-offs between complexity and performance.
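Deterministic file naming combined with key-based deduplication can be sketched as below. This is an illustrative sketch, not a definitive implementation: `InMemoryStore` is a hypothetical stand-in for an object store whose put operation overwrites by key, `event_id` is an assumed idempotency token carried by each record, and batches are assumed to arrive as `(offset, record)` pairs from a single source partition. The approach only works if a retry or restart replays the same offsets into the same batch boundaries; otherwise overlapping files with different names could still appear.

```python
import json
from typing import Any, Dict, List, Optional, Tuple

Record = Dict[str, Any]


class InMemoryStore:
    """Hypothetical object store stand-in: put() overwrites any existing key."""

    def __init__(self) -> None:
        self.objects: Dict[str, bytes] = {}

    def put(self, key: str, body: bytes) -> None:
        self.objects[key] = body


def object_key(topic: str, partition: int, first_offset: int, last_offset: int) -> str:
    """Derive the file name purely from the batch's position in the source."""
    return (
        f"{topic}/partition={partition:05d}/"
        f"offsets={first_offset:020d}-{last_offset:020d}.jsonl"
    )


def dedupe(records: List[Record], key_field: str = "event_id") -> List[Record]:
    """Keep the first record per idempotency token, preserving input order."""
    first_seen: Dict[Any, Record] = {}
    for record in records:
        first_seen.setdefault(record[key_field], record)
    return list(first_seen.values())


def serialize(records: List[Record]) -> bytes:
    """Serialize deterministically so a retry produces byte-identical content."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in records).encode("utf-8")


def write_batch(
    store: InMemoryStore,
    topic: str,
    partition: int,
    batch: List[Tuple[int, Record]],
) -> Optional[str]:
    """Write one batch; retrying with the same batch overwrites the same object."""
    if not batch:
        return None  # Nothing to write for an empty batch.
    key = object_key(topic, partition, batch[0][0], batch[-1][0])
    store.put(key, serialize(dedupe([record for _, record in batch])))
    return key
```

The trade-off is that duplicates spanning different batches are not caught here; handling those usually needs a downstream step such as a warehouse merge keyed on the idempotency token, which adds cost and latency.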