This topic covers techniques and best practices for cleaning, transforming, and preparing data for analysis and production systems. Candidates should be able to handle missing values, duplicates, inconsistency resolution, normalization and denormalization, data typing and casting, and validation checks. Expect discussion of writing robust code that handles edge cases such as empty datasets and null values, defensive data validation, unit and integration testing for transformations, and strategies for performance and memory efficiency. At more senior levels, expect design of scalable, debuggable, and maintainable data pipelines and transformation architectures: idempotency, schema evolution, batch versus streaming trade-offs, observability and monitoring, versioning and reproducibility, and tool selection (SQL, pandas, Spark, or dedicated ETL frameworks).
Easy · Technical
You receive daily CSV files from an external vendor. List at least five validation checks you would perform during ingestion (for example: schema conformance, duplicate detection, date range checks, domain/value range checks, referential integrity). For each check explain typical failure modes, how you would notify stakeholders, and whether you would quarantine, reject, or auto-correct bad files.
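A minimal sketch of what such ingestion checks might look like in pandas. The column names, dtypes, and business rules (positive amounts, no future dates) are hypothetical examples, not a prescription; in production each failed check would also feed the notify/quarantine decision the question asks about.

```python
import io
import pandas as pd

# Hypothetical vendor contract: expected columns for the daily file.
EXPECTED_COLUMNS = {"order_id", "customer_id", "order_date", "amount"}

def validate_csv(raw: str, known_customers: set) -> list:
    """Run ingestion checks on a vendor CSV; return a list of issue strings."""
    issues = []
    df = pd.read_csv(io.StringIO(raw))
    # 1. Schema conformance: required columns present.
    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        issues.append(f"missing columns: {sorted(missing)}")
        return issues  # remaining checks depend on the schema
    # 2. Duplicate detection on the business key.
    dupes = int(df["order_id"].duplicated().sum())
    if dupes:
        issues.append(f"{dupes} duplicate order_id rows")
    # 3. Date range: dates must parse and not lie in the future.
    dates = pd.to_datetime(df["order_date"], errors="coerce")
    if dates.isna().any():
        issues.append("unparseable order_date values")
    elif (dates > pd.Timestamp.now()).any():
        issues.append("future-dated orders")
    # 4. Domain/value range: amounts must be positive.
    if (df["amount"] <= 0).any():
        issues.append("non-positive amounts")
    # 5. Referential integrity: customers must already exist.
    unknown = set(df["customer_id"]) - known_customers
    if unknown:
        issues.append(f"unknown customer_ids: {sorted(unknown)}")
    return issues
```

An empty list means the file can be loaded; a non-empty list would typically quarantine the file and trigger a stakeholder notification rather than auto-correct.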
Hard · System Design
Design monitoring alerts that differentiate between data freshness issues, schema changes, and logic regressions in production pipelines to reduce pager noise. Provide example metrics and thresholds for each category, describe how to correlate signals before alerting, and outline a remediation playbook for on-call engineers.
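One way to correlate signals before alerting is to classify them in a fixed priority order, so a schema break does not also fire a freshness page. The thresholds and category names below are illustrative assumptions, not standard values:

```python
from datetime import datetime, timedelta

# Illustrative thresholds; real values would be tuned per dataset.
FRESHNESS_WARN = timedelta(hours=2)   # data older than this: warn only
FRESHNESS_PAGE = timedelta(hours=6)   # data older than this: page on-call

def classify_signal(last_loaded_at: datetime,
                    observed_columns: set,
                    expected_columns: set,
                    row_count: int,
                    baseline_row_count: int,
                    now: datetime) -> str:
    """Collapse freshness, schema, and volume signals into one alert category."""
    lag = now - last_loaded_at
    # Schema change dominates: a freshness gap caused by a broken schema
    # should surface as one "schema" alert, not two pages.
    if observed_columns != expected_columns:
        return "schema-change"
    if lag > FRESHNESS_PAGE:
        return "freshness-page"
    if lag > FRESHNESS_WARN:
        return "freshness-warn"
    # Logic-regression proxy: row count drifts far from baseline even
    # though the data is fresh and the schema is intact.
    if baseline_row_count and abs(row_count - baseline_row_count) / baseline_row_count > 0.5:
        return "logic-regression"
    return "ok"
```

The 50% row-count drift and the two freshness tiers are placeholders; the point is that each category maps to a different remediation playbook entry.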
Hard · Technical
A transformation job produced different outputs across two runs even though code did not change. Walk through the debugging steps you would take: verifying input snapshots, checking code and dependency versions, comparing environment differences, finding non-deterministic operations (unordered aggregations, non-stable joins), checking random seeds, and sampling. What tooling and processes would you implement to prevent recurrence?
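A small pandas illustration of one such non-deterministic operation and its fix (the "latest event per user" task and column names are made up for the example): `drop_duplicates` keeps whichever row arrives first, so the output depends on upstream file or partition order, while imposing a total order first makes the result stable.

```python
import pandas as pd

def latest_per_user_nondeterministic(df: pd.DataFrame) -> pd.DataFrame:
    # BUG: keeps whichever row appears first in the input, so the output
    # changes when upstream read order changes between runs.
    return df.drop_duplicates("user_id")

def latest_per_user_deterministic(df: pd.DataFrame) -> pd.DataFrame:
    # Fix: impose a total order (timestamp, then a unique tiebreaker)
    # before taking one row per key; any input ordering now yields the
    # same result.
    return (df.sort_values(["user_id", "ts", "event_id"])
              .groupby("user_id", as_index=False)
              .tail(1)
              .sort_values("user_id")
              .reset_index(drop=True))
```

The same idea generalizes: pin a deciding sort key wherever "first"/"last" semantics are implicit, and diff outputs of repeated runs in CI to catch regressions.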
Medium · System Design
You maintain a Parquet-based data lake consumed by multiple teams. Describe a strategy to handle schema evolution when producers add, remove, or rename columns. Discuss backward/forward compatibility, nullable defaults, column renames and migration plans, use of schema registries (Avro/Protobuf), and consumer-side defensive parsing.
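Consumer-side defensive parsing can be sketched as conforming each decoded batch to a fixed contract. The columns, dtypes, defaults, and rename map below are hypothetical stand-ins for what a schema registry or migration plan would supply:

```python
import pandas as pd

# Hypothetical consumer contract: expected columns with nullable dtypes.
EXPECTED = {"user_id": "Int64", "country": "string", "spend": "Float64"}
DEFAULTS = {"country": "unknown", "spend": 0.0}
# Known producer renames (old name -> new name) from a migration plan.
RENAMES = {"cust_id": "user_id"}

def conform(df: pd.DataFrame) -> pd.DataFrame:
    """Defensively coerce a producer batch to the consumer's schema."""
    df = df.rename(columns=RENAMES)
    # Columns the producer added are dropped; columns it removed are
    # re-created as all-null, then filled with defaults, so downstream
    # code never hits a KeyError.
    df = df.reindex(columns=list(EXPECTED))
    for col, default in DEFAULTS.items():
        df[col] = df[col].fillna(default)
    return df.astype(EXPECTED)
```

This keeps old consumers working against new producers (forward compatibility) and new consumers working against old files (backward compatibility), at the cost of silently dropping unrecognized columns, which should at least be logged.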
Hard · System Design
Design a streaming analytics pipeline using Kafka + a stream processor (e.g., Flink or Spark Structured Streaming) to compute 1-minute rolling metrics (such as active users per minute) from user events that may arrive out-of-order and late. Explain event-time vs processing-time, watermark strategy, windowing choices, state management, checkpointing, scaling, and how to surface corrections in downstream dashboards.
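The event-time mechanics in this question can be illustrated without Flink or Spark: a single-process sketch of tumbling event-time windows with a watermark derived from the maximum event time seen so far. The allowed-lateness value and the `(event_time, user)` tuple shape are assumptions for the example; a real processor also keeps fault-tolerant state and checkpoints.

```python
from collections import defaultdict

def windowed_counts(events, window_ms=60_000, allowed_lateness_ms=30_000):
    """Count distinct users per 1-minute event-time window, dropping
    events that arrive after the watermark has closed their window.

    events: iterable of (event_time_ms, user_id) in *arrival* order,
    possibly out of event-time order.
    """
    windows = defaultdict(set)   # window start -> distinct users (open state)
    closed = {}                  # window start -> final count (emitted)
    max_event_time = 0
    for event_time, user in events:
        max_event_time = max(max_event_time, event_time)
        watermark = max_event_time - allowed_lateness_ms
        start = (event_time // window_ms) * window_ms
        if start + window_ms <= watermark:
            continue             # too late: this window was already closed
        windows[start].add(user)
        # Emit and drop every window that now lies entirely below the watermark.
        for s in [s for s in windows if s + window_ms <= watermark]:
            closed[s] = len(windows.pop(s))
    # Flush still-open windows at end of stream.
    for s, users in windows.items():
        closed[s] = len(users)
    return closed
```

Raising `allowed_lateness_ms` trades result latency for completeness; in a real pipeline, events arriving after window close would instead feed a correction path to downstream dashboards rather than be silently dropped.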