InterviewStack.io

Data Manipulation and Transformation Questions

Encompasses techniques and best practices for cleaning, transforming, and preparing data for analysis and production systems. Candidates should be able to handle missing values, duplicates, inconsistency resolution, normalization and denormalization, data typing and casting, and validation checks. Expect discussion of writing robust code that handles edge cases such as empty datasets and null values, defensive data validation, unit and integration testing for transformations, and strategies for performance and memory efficiency. At more senior levels, expect questions on designing scalable, debuggable, and maintainable data pipelines and transformation architectures: idempotency, schema evolution, batch-versus-streaming trade-offs, observability and monitoring, versioning and reproducibility, and tool selection (SQL, pandas, Spark, or dedicated ETL frameworks).

Medium · Technical
68 practiced
Compare categorical encoding techniques (label encoding, one-hot, binary, target/mean encoding). For each, describe when it is appropriate, common pitfalls such as leakage in target encoding, and how to apply safely within cross-validation or online inference.
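One leakage-safe pattern worth knowing for this question is out-of-fold target encoding: each row's encoding is computed only from folds that row does not belong to. The sketch below is illustrative, not a library API; the function name, the smoothing scheme, and the random fold assignment are all assumptions for the example.

```python
import numpy as np
import pandas as pd

def oof_target_encode(df, col, target, n_splits=5, smoothing=10.0, seed=0):
    """Out-of-fold target (mean) encoding: each row is encoded using
    statistics computed only from the other folds, which avoids leaking
    the row's own target into its feature."""
    rng = np.random.default_rng(seed)
    fold = rng.integers(0, n_splits, size=len(df))
    global_mean = df[target].mean()
    encoded = np.full(len(df), global_mean, dtype=float)
    for k in range(n_splits):
        train = df[fold != k]
        stats = train.groupby(col)[target].agg(["mean", "count"])
        # Shrink rare categories toward the global mean to reduce variance
        smooth = (stats["mean"] * stats["count"] + global_mean * smoothing) / (
            stats["count"] + smoothing
        )
        mask = fold == k
        # Categories unseen in the training folds fall back to the global mean
        encoded[mask] = df.loc[mask, col].map(smooth).fillna(global_mean).to_numpy()
    return pd.Series(encoded, index=df.index, name=f"{col}_te")
```

For online inference, the same idea applies: fit the category statistics on training data only and ship them as a lookup table, with the global mean as the fallback for unseen categories.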
Easy · Technical
106 practiced
Explain the difference between missing values, nulls, NaN, and empty strings in tabular data. For each, give an example of how it may appear in CSV, SQL, and JSON sources, and state when you would treat it as a missing value versus a valid value for downstream analysis. Describe potential pitfalls when using automated schema or type inference.
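A quick way to see these distinctions in practice is pandas' CSV and JSON handling. The sketch below (sample data invented for illustration) shows how an empty CSV field and the token "NA" both become NaN under default inference, how `keep_default_na=False` preserves them as literal strings, and how JSON's `null` and a missing key collapse to the same NA while an empty string survives as a value.

```python
import io
import json
import pandas as pd

csv_text = "a,b\n1,\n2,NA\n3,x\n"

# Default type/NA inference: the empty field and the "NA" token both parse as NaN
df = pd.read_csv(io.StringIO(csv_text))
print(df["b"].isna().tolist())   # [True, True, False]

# Disable default NA filtering: empty fields stay "" and "NA" stays a string,
# so the pipeline, not the parser, decides what counts as missing
raw = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)
print(raw["b"].tolist())          # ['', 'NA', 'x']

# JSON: null and an absent key both become NA after normalization,
# while an explicit empty string remains a valid value
records = json.loads('[{"b": null}, {}, {"b": ""}]')
norm = pd.json_normalize(records)
print(norm["b"].isna().tolist())  # [True, True, False]
```

This is exactly the pitfall with automated inference: the parser silently merges distinct source representations unless told otherwise.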
Hard · System Design
59 practiced
Describe how to enforce and evolve schemas in a streaming pipeline using Avro/Protobuf/JSON Schema with a schema registry. Cover how to handle backward and forward incompatible changes, consumer upgrades, and requirements for replay/backfill without breaking consumers.
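The core of backward compatibility can be sketched without a registry: a new (reader) schema can decode data written with the old schema only if every field it adds carries a default, and type changes are restricted to sanctioned promotions. The checker below is a deliberately simplified illustration in the spirit of Avro's rules, not any registry's actual compatibility algorithm; the field-dict representation is an assumption for the example.

```python
def is_backward_compatible(old_fields: dict, new_fields: dict) -> bool:
    """Fields are {name: {"type": ..., "default": ...?}} mappings.
    Returns True if a reader using new_fields can decode records
    written with old_fields."""
    for name, spec in new_fields.items():
        if name not in old_fields:
            # An added field must have a default, so records written
            # before it existed can still be decoded
            if "default" not in spec:
                return False
        elif spec["type"] != old_fields[name]["type"]:
            # Real systems allow a short list of type promotions
            # (e.g. int -> long); this sketch disallows all changes
            return False
    # Removing a field is backward compatible: the reader simply
    # ignores data it no longer declares
    return True
```

Forward compatibility is the mirror image (old readers must tolerate data written with the new schema), and replay/backfill generally requires full (both-direction) compatibility across every version still present in the topic.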
Medium · System Design
60 practiced
Design a deduplication strategy for streaming events produced with at-least-once semantics. Describe how you'd implement deduplication both in a streaming engine (e.g., Flink or Spark Structured Streaming) and as an offline batch job. Include use of event IDs, windowing, watermarking, and state TTL to bound memory.
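The essence of the streaming half is keyed state with a TTL: remember each event ID just long enough to catch redeliveries, then expire it to bound memory. The class below is a single-process illustration of that idea; the names and the scan-based eviction are assumptions for the example, whereas a real engine (Flink keyed state with TTL, or `dropDuplicates` with a watermark in Spark Structured Streaming) expires state via timers rather than rescanning.

```python
import time

class StreamingDeduper:
    """Drop duplicate events by ID, with a TTL on remembered IDs so
    state (and memory) stays bounded. Illustrative only."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock        # injectable for testing
        self.seen = {}            # event_id -> first-seen timestamp

    def is_new(self, event_id: str) -> bool:
        now = self.clock()
        # Evict expired state; a real engine does this with timers,
        # not a full scan on every event
        self.seen = {k: t for k, t in self.seen.items() if now - t < self.ttl}
        if event_id in self.seen:
            return False          # duplicate within the TTL window
        self.seen[event_id] = now
        return True
```

The TTL is the trade-off knob: too short and a late redelivery slips through (the offline batch job, e.g. `ROW_NUMBER() OVER (PARTITION BY event_id)` keeping row 1, catches those); too long and state grows.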
Medium · Technical
83 practiced
Outline a sequence of text cleaning steps you would apply to free-text fields for modeling: Unicode normalization, lowercasing, punctuation stripping, tokenization, stopword handling, and lemmatization. Also describe how you would preserve raw text for auditing and model explainability.
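The steps in that question can be sketched as one ordered pipeline that also keeps the raw text alongside the cleaned tokens for auditing. This is a minimal stdlib-only illustration: the stopword list is a tiny placeholder, and the "lemmatization" is a deliberately naive suffix rule standing in for a real lemmatizer such as spaCy or NLTK.

```python
import re
import unicodedata

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "is"}  # tiny illustrative list

def clean_text(raw: str) -> dict:
    """Apply cleaning steps in order, returning both the untouched raw
    text (for auditing/explainability) and the cleaned tokens."""
    text = unicodedata.normalize("NFKC", raw)            # Unicode normalization
    text = text.lower()                                   # lowercasing
    text = re.sub(r"[^\w\s]", " ", text)                  # punctuation stripping
    tokens = text.split()                                 # whitespace tokenization
    tokens = [t for t in tokens if t not in STOPWORDS]    # stopword removal
    # Crude plural stripping as a stand-in for real lemmatization
    tokens = [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]
    return {"raw": raw, "tokens": tokens}
```

Order matters: Unicode normalization should precede lowercasing and punctuation rules so that visually identical characters compare equal, and the raw field is stored untransformed so any downstream prediction can be traced back to what the user actually wrote.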
