InterviewStack.io

Data Manipulation and Transformation Questions

Encompasses techniques and best practices for cleaning, transforming, and preparing data for analysis and production systems. Candidates should be able to handle missing values, duplicates, inconsistency resolution, normalization and denormalization, data typing and casting, and validation checks. Expect discussion of writing robust code that handles edge cases such as empty datasets and null values, defensive data validation, unit and integration testing for transformations, and strategies for performance and memory efficiency. At more senior levels, expect questions on designing scalable, debuggable, and maintainable data pipelines and transformation architectures: idempotency, schema evolution, batch-versus-streaming trade-offs, observability and monitoring, versioning and reproducibility, and tool selection (SQL, pandas, Spark, or dedicated ETL frameworks).

Medium · Technical
68 practiced
Compare categorical encoding techniques (label encoding, one-hot, binary, target/mean encoding). For each, describe when it is appropriate, common pitfalls such as leakage in target encoding, and how to apply safely within cross-validation or online inference.
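One leakage-safe pattern worth knowing for this question is out-of-fold target encoding: each row's encoding is computed only from folds that row does not belong to. The sketch below is illustrative, not a library API; the function name, the smoothing scheme, and the random fold assignment are all assumptions for the example.

```python
import numpy as np
import pandas as pd

def oof_target_encode(df, col, target, n_splits=5, smoothing=10.0, seed=0):
    """Out-of-fold target (mean) encoding: each row is encoded using
    statistics computed only from the other folds, which avoids leaking
    the row's own target into its feature."""
    rng = np.random.default_rng(seed)
    fold = rng.integers(0, n_splits, size=len(df))
    global_mean = df[target].mean()
    encoded = np.full(len(df), global_mean, dtype=float)
    for k in range(n_splits):
        train = df[fold != k]
        stats = train.groupby(col)[target].agg(["mean", "count"])
        # Shrink rare categories toward the global mean to reduce variance
        smooth = (stats["mean"] * stats["count"] + global_mean * smoothing) / (
            stats["count"] + smoothing
        )
        mask = fold == k
        # Categories unseen in the training folds fall back to the global mean
        encoded[mask] = df.loc[mask, col].map(smooth).fillna(global_mean).to_numpy()
    return pd.Series(encoded, index=df.index, name=f"{col}_te")
```

For online inference, the same idea applies: fit the category statistics on training data only and ship them as a lookup table, with the global mean as the fallback for unseen categories.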
Easy · Technical
106 practiced
Explain the difference between missing values, nulls, NaN, and empty strings in tabular data. For each, give an example of how it may appear in CSV, SQL, and JSON sources, and state when you would treat it as a missing value versus a valid value for downstream analysis. Describe potential pitfalls when using automated schema or type inference.
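A quick way to see these distinctions in practice is pandas' CSV and JSON handling. The sketch below (sample data invented for illustration) shows how an empty CSV field and the token "NA" both become NaN under default inference, how `keep_default_na=False` preserves them as literal strings, and how JSON's `null` and a missing key collapse to the same NA while an empty string survives as a value.

```python
import io
import json
import pandas as pd

csv_text = "a,b\n1,\n2,NA\n3,x\n"

# Default type/NA inference: the empty field and the "NA" token both parse as NaN
df = pd.read_csv(io.StringIO(csv_text))
print(df["b"].isna().tolist())   # [True, True, False]

# Disable default NA filtering: empty fields stay "" and "NA" stays a string,
# so the pipeline, not the parser, decides what counts as missing
raw = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)
print(raw["b"].tolist())          # ['', 'NA', 'x']

# JSON: null and an absent key both become NA after normalization,
# while an explicit empty string remains a valid value
records = json.loads('[{"b": null}, {}, {"b": ""}]')
norm = pd.json_normalize(records)
print(norm["b"].isna().tolist())  # [True, True, False]
```

This is exactly the pitfall with automated inference: the parser silently merges distinct source representations unless told otherwise.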
Hard · System Design
59 practiced
Describe how to enforce and evolve schemas in a streaming pipeline using Avro/Protobuf/JSON Schema with a schema registry. Cover how to handle backward and forward incompatible changes, consumer upgrades, and requirements for replay/backfill without breaking consumers.
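The core of backward compatibility can be sketched without a registry: a new (reader) schema can decode data written with the old schema only if every field it adds carries a default, and type changes are restricted to sanctioned promotions. The checker below is a deliberately simplified illustration in the spirit of Avro's rules, not any registry's actual compatibility algorithm; the field-dict representation is an assumption for the example.

```python
def is_backward_compatible(old_fields: dict, new_fields: dict) -> bool:
    """Fields are {name: {"type": ..., "default": ...?}} mappings.
    Returns True if a reader using new_fields can decode records
    written with old_fields."""
    for name, spec in new_fields.items():
        if name not in old_fields:
            # An added field must have a default, so records written
            # before it existed can still be decoded
            if "default" not in spec:
                return False
        elif spec["type"] != old_fields[name]["type"]:
            # Real systems allow a short list of type promotions
            # (e.g. int -> long); this sketch disallows all changes
            return False
    # Removing a field is backward compatible: the reader simply
    # ignores data it no longer declares
    return True
```

Forward compatibility is the mirror image (old readers must tolerate data written with the new schema), and replay/backfill generally requires full (both-direction) compatibility across every version still present in the topic.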
Medium · System Design
60 practiced
Design a deduplication strategy for streaming events produced with at-least-once semantics. Describe how you'd implement deduplication both in a streaming engine (e.g., Flink or Spark Structured Streaming) and as an offline batch job. Include use of event IDs, windowing, watermarking, and state TTL to bound memory.
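The essence of the streaming half is keyed state with a TTL: remember each event ID just long enough to catch redeliveries, then expire it to bound memory. The class below is a single-process illustration of that idea; the names and the scan-based eviction are assumptions for the example, whereas a real engine (Flink keyed state with TTL, or `dropDuplicates` with a watermark in Spark Structured Streaming) expires state via timers rather than rescanning.

```python
import time

class StreamingDeduper:
    """Drop duplicate events by ID, with a TTL on remembered IDs so
    state (and memory) stays bounded. Illustrative only."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock        # injectable for testing
        self.seen = {}            # event_id -> first-seen timestamp

    def is_new(self, event_id: str) -> bool:
        now = self.clock()
        # Evict expired state; a real engine does this with timers,
        # not a full scan on every event
        self.seen = {k: t for k, t in self.seen.items() if now - t < self.ttl}
        if event_id in self.seen:
            return False          # duplicate within the TTL window
        self.seen[event_id] = now
        return True
```

The TTL is the trade-off knob: too short and a late redelivery slips through (the offline batch job, e.g. `ROW_NUMBER() OVER (PARTITION BY event_id)` keeping row 1, catches those); too long and state grows.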
Medium · Technical
83 practiced
Outline a sequence of text cleaning steps you would apply to free-text fields for modeling: Unicode normalization, lowercasing, punctuation stripping, tokenization, stopword handling, and lemmatization. Also describe how you would preserve raw text for auditing and model explainability.
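The steps in that question can be sketched as one ordered pipeline that also keeps the raw text alongside the cleaned tokens for auditing. This is a minimal stdlib-only illustration: the stopword list is a tiny placeholder, and the "lemmatization" is a deliberately naive suffix rule standing in for a real lemmatizer such as spaCy or NLTK.

```python
import re
import unicodedata

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "is"}  # tiny illustrative list

def clean_text(raw: str) -> dict:
    """Apply cleaning steps in order, returning both the untouched raw
    text (for auditing/explainability) and the cleaned tokens."""
    text = unicodedata.normalize("NFKC", raw)            # Unicode normalization
    text = text.lower()                                   # lowercasing
    text = re.sub(r"[^\w\s]", " ", text)                  # punctuation stripping
    tokens = text.split()                                 # whitespace tokenization
    tokens = [t for t in tokens if t not in STOPWORDS]    # stopword removal
    # Crude plural stripping as a stand-in for real lemmatization
    tokens = [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]
    return {"raw": raw, "tokens": tokens}
```

Order matters: Unicode normalization should precede lowercasing and punctuation rules so that visually identical characters compare equal, and the raw field is stored untransformed so any downstream prediction can be traced back to what the user actually wrote.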
