InterviewStack.io

Data Manipulation and Transformation Questions

Encompasses techniques and best practices for cleaning, transforming, and preparing data for analysis and production systems. Candidates should be able to handle missing values, duplicates, inconsistency resolution, normalization and denormalization, data typing and casting, and validation checks. Expect discussion of writing robust code that handles edge cases such as empty datasets and null values, defensive data validation, unit and integration testing for transformations, and strategies for performance and memory efficiency. At more senior levels, expect questions on the design of scalable, debuggable, and maintainable data pipelines and transformation architectures: idempotency, schema evolution, batch versus streaming trade-offs, observability and monitoring, versioning and reproducibility, and tool selection such as SQL, pandas, Spark, or dedicated ETL frameworks.

Easy · Technical
61 practiced
Explain the trade-offs between normalized and denormalized analytics schemas from an SRE perspective. Cover query latency, storage usage, update complexity, operational failure isolation, and maintainability. Give concrete examples where denormalization improves reliability and where it makes recovery harder.
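One way to make the query-latency versus update-complexity trade-off concrete is a minimal sketch, here using Python's built-in sqlite3 with hypothetical `customers`/`orders` tables (the schema and data are illustrative, not from the question):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Normalized: one customer row, referenced by orders via a foreign key.
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT)")
cur.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)")
cur.execute("INSERT INTO customers VALUES (1, 'EU')")
cur.executemany("INSERT INTO orders VALUES (?, ?, ?)", [(1, 1, 10.0), (2, 1, 5.0)])

# Denormalized: region copied onto every order row. Reads skip the join,
# but changing a customer's region now means rewriting many rows.
cur.execute("CREATE TABLE orders_wide (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")
cur.executemany("INSERT INTO orders_wide VALUES (?, ?, ?)", [(1, 'EU', 10.0), (2, 'EU', 5.0)])

# Same aggregate, two shapes: the normalized form needs a join at query time.
normalized = cur.execute(
    "SELECT c.region, SUM(o.amount) FROM orders o "
    "JOIN customers c ON c.id = o.customer_id GROUP BY c.region"
).fetchone()
denormalized = cur.execute(
    "SELECT region, SUM(amount) FROM orders_wide GROUP BY region"
).fetchone()
assert normalized == denormalized == ('EU', 15.0)
```

From an SRE angle, the denormalized table keeps serving correct aggregates even if the `customers` table is unavailable, but after a bad backfill the duplicated `region` values must all be repaired, which makes recovery slower.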
Easy · Technical
63 practiced
List and explain essential defensive validation checks you would implement at the ingress of a data pipeline: schema conformance, nullability, type ranges, cardinality checks, referential integrity basics, and freshness. For each check, explain how it contributes to reliability, what alerts you'd surface, and what the typical symptoms would look like if it failed.
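A possible answer skeleton for several of these checks is sketched below; the field names, types, and the 24-hour freshness threshold are illustrative assumptions, not part of any specific framework:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical expected schema for incoming records.
SCHEMA = {"user_id": int, "age": int, "event_time": str}

def validate_record(record, now=None):
    """Return a list of check failures for one incoming record."""
    now = now or datetime.now(timezone.utc)
    errors = []
    # Schema conformance: all expected keys must be present.
    missing = SCHEMA.keys() - record.keys()
    if missing:
        errors.append(f"missing fields: {sorted(missing)}")
    # Nullability and type checks per field.
    for field, typ in SCHEMA.items():
        value = record.get(field)
        if field not in record:
            continue  # already reported as missing
        if value is None:
            errors.append(f"{field} is null")
        elif not isinstance(value, typ):
            errors.append(f"{field}: got {type(value).__name__}, expected {typ.__name__}")
    # Range check: reject implausible values.
    age = record.get("age")
    if isinstance(age, int) and not (0 <= age <= 130):
        errors.append("age out of range")
    # Freshness: alert if the event is older than the assumed 24h SLO.
    ts = record.get("event_time")
    if isinstance(ts, str):
        if now - datetime.fromisoformat(ts) > timedelta(hours=24):
            errors.append("stale event")
    return errors

good = {"user_id": 1, "age": 30,
        "event_time": datetime.now(timezone.utc).isoformat()}
assert validate_record(good) == []
bad = {"user_id": None, "age": 999}
assert len(validate_record(bad)) == 3  # missing field, null, range violation
```

In an interview, each check maps to a symptom: failing schema conformance usually signals an upstream deploy, while failing freshness points at a stalled producer or backlog.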
Hard · Technical
69 practiced
Provide pseudocode for a producer that writes Avro messages to a topic using a schema registry and for a consumer that reads them. Show how the consumer handles a newer optional field added to the schema (consumer sees messages both with and without the new field) and describe how default values should be applied.
Hard · Technical
57 practiced
Explain at-least-once, at-most-once, and exactly-once semantics in streaming systems. For each semantic, give an example use case where it is acceptable, list operational failure modes to monitor, and describe how an SRE should mitigate duplicates or data loss in each case.
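A common mitigation for the at-least-once case is an idempotent consumer. This hypothetical sketch deduplicates on a message id before applying the side effect (in production the seen-id set would be a durable store, and recording the id after the effect still leaves a small double-apply window on crash):

```python
processed_ids = set()   # stand-in for a durable dedup store
balance = 0

def handle(message):
    """Apply a deposit effectively once even if the broker redelivers it."""
    global balance
    if message["id"] in processed_ids:
        return  # duplicate redelivery: skip the side effect
    balance += message["amount"]
    processed_ids.add(message["id"])  # recorded only after the effect succeeds

# At-least-once: message m1 is delivered twice, e.g. after a consumer restart.
m1 = {"id": "m1", "amount": 100}
for delivery in (m1, m1):
    handle(delivery)
assert balance == 100  # the duplicate was suppressed
```

For monitoring, a rising dedup-hit rate is itself a useful signal: it often indicates consumer crashes or rebalance churn upstream.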
Easy · Technical
116 practiced
You must transform and aggregate a 10GB CSV on a machine with 4GB RAM using pandas. Describe a practical approach to perform transformations and aggregations without running out of memory. Mention chunked reading, explicit dtypes, early filtering, disk-backed options, and alternatives if pandas is unsuitable.
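The chunked-reading approach can be sketched as follows; the CSV is simulated in memory and the column names (`category`, `amount`) are illustrative:

```python
import io
import pandas as pd

# Simulated large CSV: 1000 rows alternating between two categories.
csv_data = io.StringIO(
    "category,amount\n" + "\n".join(f"{'ab'[i % 2]},{i}" for i in range(1000))
)

partials = []
# Read in fixed-size chunks so only one chunk is resident at a time;
# an explicit dtype avoids costly inference and oversized object columns.
for chunk in pd.read_csv(csv_data, chunksize=100, dtype={"amount": "int64"}):
    chunk = chunk[chunk["amount"] >= 0]  # filter early, before aggregating
    partials.append(chunk.groupby("category")["amount"].sum())

# Combine the per-chunk partial aggregates into the final result.
totals = pd.concat(partials).groupby(level=0).sum()
assert totals["a"] == sum(i for i in range(1000) if i % 2 == 0)
```

The key idea is that sums (and counts, so also means) compose across chunks; aggregations that do not compose, such as exact medians, need a disk-backed approach or a tool like DuckDB or Spark instead.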
