Data Transformation and Preparation Questions

Focuses on the technical skills and judgement required to connect to data sources, clean and shape data, and prepare datasets for analysis and visualization. Includes identifying necessary transformations such as calculations, aggregations, filtering, joins, and type conversions; deciding whether to perform transformations in the business intelligence tool or in the data warehouse or database layer; designing efficient data models and extract, transform, load (ETL) workflows; ensuring data quality, lineage, and freshness; applying performance optimization techniques such as incremental refresh and pushdown processing; and familiarity with tools and features such as Power Query in Power BI, Tableau's data preparation capabilities, and SQL for database-level transformations. Also covers documentation, reproducibility, and testing of data preparation pipelines.

Medium / Technical

You must join a 1B-row fact table with a 10k-row dimension to enrich features. Discuss join strategies in distributed engines (Spark, Snowflake): broadcast join vs shuffle join, pros and cons, when to use each, and how to detect and handle data skew that can cause stragglers.
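One way to ground an answer is a small single-process sketch of the ideas involved. The code below is not Spark or Snowflake code; it only imitates the logic in plain Python. It shows a broadcast-style hash join (the small dimension is held in memory and the large side is streamed, so the large side is never shuffled), a simple skew metric, and key salting for the shuffle-join case. All function names here are hypothetical and chosen for illustration.

```python
from collections import Counter
import random


def broadcast_hash_join(fact_rows, dim_rows, key):
    """Inner join that builds an in-memory hash map of the small side.

    In a distributed engine the small table would be copied to every
    worker, so the large table stays where it is.
    """
    dim_by_key = {row[key]: row for row in dim_rows}
    joined = []
    for row in fact_rows:
        match = dim_by_key.get(row[key])
        if match is not None:  # drop fact rows with no dimension match
            joined.append({**row, **match})
    return joined


def skew_ratio(keys):
    """Largest key frequency divided by the mean key frequency."""
    counts = Counter(keys)
    mean = sum(counts.values()) / len(counts)
    return max(counts.values()) / mean


def salt_key(key, num_salts, rng):
    """Large side: attach a random salt so one hot key spreads over
    num_salts partitions in a shuffle join."""
    return (key, rng.randrange(num_salts))


def explode_dim_key(key, num_salts):
    """Small side: replicate the key once per salt value so every salted
    fact key still finds its match."""
    return [(key, s) for s in range(num_salts)]
```

With a 10k-row dimension the broadcast path is usually the natural choice, since the hash map fits comfortably in worker memory. Salting matters when both sides are too large to broadcast and a few keys dominate, which a ratio far above 1 from `skew_ratio` would suggest.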

Medium / Technical

During data preparation you find severe class imbalance (1% positive). Describe practical methods to address this imbalance before training: sampling strategies, class weighting, and synthetic sample generation (SMOTE). Explain how to implement and validate these approaches so that production inference remains reliable.
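A minimal sketch of two of these methods, using only the standard library. It assumes the common "balanced" weighting heuristic (weight = n_samples / (n_classes * class_count)) and random undersampling of the majority class. The function names are hypothetical. The last function applies the odds-scaling correction that maps a probability from a model trained on undersampled negatives back to the original base rate, which is one way to keep production scores meaningful.

```python
import random
from collections import Counter


def balanced_class_weights(labels):
    """weight_c = n_samples / (n_classes * count_c)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}


def undersample_majority(rows, labels, ratio, seed=0):
    """Keep every minority row and at most ratio * minority_count
    randomly chosen rows from the other classes.

    Apply this to the training split only; validation and test data
    should keep the original class balance.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    minority = min(counts, key=counts.get)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    keep = min(len(majority_idx), int(ratio * len(minority_idx)))
    chosen = sorted(minority_idx + rng.sample(majority_idx, keep))
    return [rows[i] for i in chosen], [labels[i] for i in chosen]


def correct_probability(p_sampled, beta):
    """Undo the base-rate shift caused by keeping a fraction beta of the
    negatives: true odds = beta * sampled odds."""
    return beta * p_sampled / (beta * p_sampled - p_sampled + 1)
```

Whichever method is used, evaluating on an untouched validation set with the original 1% positive rate is what shows whether the change helps at inference time.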

Easy / Technical

When preparing ML datasets, describe the criteria you use to decide whether to drop rows, impute missing values, or engineer a special 'missing' category. Give concrete examples of feature types where each choice is appropriate and explain how to make the approach reproducible in a pipeline.
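The reproducibility part of this question can be illustrated with a short sketch. It assumes missing values are represented as `None`, uses a median fill for numeric features plus a missing-indicator flag, and uses a reserved token for categorical features. The class and function names are hypothetical. The key point is that the fill value is learned once from training data and then reused unchanged at inference time.

```python
from statistics import median


class MedianImputer:
    """Learns the median on training data and reuses it afterwards."""

    def fit(self, values):
        observed = [v for v in values if v is not None]
        self.fill_value_ = median(observed)
        return self

    def transform(self, values):
        # Return both the filled values and a 0/1 flag per row, so the
        # model can still see that the value was originally missing.
        filled = [self.fill_value_ if v is None else v for v in values]
        was_missing = [int(v is None) for v in values]
        return filled, was_missing


def fill_missing_category(values, token="__missing__"):
    """Treat absence as its own category for categorical features."""
    return [token if v is None else v for v in values]
```

Persisting the fitted imputer alongside the model is what makes the same transformation repeatable across training runs and serving.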

Medium / Technical

How would you design schema evolution for a nightly feature table that occasionally receives new columns from upstream jobs, ensuring backward compatibility for both model training and online serving? Describe migration steps, compatibility rules, and tests you would implement.
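One possible set of compatibility rules can be written as a small check that runs before a new schema is accepted. The sketch below assumes schemas are plain dicts mapping column name to `{"type": ..., "nullable": ...}` and assumes an additive-only policy: no removed columns, no type changes, and new columns must be nullable. Both the representation and the function names are hypothetical.

```python
def check_backward_compatible(old_schema, new_schema):
    """Return a list of violations; an empty list means the change is
    additive only under the assumed policy."""
    problems = []
    for name, spec in old_schema.items():
        if name not in new_schema:
            problems.append(f"removed column: {name}")
        elif new_schema[name]["type"] != spec["type"]:
            problems.append(f"type changed: {name}")
    for name, spec in new_schema.items():
        if name not in old_schema and not spec["nullable"]:
            problems.append(f"new column must be nullable: {name}")
    return problems


def project_to_contract(row, contract_columns):
    """Readers select only the columns they were built against, so
    unexpected new columns are ignored rather than breaking them."""
    return {c: row.get(c) for c in contract_columns}
```

A check like this can serve as one of the tests the question asks for, for example as a gate in the nightly job that fails the run when the returned list is not empty.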

Hard / System Design

Design a system that automatically decides when to retrain production models. Inputs include data drift metrics, model performance metrics, label availability, and compute budget. Describe thresholds, orchestration, human-in-the-loop options, canary retrains, and mechanisms to avoid retraining thrash.
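The decision logic at the core of such a system can be sketched as a pure function. The thresholds, field names, and the order of the checks below are illustrative assumptions, not recommended values. Thrash is limited in two ways: a cooldown after each retrain, and a requirement that drift exceed its threshold for several consecutive checks before it counts as a trigger.

```python
from dataclasses import dataclass


@dataclass
class RetrainPolicy:
    # All defaults are placeholder values for illustration.
    drift_threshold: float = 0.2
    perf_drop_threshold: float = 0.05
    min_new_labels: int = 1000
    cooldown_hours: float = 24.0
    breaches_required: int = 3


def decide_retrain(policy, drift_history, perf_drop, new_labels,
                   hours_since_retrain, budget_remaining, retrain_cost):
    """Return (should_retrain, reason)."""
    # Guards first: these block a retrain regardless of the signals.
    if hours_since_retrain < policy.cooldown_hours:
        return False, "cooldown"
    if budget_remaining < retrain_cost:
        return False, "over budget"
    if new_labels < policy.min_new_labels:
        return False, "too few labels"

    # Drift must be above threshold for the last N checks in a row.
    recent = drift_history[-policy.breaches_required:]
    sustained_drift = (len(recent) == policy.breaches_required
                       and all(d > policy.drift_threshold for d in recent))

    if perf_drop > policy.perf_drop_threshold:
        return True, "performance drop"
    if sustained_drift:
        return True, "sustained drift"
    return False, "healthy"
```

Returning a reason string alongside the decision makes it easier to route borderline cases to a human reviewer and to audit why a retrain did or did not happen.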
