InterviewStack.io

Data Transformation and Preparation Questions

Focuses on the technical skills and judgement required to connect to data sources, clean and shape data, and prepare datasets for analysis and visualization. Topics include: identifying necessary transformations such as calculations, aggregations, filtering, joins, and type conversions; deciding whether to perform transformations in the business intelligence tool or in the data warehouse or database layer; designing efficient data models and extract-transform-load (ETL) workflows; ensuring data quality, lineage, and freshness; applying performance optimization techniques such as incremental refresh and pushdown processing; and familiarity with tools and features such as Power Query in Power BI, Tableau's data preparation capabilities, and SQL for database-level transformations. Also covers documentation, reproducibility, and testing of data preparation pipelines.

Easy · Technical
Explain the difference between normalization (min-max scaling) and standardization (z-score). For which ML algorithms or feature distributions would you prefer one over the other? Include practical advice about scaling inside cross-validation folds and persisting scalers for production.
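One way to answer: a minimal pure-Python sketch of both scalers, showing the key practical point the question asks about — fit the statistics on the training fold only, then reuse them on validation data and in production. The function names and the guard for constant features are illustrative choices, not a library API.

```python
# Min-max scaling maps values into [0, 1] using the training min/max;
# z-score standardization centers on the training mean and divides by
# the training standard deviation. Fit on the training fold ONLY.

def fit_minmax(values):
    """Learn (min, max) from training data."""
    return min(values), max(values)

def apply_minmax(values, lo, hi):
    span = (hi - lo) or 1.0  # guard against constant features
    return [(v - lo) / span for v in values]

def fit_zscore(values):
    """Learn (mean, std) from training data."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, (var ** 0.5) or 1.0

def apply_zscore(values, mean, std):
    return [(v - mean) / std for v in values]

train = [1.0, 2.0, 3.0, 4.0]
lo, hi = fit_minmax(train)
# Test values outside the training range fall outside [0, 1] -- expected,
# and a reason to prefer z-score for heavy-tailed distributions.
scaled = apply_minmax([2.5, 5.0], lo, hi)
```

Rule of thumb: min-max suits bounded inputs and distance- or gradient-sensitive models (k-NN, neural nets); z-score is safer with outliers and for models assuming roughly centered features. Tree-based models generally need neither. Persist the fitted `(lo, hi)` or `(mean, std)` alongside the model so serving applies the identical transform.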
Hard · System Design
Outline an idempotent ETL design for ingesting events from Kafka into a feature table used for training and serving. Requirements: deduplicate retries, handle out-of-order events, provide at-least-once or exactly-once semantics where possible, and support efficient compaction. Explain the streaming framework choice and give pseudocode for deduplication logic.
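A minimal Python sketch of just the deduplication step, under stated assumptions: each event carries an `event_id` and an `event_time`, and an in-memory ordered dict stands in for the keyed state store a streaming framework would provide (e.g. RocksDB-backed state in Flink or Kafka Streams). The TTL bounds state growth; retries past the window are treated as new events, which is the usual at-least-once trade-off.

```python
from collections import OrderedDict

class Deduplicator:
    """Drop retried events seen within a TTL window, keyed by event_id."""

    def __init__(self, ttl_seconds=3600):
        self.ttl = ttl_seconds
        self.seen = OrderedDict()  # event_id -> event_time of first sighting

    def process(self, event):
        """Return the event if unseen inside the TTL window, else None."""
        eid, etime = event["event_id"], event["event_time"]
        # Evict expired entries from the front (oldest-first insertion order).
        while self.seen and next(iter(self.seen.values())) < etime - self.ttl:
            self.seen.popitem(last=False)
        if eid in self.seen:
            return None  # duplicate retry: drop, making the sink idempotent
        self.seen[eid] = etime
        return event
```

Out-of-order events are handled upstream of this step by windowing on event time with an allowed lateness; exactly-once semantics then come from pairing this dedup state with transactional sinks or upsert-by-key writes, so replays converge to the same table.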
Hard · System Design
Design a system that automatically decides when to retrain production models. Inputs include data drift metrics, model performance metrics, label availability, and compute budget. Describe thresholds, orchestration, human-in-the-loop options, canary retrains, and mechanisms to avoid retraining thrash.
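A hypothetical decision rule combining the inputs the question lists. Every threshold below (drift cutoff, AUC drop, label fraction, cooldown hours) is a placeholder to be tuned per system; the cooldown is one concrete mechanism against retraining thrash.

```python
def should_retrain(drift_score, auc_drop, labeled_fraction,
                   budget_remaining, hours_since_last_retrain,
                   cooldown_hours=24):
    """Gate checks first, then trigger conditions. All thresholds illustrative."""
    if hours_since_last_retrain < cooldown_hours:
        return False  # cooldown prevents retraining thrash
    if budget_remaining <= 0:
        return False  # no compute budget left this period
    if labeled_fraction < 0.5:
        return False  # too few fresh labels to train reliably
    # Trigger on strong input drift OR measurable performance decay.
    return drift_score > 0.3 or auc_drop > 0.05
```

In a fuller design this function's output would open a ticket for human approval (or trigger a canary retrain that must beat the incumbent on a holdout slice before promotion) rather than redeploying directly.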
Easy · Technical
Define data freshness and data latency in the context of ML feature pipelines. Give concrete examples showing how freshness requirements differ for batch training, daily features, and online inference, and explain how those requirements influence pipeline architecture.
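A small sketch of how the distinction might be checked in code, assuming freshness is measured as "now minus the newest event time a consumer can read". The SLA values per use case are illustrative, chosen only to show how requirements diverge across the three scenarios the question names.

```python
from datetime import datetime, timedelta, timezone

# Illustrative freshness SLAs: batch training tolerates day-old data,
# daily features a few hours, online inference only seconds.
FRESHNESS_SLA = {
    "batch_training": timedelta(days=1),
    "daily_features": timedelta(hours=6),
    "online_inference": timedelta(seconds=5),
}

def is_fresh(use_case, last_event_time, now=None):
    """True if the newest readable event meets the use case's freshness SLA."""
    now = now or datetime.now(timezone.utc)
    return now - last_event_time <= FRESHNESS_SLA[use_case]
```

The same one-hour-old feature value passes the batch and daily SLAs but fails online inference, which is why the online path typically needs a streaming pipeline and a low-latency feature store while batch training can read yesterday's warehouse snapshot.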
Medium · Technical
Given tables users(id INT, signup_date DATE) and transactions(id, user_id, amount DECIMAL, occurred_at TIMESTAMP), write an ANSI SQL query that outputs a user-level monthly feature table containing: last_transaction_date, transactions_count_30d, avg_amount_90d, days_since_signup. Use window functions and explain how you handle users with no transactions.
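Not the ANSI SQL answer itself, but a pure-Python sketch of the same feature logic, assuming the two tables arrive as lists of dicts and features are computed as of a reference date. The left-join behavior the question asks about shows up as keeping every user and emitting `None` (SQL `NULL`) where no transactions exist.

```python
from datetime import date, datetime

def monthly_features(users, transactions, as_of):
    """Per-user features as of `as_of`: mirrors a LEFT JOIN + window logic.

    users: [{"id": ..., "signup_date": date}, ...]
    transactions: [{"user_id": ..., "amount": float, "occurred_at": datetime}, ...]
    """
    by_user = {}
    for t in transactions:
        by_user.setdefault(t["user_id"], []).append(t)

    rows = []
    for u in users:
        txs = by_user.get(u["id"], [])  # left join: keep users with no transactions
        tx_dates = [t["occurred_at"].date() for t in txs]
        last_30 = [t for t in txs if (as_of - t["occurred_at"].date()).days <= 30]
        last_90 = [t for t in txs if (as_of - t["occurred_at"].date()).days <= 90]
        rows.append({
            "user_id": u["id"],
            "last_transaction_date": max(tx_dates) if tx_dates else None,
            "transactions_count_30d": len(last_30),
            "avg_amount_90d": (sum(t["amount"] for t in last_90) / len(last_90)
                               if last_90 else None),
            "days_since_signup": (as_of - u["signup_date"]).days,
        })
    return rows
```

In SQL the equivalent shape is `users LEFT JOIN transactions`, `MAX(occurred_at)` and conditional `COUNT`/`AVG` over the 30- and 90-day windows, with `COALESCE` or `NULL` handling for transaction-less users.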