InterviewStack.io

Data Transformation and Preparation Questions

Covers the technical skills and judgement required to connect to data sources, clean and shape data, and prepare datasets for analysis and visualization. Topics include identifying necessary transformations such as calculations, aggregations, filtering, joins, and type conversions; deciding whether to perform transformations in the business intelligence tool or in the data warehouse or database layer; designing efficient data models and extract, transform, load (ETL) workflows; ensuring data quality, lineage, and freshness; and applying performance optimization techniques such as incremental refresh and pushdown processing. Relevant tools and features include Power Query in Power BI, Tableau's data preparation capabilities, and SQL for database-level transformations. Also covers documentation, reproducibility, and testing of data preparation pipelines.

Easy · Technical
94 practiced
Given an events table: events(event_id BIGINT, user_id INT, event_type VARCHAR, occurred_at TIMESTAMP), write an ANSI SQL query that computes daily active users (unique user_id per day) for the last 30 days. Also explain how you'd handle late-arriving events that should be attributed to prior days.
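One way to reason about the daily active users question is to prototype the logic outside the database first. The sketch below is plain Python, not the ANSI SQL the question asks for, and the function name `daily_active_users` and its parameters are hypothetical. It assumes events are attributed by `occurred_at` (event time) rather than by arrival time, so rerunning it over the trailing window picks up late-arriving events for prior days.

```python
from collections import defaultdict
from datetime import date, datetime, timedelta


def daily_active_users(events, as_of, window_days=30):
    """Count distinct users per calendar day for a trailing window.

    events: iterable of (user_id, occurred_at) pairs, occurred_at a datetime.
    as_of: the last date (inclusive) of the window.
    Returns {date: distinct_user_count}, including days with zero activity.
    """
    start = as_of - timedelta(days=window_days - 1)
    users_by_day = defaultdict(set)
    for user_id, occurred_at in events:
        # Attribute by event time, so a late arrival lands on its true day.
        day = occurred_at.date()
        if start <= day <= as_of:
            users_by_day[day].add(user_id)

    result = {}
    for offset in range(window_days):
        day = start + timedelta(days=offset)
        result[day] = len(users_by_day[day]) if day in users_by_day else 0
    return result
```

Recomputing the whole trailing window on each run is the simplest way to absorb late events; an incremental variant would recompute only the days touched by newly arrived rows.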
Medium · Technical
89 practiced
During data preparation you find severe class imbalance (1% positive). Describe practical methods to address this imbalance before training: sampling strategies, class weighting, synthetic sample generation (SMOTE), and how to implement and validate these approaches such that production inference remains reliable.
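Two of the methods named in the question, class weighting and majority undersampling, can be sketched in a few lines. This is a minimal illustration under stated assumptions: labels are binary with 1 as the positive class, the weights use the common "balanced" heuristic `n_samples / (n_classes * class_count)`, and the helper names are hypothetical. Resampling would normally be applied to the training split only, so that validation and test data keep the production class ratio.

```python
import random
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency.

    Uses n_samples / (n_classes * class_count), so a rare class
    gets a proportionally larger weight in the loss.
    """
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def undersample_negatives(labels, neg_per_pos=1, seed=0):
    """Return sorted row indices keeping all positives and a random
    subset of negatives (neg_per_pos negatives per positive)."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n_keep = min(len(neg), neg_per_pos * len(pos))
    return sorted(pos + rng.sample(neg, n_keep))
```

With 1% positives, the weighting approach leaves the data untouched, while undersampling shifts the training prior and usually calls for recalibrating predicted probabilities before serving.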
Medium · Technical
94 practiced
You ingest streaming event logs and must compute per-user session features (session length and events per session) using PySpark. Describe the transformations, how to define sessionization and session timeouts, and outline Spark Structured Streaming pseudocode for computing these features while handling late events.
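The core of gap-based sessionization can be illustrated without a cluster. The sketch below is plain batch Python rather than Spark Structured Streaming, and the function name `sessionize` and the 30-minute default timeout are assumptions made for illustration. It sorts by user and event time, which is how a batch job tolerates out-of-order input; a streaming job would instead need a watermark to bound how late an event may arrive.

```python
from datetime import datetime, timedelta


def sessionize(events, timeout=timedelta(minutes=30)):
    """Group (user_id, timestamp) events into sessions.

    A new session starts when the gap since the user's previous event
    exceeds `timeout`. Returns a list of dicts with user_id, start, end,
    events (count), and length_seconds.
    """
    sessions = []
    current = {}  # user_id -> that user's most recent session
    # Sorting by (user, event time) makes arrival order irrelevant.
    for user_id, ts in sorted(events, key=lambda e: (e[0], e[1])):
        session = current.get(user_id)
        if session is None or ts - session["end"] > timeout:
            session = {"user_id": user_id, "start": ts, "end": ts, "events": 0}
            sessions.append(session)
            current[user_id] = session
        session["end"] = ts
        session["events"] += 1

    for session in sessions:
        session["length_seconds"] = (session["end"] - session["start"]).total_seconds()
    return sessions
```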
Hard · Technical
144 practiced
Propose an algorithm to deduplicate user events at petabyte scale where events may be retried and carry different IDs but share a fingerprint (user_id, event_hash, timestamp). Address memory constraints, parallelization, and correctness guarantees. Provide high-level pseudocode and discuss trade-offs between exact and probabilistic approaches.
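The exact approach to this question usually rests on one idea: hash-partition by fingerprint so that all copies of an event land in the same partition, then deduplicate each partition independently with bounded memory. The sketch below shows that idea in miniature on in-memory dictionaries. The field names follow the fingerprint in the question; the function names and the partition count are hypothetical, and a real system would spill partitions to disk or run them on separate workers.

```python
import hashlib
from collections import defaultdict


def fingerprint(event):
    """The dedup key from the question: (user_id, event_hash, timestamp)."""
    return (event["user_id"], event["event_hash"], event["timestamp"])


def partition_of(fp, num_partitions):
    """Stable partition assignment (unlike built-in hash(), which is
    salted per process for strings)."""
    digest = hashlib.sha256(repr(fp).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def deduplicate(events, num_partitions=4):
    """Keep the first event seen for each fingerprint."""
    partitions = defaultdict(list)
    for event in events:
        partitions[partition_of(fingerprint(event), num_partitions)].append(event)

    kept = []
    # Each partition is independent, so this loop could run in parallel;
    # the `seen` set only has to hold one partition's fingerprints.
    for pid in sorted(partitions):
        seen = set()
        for event in partitions[pid]:
            fp = fingerprint(event)
            if fp not in seen:
                seen.add(fp)
                kept.append(event)
    return kept
```

A probabilistic variant would swap the per-partition set for a Bloom filter, trading a small rate of wrongly dropped events for a much smaller memory footprint.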
Medium · Technical
92 practiced
How would you design schema evolution for a nightly feature table that occasionally receives new columns from upstream jobs, ensuring backward compatibility for both model training and online serving? Describe migration steps, compatibility rules, and tests you would implement.
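One test that often comes up for this question is an automated compatibility check run before the nightly table is published. The sketch below assumes a simple additive-only policy: existing columns may not be removed or change type, and new columns must be nullable so that older readers and older rows remain valid. The schema representation (a dict of column name to `type` and `nullable`) and the function name are hypothetical.

```python
def check_backward_compatible(old_schema, new_schema):
    """Return a list of violations of an additive-only evolution policy.

    Schemas are dicts: {column_name: {"type": str, "nullable": bool}}.
    An empty list means the new schema is considered compatible.
    """
    problems = []
    for column, spec in old_schema.items():
        if column not in new_schema:
            problems.append(f"removed column: {column}")
        elif new_schema[column]["type"] != spec["type"]:
            problems.append(
                f"type changed: {column} "
                f"({spec['type']} -> {new_schema[column]['type']})"
            )
    for column, spec in new_schema.items():
        # New columns must be nullable so rows written before the change
        # still satisfy the schema.
        if column not in old_schema and not spec.get("nullable", False):
            problems.append(f"new column must be nullable: {column}")
    return problems
```

Running a check like this in CI against the last published schema lets the pipeline fail fast instead of breaking training jobs or online feature lookups downstream.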
