Data Transformation and Preparation Questions

This topic covers the technical skills and judgement required to connect to data sources, clean and shape data, and prepare datasets for analysis and visualization. It includes identifying necessary transformations such as calculations, aggregations, filtering, joins, and type conversions; deciding whether to perform transformations in the business intelligence (BI) tool or in the data warehouse or database layer; designing efficient data models and extract, transform, load (ETL) workflows; ensuring data quality, lineage, and freshness; applying performance optimization techniques such as incremental refresh and pushdown processing; and working with tools and features such as Power BI Power Query, Tableau data preparation capabilities, and SQL for database-level transformations. It also covers documentation, reproducibility, and testing of data preparation pipelines.

Easy | Behavioral

71 practiced

Behavioral: Tell me about a time when you discovered a significant production data quality issue that affected reports or customer-facing metrics. Use the STAR method: describe the Situation, Task, Actions you took to contain and fix the problem (both immediate and long-term), and the measurable Result. Highlight how you communicated with stakeholders and what monitoring or preventive changes you implemented afterward.

Hard | Technical

68 practiced

Design and write a MERGE (or equivalent) statement that implements SCD Type 2 in Snowflake or BigQuery for a high-cardinality customer dimension. Requirements: preserve history with start/end timestamps, a current_flag, and surrogate keys; support micro-batches without long-running locks; explain the partitioning and clustering strategies and how to expire old history rows for GDPR retention.
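As a starting point for an answer, here is a minimal sketch of the SCD Type 2 row semantics, written in Python against an in-memory SQLite database so it can be run anywhere. SQLite has no MERGE, so the sketch uses the equivalent two-step pattern (close out changed rows, then insert new versions); in Snowflake or BigQuery the same logic would typically be expressed as a MERGE against a staged batch. The table name dim_customer, the single tracked attribute email, and the helper names are assumptions made for illustration, and partitioning, clustering, and GDPR expiry are not covered.

```python
import sqlite3


def create_dim(conn):
    """Create a hypothetical customer dimension with SCD Type 2 columns."""
    conn.execute("""
        CREATE TABLE dim_customer (
            customer_sk  INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
            customer_id  TEXT NOT NULL,                      -- business key
            email        TEXT NOT NULL,                      -- tracked attribute
            valid_from   TEXT NOT NULL,
            valid_to     TEXT,                               -- NULL while current
            current_flag INTEGER NOT NULL
        )""")


def apply_scd2_batch(conn, batch, load_ts):
    """Apply one micro-batch of (customer_id, email) rows as SCD Type 2 changes."""
    cur = conn.cursor()
    cur.execute(
        "CREATE TEMP TABLE IF NOT EXISTS stage "
        "(customer_id TEXT PRIMARY KEY, email TEXT NOT NULL)"
    )
    cur.execute("DELETE FROM stage")
    cur.executemany("INSERT INTO stage VALUES (?, ?)", batch)

    # Step 1: close out current rows whose tracked attribute changed.
    cur.execute("""
        UPDATE dim_customer
           SET valid_to = ?, current_flag = 0
         WHERE current_flag = 1
           AND EXISTS (SELECT 1 FROM stage s
                        WHERE s.customer_id = dim_customer.customer_id
                          AND s.email <> dim_customer.email)
    """, (load_ts,))

    # Step 2: insert a new current row for every staged customer that no
    # longer has one (brand-new customers and those closed out in step 1).
    cur.execute("""
        INSERT INTO dim_customer
               (customer_id, email, valid_from, valid_to, current_flag)
        SELECT s.customer_id, s.email, ?, NULL, 1
          FROM stage s
         WHERE NOT EXISTS (SELECT 1 FROM dim_customer d
                            WHERE d.customer_id = s.customer_id
                              AND d.current_flag = 1)
         ORDER BY s.customer_id
    """, (load_ts,))
    conn.commit()
```

Because unchanged rows match neither step, replaying the same micro-batch leaves the dimension untouched, which is the property that makes retries safe.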

Medium | Technical

144 practiced

Implement a PySpark Structured Streaming job (DataFrame API) that performs incremental ingestion of event data from Kafka, uses event-time watermarking to tolerate up to 1 hour of late events, deduplicates by event_id, and writes the deduplicated events into a Delta table with idempotent upserts. Provide a code skeleton and explain how you ensure exactly-once or at-least-once semantics and idempotency.
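Before writing the PySpark job, it can help to pin down the semantics being asked for. The sketch below is a plain-Python simulation, not PySpark code: it models a 1-hour event-time watermark, deduplication state keyed by event_id, and an insert-if-absent write that stands in for a keyed upsert into the target table. The class DedupSink and its fields are hypothetical names, and the simulation assumes the watermark used for a batch is derived from event times seen in earlier batches.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Dict, List, Optional


@dataclass
class DedupSink:
    """Simulates watermarking, dedup by event_id, and idempotent writes."""
    allowed_lateness: timedelta = timedelta(hours=1)
    max_event_time: Optional[datetime] = None
    seen: Dict[str, datetime] = field(default_factory=dict)  # dedup state
    table: Dict[str, dict] = field(default_factory=dict)     # target, keyed by event_id

    def watermark(self) -> Optional[datetime]:
        if self.max_event_time is None:
            return None
        return self.max_event_time - self.allowed_lateness

    def process_batch(self, events: List[dict]) -> int:
        """Process one micro-batch and return the number of rows written."""
        wm = self.watermark()  # based on batches processed so far
        written = 0
        for event in events:
            if wm is not None and event["event_time"] < wm:
                continue  # later than the allowed lateness: dropped
            if event["event_id"] in self.seen:
                continue  # duplicate within the retained state
            self.seen[event["event_id"]] = event["event_time"]
            # Insert only when the key is absent, so replays are harmless
            # even after the dedup state for that key has been evicted.
            if event["event_id"] not in self.table:
                self.table[event["event_id"]] = event
                written += 1

        # Advance the maximum event time, then evict state behind the watermark.
        batch_max = max((e["event_time"] for e in events), default=None)
        if batch_max is not None and (
            self.max_event_time is None or batch_max > self.max_event_time
        ):
            self.max_event_time = batch_max
        new_wm = self.watermark()
        if new_wm is not None:
            self.seen = {k: t for k, t in self.seen.items() if t >= new_wm}
        return written
```

The two layers are deliberate: the bounded dedup state keeps memory in check, while the keyed write is what makes an at-least-once source safe to replay.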

Medium | Technical

83 practiced

You are asked to design a reproducible documentation standard for transformation pipelines: what should be documented (schema contracts, transformation logic, test cases, SLAs), where to store it (catalog, repo, pipeline metadata), and how to ensure docs remain up to date (automation hooks, pre-merge checks). Provide a small example README template for a dataset.
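One way to make the "pre-merge checks" part concrete is a small script that fails the build when a dataset README is missing a required section. The following is a minimal sketch in Python, assuming READMEs use Markdown-style "#" headings; the section names in REQUIRED_SECTIONS and the function name missing_sections are illustrative choices, not an established standard.

```python
import re

# Hypothetical set of sections every dataset README must contain.
REQUIRED_SECTIONS = (
    "Owner",
    "Schema contract",
    "Transformation logic",
    "Tests",
    "SLA",
)


def missing_sections(readme_text, required=REQUIRED_SECTIONS):
    """Return the required section names that have no matching heading."""
    headings = {
        match.group(1).strip().lower()
        for match in re.finditer(
            r"^#+[ \t]+(.+?)[ \t]*$", readme_text, flags=re.MULTILINE
        )
    }
    return [name for name in required if name.lower() not in headings]
```

A CI job could call missing_sections on each changed README and block the merge when the returned list is non-empty.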

Medium | Technical

91 practiced

Design a partitioning and file-sizing strategy for a 1 PB clickstream dataset stored on S3 with Delta Lake and Spark. Queries frequently filter by event_date and product_id and some analyses need fast scans across all products for a single day. Recommend partition keys, target file sizes, compaction strategy, and how to handle small files and hot partitions.
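A quick back-of-the-envelope calculation is a useful way to open an answer, because it shows how the choice of partition keys drives file sizes. The sketch below is plain Python arithmetic under stated assumptions: 1 PB is taken as 10**15 bytes, the dataset is assumed to span 1,000 days, the target file size is assumed to be 1 GB, and 10,000 distinct product_id values are assumed. None of these figures come from the question itself.

```python
import math


def files_per_partition(total_bytes, num_partitions, target_file_bytes):
    """Return (bytes per partition, files needed at the target file size).

    Assumes data is spread evenly across partitions, which real clickstream
    data rarely is; hot partitions will be larger than this average.
    """
    per_partition = total_bytes // num_partitions
    return per_partition, math.ceil(per_partition / target_file_bytes)


PETABYTE = 10**15
GIGABYTE = 10**9
ASSUMED_DAYS = 1_000        # assumed retention window
ASSUMED_PRODUCTS = 10_000   # assumed number of distinct product_id values

# Partitioning by event_date only.
by_date = files_per_partition(PETABYTE, ASSUMED_DAYS, GIGABYTE)

# Partitioning by event_date and product_id.
by_date_and_product = files_per_partition(
    PETABYTE, ASSUMED_DAYS * ASSUMED_PRODUCTS, GIGABYTE
)
```

Under these assumptions, partitioning by event_date alone gives about 1 TB per day, or roughly 1,000 files of 1 GB each, while adding product_id as a second partition key shrinks the average partition to about 100 MB, well below the target file size. That is the usual argument for partitioning by date and using clustering or data skipping on product_id rather than partitioning on it.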
