
Data Processing and Transformation Questions

Focuses on algorithmic and engineering approaches to transforming and cleaning data at scale. Includes deduplication strategies, parsing and normalizing unstructured or semi-structured data, handling missing or inconsistent values, incremental and chunked processing for large datasets, batch-versus-streaming trade-offs, state management, efficient memory and compute usage, idempotency and error handling, and techniques for scaling and parallelizing transformation pipelines. Interviewers may assess problem solving, choice of algorithms and data structures, and pragmatic design for reliability and performance.

Hard · Technical
You observe severe skew when joining a large user profile table with an event stream for feature enrichment, causing hotspotting and slowdowns. Propose concrete strategies at the data partitioning, pipeline, and algorithmic levels to mitigate skew while preserving correctness (e.g., salting, broadcast joins, partial pre-aggregation, sampling). Explain trade-offs.
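
For reference, a minimal sketch of the salting idea in PySpark. The DataFrame names (events, profiles), the join key user_id, and the salt count are illustrative assumptions, not part of the question:

```python
from pyspark.sql import functions as F

NUM_SALTS = 16  # tune to the observed skew

# Spread each hot user_id across NUM_SALTS sub-keys on the large (event) side.
events_salted = events.withColumn(
    "salt", (F.rand() * NUM_SALTS).cast("int")
)

# Replicate each profile row once per salt value so every salted event
# partition still finds its matching profile.
profiles_salted = profiles.withColumn(
    "salt", F.explode(F.array(*[F.lit(i) for i in range(NUM_SALTS)]))
)

enriched = (
    events_salted
    .join(profiles_salted, on=["user_id", "salt"], how="left")
    .drop("salt")
)
```

The trade-off to surface in an answer: the profile side grows by a factor of NUM_SALTS, so salting pairs naturally with broadcasting when the profile table (or just the replicated hot-key subset) fits in executor memory.
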
Easy · Technical
Write a Python function to deduplicate an in-memory list of user records (each a dict) by email in a case-insensitive way. When duplicate emails are found, merge records by summing numeric fields (like purchase_count) and keeping the most recent timestamp. The function should be O(n) in time and O(n) additional space. Provide code and describe edge cases.
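
One possible answer sketch, assuming the field names email and timestamp and that timestamps are mutually comparable (e.g. ISO-8601 strings or epoch numbers):

```python
def dedupe_by_email(records):
    """Single pass: O(n) time, O(n) extra space for the index dict."""
    merged = {}  # lowercased email -> merged record
    for rec in records:
        email = (rec.get("email") or "").strip().lower()
        if email not in merged:
            merged[email] = dict(rec)  # copy so inputs are not mutated
            continue
        kept = merged[email]
        for key, value in rec.items():
            if key == "timestamp":
                kept[key] = max(kept.get(key, value), value)  # most recent wins
            elif isinstance(value, (int, float)) and not isinstance(value, bool):
                kept[key] = kept.get(key, 0) + value  # sum numeric fields
            else:
                kept.setdefault(key, value)  # first non-numeric value wins
    return list(merged.values())
```

Edge cases worth calling out: records with a missing or None email all collapse under the empty string, bool is excluded because it subclasses int in Python, and mixed int/float sums silently promote to float.
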
Hard · Technical
Implement (or outline in pseudocode) a function that writes large per-user numeric feature arrays to disk in a space-efficient format that supports fast partial reads for online batched lookups. Discuss the trade-offs between row-oriented and columnar storage, compression codecs, and how you'd index the files for fast partial retrieval.
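
A hedged sketch of one simple design: pack float32 arrays into a single binary blob with a sidecar offset index, giving O(1) seeks for partial reads. The layout and function names are invented for illustration; a full answer would weigh this against Parquet/ORC plus a codec such as zstd:

```python
import numpy as np

def write_features(path, user_arrays):
    """user_arrays: dict of user_id -> 1-D float32 array."""
    index = {}
    with open(path, "wb") as f:
        for user_id, arr in user_arrays.items():
            arr = np.asarray(arr, dtype=np.float32)
            index[user_id] = (f.tell(), arr.size)  # (byte offset, element count)
            f.write(arr.tobytes())
    return index  # persist separately, e.g. as a JSON sidecar

def read_features(path, index, user_ids):
    """Read only the requested users' arrays via seek; no full-file scan."""
    out = {}
    with open(path, "rb") as f:
        for uid in user_ids:
            offset, count = index[uid]
            f.seek(offset)
            out[uid] = np.frombuffer(f.read(count * 4), dtype=np.float32)
    return out
```

Columnar formats win when queries slice a few features across many users; this row-per-user layout wins when lookups fetch whole vectors for a few users, which is the access pattern of online batched retrieval.
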
Easy · Behavioral
Tell me about a time when you discovered a data quality issue that would have affected model performance if left undetected. Describe the situation, how you diagnosed the issue, actions you took to mitigate it, and the preventative steps you implemented afterward (STAR format).
Medium · System Design
Design an offline and online feature pipeline for a recommendation model. Requirements: ingest 100k user events/sec into feature computations, serve 1M reads/sec for online lookups, guarantee point-in-time correctness for offline training, and support efficient backfills. Describe the components (streaming ingestion, feature store, online store), storage choices, consistency model, and monitoring you would implement.
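
The point-in-time correctness requirement is the subtle part. A minimal pandas sketch of an as-of join that keeps offline training leakage-free; all column names and values are invented for illustration:

```python
import pandas as pd

labels = pd.DataFrame({
    "user_id": [1, 2, 1],
    "label_ts": pd.to_datetime(["2024-01-05", "2024-01-07", "2024-01-10"]),
    "label": [0, 1, 1],
}).sort_values("label_ts")

features = pd.DataFrame({
    "user_id": [1, 2, 1],
    "feature_ts": pd.to_datetime(["2024-01-01", "2024-01-06", "2024-01-08"]),
    "purchase_count_7d": [3, 2, 5],
}).sort_values("feature_ts")

# For each label, attach the latest feature value computed at or before
# the label timestamp, never after, so no future information leaks in.
training_set = pd.merge_asof(
    labels, features,
    left_on="label_ts", right_on="feature_ts",
    by="user_id", direction="backward",
)
```

A production feature store implements the same semantics at scale by versioning feature values with event timestamps rather than overwriting them.
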
