
Data Processing and Transformation Questions

Focuses on algorithmic and engineering approaches to transforming and cleaning data at scale. Includes deduplication strategies, parsing and normalizing unstructured or semi-structured data, handling missing or inconsistent values, incremental and chunked processing for large datasets, batch-versus-streaming trade-offs, state management, efficient memory and compute usage, idempotency and error handling, and techniques for scaling and parallelizing transformation pipelines. Interviewers may assess problem solving, choice of algorithms and data structures, and pragmatic design for reliability and performance.

Hard · Technical
You observe severe skew when joining a large user profile table with an event stream for feature enrichment, causing hotspotting and slowdowns. Propose concrete strategies at the data partitioning, pipeline, and algorithmic levels to mitigate skew while preserving correctness (e.g., salting, broadcast joins, partial pre-aggregation, sampling). Explain trade-offs.
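
For reference, a minimal sketch of the salting idea in PySpark. The DataFrame names (events, profiles), the join key user_id, and the salt count are illustrative assumptions, not part of the question:

```python
from pyspark.sql import functions as F

NUM_SALTS = 16  # tune to the observed skew

# Spread each hot user_id across NUM_SALTS sub-keys on the large (event) side.
events_salted = events.withColumn(
    "salt", (F.rand() * NUM_SALTS).cast("int")
)

# Replicate each profile row once per salt value so every salted event
# partition still finds its matching profile.
profiles_salted = profiles.withColumn(
    "salt", F.explode(F.array(*[F.lit(i) for i in range(NUM_SALTS)]))
)

enriched = (
    events_salted
    .join(profiles_salted, on=["user_id", "salt"], how="left")
    .drop("salt")
)
```

The trade-off to surface in an answer: the profile side grows by a factor of NUM_SALTS, so salting pairs naturally with broadcasting when the profile table (or just the replicated hot-key subset) fits in executor memory.
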
Easy · Technical
Write a Python function to deduplicate an in-memory list of user records (each a dict) by email in a case-insensitive way. When duplicate emails are found, merge records by summing numeric fields (like purchase_count) and keeping the most recent timestamp. The function should be O(n) in time and O(n) additional space. Provide code and describe edge cases.
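
One possible answer sketch, assuming the field names email and timestamp and that timestamps are mutually comparable (e.g. ISO-8601 strings or epoch numbers):

```python
def dedupe_by_email(records):
    """Single pass: O(n) time, O(n) extra space for the index dict."""
    merged = {}  # lowercased email -> merged record
    for rec in records:
        email = (rec.get("email") or "").strip().lower()
        if email not in merged:
            merged[email] = dict(rec)  # copy so inputs are not mutated
            continue
        kept = merged[email]
        for key, value in rec.items():
            if key == "timestamp":
                kept[key] = max(kept.get(key, value), value)  # most recent wins
            elif isinstance(value, (int, float)) and not isinstance(value, bool):
                kept[key] = kept.get(key, 0) + value  # sum numeric fields
            else:
                kept.setdefault(key, value)  # first non-numeric value wins
    return list(merged.values())
```

Edge cases worth calling out: records with a missing or None email all collapse under the empty string, bool is excluded because it subclasses int in Python, and mixed int/float sums silently promote to float.
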
Hard · Technical
Implement (or outline in pseudocode) a function that writes large per-user numeric feature arrays to disk in a space-efficient format that supports fast partial reads for online batched lookups. Discuss the trade-offs between row-oriented and columnar storage, compression codecs, and how you'd index the files for fast partial retrieval.
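
A hedged sketch of one simple design: pack float32 arrays into a single binary blob with a sidecar offset index, giving O(1) seeks for partial reads. The layout and function names are invented for illustration; a full answer would weigh this against Parquet/ORC plus a codec such as zstd:

```python
import numpy as np

def write_features(path, user_arrays):
    """user_arrays: dict of user_id -> 1-D float32 array."""
    index = {}
    with open(path, "wb") as f:
        for user_id, arr in user_arrays.items():
            arr = np.asarray(arr, dtype=np.float32)
            index[user_id] = (f.tell(), arr.size)  # (byte offset, element count)
            f.write(arr.tobytes())
    return index  # persist separately, e.g. as a JSON sidecar

def read_features(path, index, user_ids):
    """Read only the requested users' arrays via seek; no full-file scan."""
    out = {}
    with open(path, "rb") as f:
        for uid in user_ids:
            offset, count = index[uid]
            f.seek(offset)
            out[uid] = np.frombuffer(f.read(count * 4), dtype=np.float32)
    return out
```

Columnar formats win when queries slice a few features across many users; this row-per-user layout wins when lookups fetch whole vectors for a few users, which is the access pattern of online batched retrieval.
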
Easy · Behavioral
Tell me about a time when you discovered a data quality issue that would have affected model performance if left undetected. Describe the situation, how you diagnosed the issue, actions you took to mitigate it, and the preventative steps you implemented afterward (STAR format).
Medium · System Design
Design an offline and online feature pipeline for a recommendation model. Requirements: ingest 100k user events/sec into feature computations, serve 1M reads/sec for online lookups, guarantee point-in-time correctness for offline training, and support efficient backfills. Describe the components (streaming ingestion, feature store, online store), storage choices, consistency model, and monitoring you would implement.
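
The point-in-time correctness requirement is the subtle part. A minimal pandas sketch of an as-of join that keeps offline training leakage-free; all column names and values are invented for illustration:

```python
import pandas as pd

labels = pd.DataFrame({
    "user_id": [1, 2, 1],
    "label_ts": pd.to_datetime(["2024-01-05", "2024-01-07", "2024-01-10"]),
    "label": [0, 1, 1],
}).sort_values("label_ts")

features = pd.DataFrame({
    "user_id": [1, 2, 1],
    "feature_ts": pd.to_datetime(["2024-01-01", "2024-01-06", "2024-01-08"]),
    "purchase_count_7d": [3, 2, 5],
}).sort_values("feature_ts")

# For each label, attach the latest feature value computed at or before
# the label timestamp, never after, so no future information leaks in.
training_set = pd.merge_asof(
    labels, features,
    left_on="label_ts", right_on="feature_ts",
    by="user_id", direction="backward",
)
```

A production feature store implements the same semantics at scale by versioning feature values with event timestamps rather than overwriting them.
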
