InterviewStack.io

Data Processing and Transformation Questions

Focuses on algorithmic and engineering approaches to transforming and cleaning data at scale. Includes deduplication strategies, parsing and normalizing unstructured or semi-structured data, handling missing or inconsistent values, incremental and chunked processing for large datasets, batch versus streaming trade-offs, state management, efficient memory and compute usage, idempotency and error handling, and techniques for scaling and parallelizing transformation pipelines. Interviewers may assess problem-solving, choice of algorithms and data structures, and pragmatic design for reliability and performance.

Hard · Technical
Partitioning keys in data lakes affect query performance. Given a dataset of web events frequently queried by date and country, design a partitioning scheme and file layout (file sizes, compaction policy) that optimizes common queries while controlling small-file counts. Explain how to validate and adapt the partitioning as query patterns change.
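
One possible layout, shown as a minimal PySpark sketch: it assumes the events are Parquet on object storage, and the bucket paths and the event_ts/country column names are hypothetical placeholders.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("web-events-partitioning").getOrCreate()

    # Hypothetical input path and columns; adjust to the real dataset.
    events = spark.read.parquet("s3://example-bucket/raw/web_events/")

    partitioned = (
        events
        .withColumn("event_date", F.to_date("event_ts"))
        # One shuffle partition per (date, country) keeps output files few and large,
        # which controls the small-file count inside each partition directory.
        .repartition("event_date", "country")
    )

    (partitioned
        .write
        .partitionBy("event_date", "country")  # layout: .../event_date=YYYY-MM-DD/country=XX/
        .mode("overwrite")
        .parquet("s3://example-bucket/curated/web_events/"))

With this layout, date-range and country filters prune partitions directly; file-size targets and the compaction policy would still need periodic validation against actual query patterns, as the question asks.
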
Easy · Technical
You operate an SRE-run log ingestion pipeline. Implement a parser in Go or Python that normalizes incoming semi-structured log lines into a canonical JSON schema with fields: timestamp (ISO8601), level (INFO/WARN/ERROR), message, source. Input may include inconsistent timestamp formats (e.g., '2024-02-01 15:04', 'Feb 1 2024 15:04:00', epoch seconds). Provide a function spec, example inputs and normalized outputs, and explain how you would add new formats safely in production.
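
A minimal Python sketch of such a normalizer, assuming lines look like "TIMESTAMP LEVEL message"; the normalize_line/_parse_timestamp names, the regex, and the format list are illustrative, and adding a new format means appending to the list behind a test rather than editing existing entries.

    import json
    import re
    from datetime import datetime, timezone

    # Known timestamp formats; a new format is appended here plus a test case.
    _TS_FORMATS = [
        "%Y-%m-%d %H:%M",        # 2024-02-01 15:04
        "%b %d %Y %H:%M:%S",     # Feb 1 2024 15:04:00
    ]

    _LINE_RE = re.compile(r"^(?P<ts>.+?)\s+(?P<level>INFO|WARN|ERROR)\s+(?P<msg>.*)$")

    def _parse_timestamp(raw: str) -> str:
        raw = raw.strip()
        if raw.isdigit():  # epoch seconds
            return datetime.fromtimestamp(int(raw), tz=timezone.utc).isoformat()
        for fmt in _TS_FORMATS:
            try:
                return datetime.strptime(raw, fmt).replace(tzinfo=timezone.utc).isoformat()
            except ValueError:
                continue
        raise ValueError(f"unrecognized timestamp: {raw!r}")

    def normalize_line(line: str, source: str) -> str:
        """Return one canonical JSON record for a raw log line."""
        m = _LINE_RE.match(line.strip())
        if not m:
            raise ValueError(f"unparseable line: {line!r}")
        return json.dumps({
            "timestamp": _parse_timestamp(m.group("ts")),
            "level": m.group("level"),
            "message": m.group("msg"),
            "source": source,
        })

    # Example: differently formatted inputs normalize to the same canonical shape.
    print(normalize_line("2024-02-01 15:04 INFO disk check ok", "host-a"))
    print(normalize_line("1706799840 ERROR disk full", "host-b"))
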
Hard · Technical
Architect a solution for storing and compacting large state (>1 TB) for a streaming job that incurs frequent small updates per key. Discuss using a log-structured merge-tree (LSM) store such as RocksDB, compaction strategies, read/write amplification, and operational techniques to prevent compaction stalls from affecting latency.
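
As a rough aid to the read/write-amplification part of the discussion, here is a back-of-the-envelope Python sketch; it is a simplified model of leveled compaction, not a RocksDB API, and the fanout and memtable numbers are illustrative.

    import math

    def leveled_write_amplification(total_bytes: float, memtable_bytes: float, fanout: int = 10) -> float:
        """Rough model: the number of levels grows logarithmically with state size,
        and each level rewrites incoming data roughly fanout/2 times on average."""
        levels = max(1, math.ceil(math.log(total_bytes / memtable_bytes, fanout)))
        return levels * fanout / 2

    # Example: ~1 TB of state with 64 MB memtables and fanout 10 gives about 5 levels,
    # i.e. roughly 25x write amplification before any tuning.
    print(leveled_write_amplification(1e12, 64e6))
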
Hard · System Design
Build a mini-design for a streaming transform that keeps per-customer counters and must support state larger than 1 TB. Which state backend would you choose (RocksDB, Redis, or in-memory with spill-to-disk)? How would you shard the state, compact old state, and design checkpoints so the job can recover within a target RTO of 5 minutes?
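
One small piece of that design, sketched in Python under the assumption of a fixed shard count: a stable hash from customer ID to shard, so counters, compaction, and checkpoints can all be organized and recovered per shard in parallel. The shard count and hashing scheme are illustrative choices.

    import hashlib

    NUM_SHARDS = 256  # assumption: chosen so each shard's slice of the >1 TB state fits one worker

    def shard_for_customer(customer_id: str, num_shards: int = NUM_SHARDS) -> int:
        """Stable hash so a customer's counter always lives on the same shard,
        which keeps checkpointing and recovery per-shard and parallel."""
        digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % num_shards

    print(shard_for_customer("customer-42"))
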
Easy · Technical
Implement a lightweight transformation test harness (in your language of choice) that takes a transformation function and a set of input-output example tuples, runs the transform, and reports mismatches with a detailed diff. Describe how you would integrate this harness into CI for SRE-owned transformations to prevent regressions.
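
A minimal sketch of such a harness in Python; the run_examples name and the toy transform are illustrative, outputs are assumed to be JSON-serializable, and CI integration would simply treat a non-zero failure count as a failing job.

    import difflib
    import json
    from typing import Any, Callable, Iterable, Tuple

    def run_examples(transform: Callable[[Any], Any],
                     examples: Iterable[Tuple[Any, Any]]) -> int:
        """Run transform over (input, expected) pairs and print a unified diff per mismatch.
        Returns the failure count so a CI job can exit non-zero on regressions."""
        failures = 0
        for i, (given, expected) in enumerate(examples):
            actual = transform(given)
            if actual != expected:
                failures += 1
                diff = difflib.unified_diff(
                    json.dumps(expected, indent=2, sort_keys=True).splitlines(),
                    json.dumps(actual, indent=2, sort_keys=True).splitlines(),
                    fromfile=f"expected[{i}]",
                    tofile=f"actual[{i}]",
                    lineterm="",
                )
                print("\n".join(diff))
        return failures

    if __name__ == "__main__":
        # Toy usage: the second example fails on purpose to show the diff output.
        examples = [
            ({"level": "info"}, {"level": "INFO"}),
            ({"level": "warn"}, {"level": "WARNING"}),
        ]
        raise SystemExit(1 if run_examples(lambda r: {"level": r["level"].upper()}, examples) else 0)
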
