InterviewStack.io

Data Manipulation and Transformation Questions

This topic covers techniques and best practices for cleaning, transforming, and preparing data for analysis and production systems. Candidates should be able to handle missing values, duplicates, and inconsistencies, and to apply normalization and denormalization, data typing and casting, and validation checks. Expect discussion of writing robust code that handles edge cases such as empty datasets and null values, defensive data validation, unit and integration testing for transformations, and strategies for performance and memory efficiency. At more senior levels, expect questions on the design of scalable, debuggable, and maintainable data pipelines and transformation architectures, idempotency, schema evolution, batch versus streaming trade-offs, observability and monitoring, versioning and reproducibility, and tool selection such as SQL, pandas, Spark, or dedicated ETL frameworks.
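As a rough illustration of the basics listed above (empty input, null values, duplicates, casting, and validation), here is a minimal sketch in plain Python. The `clean_rows` function and the `id` and `amount` field names are hypothetical, chosen only for this example; a real pipeline would more likely express the same steps in SQL, pandas, or Spark.

```python
def clean_rows(rows):
    """Deduplicate by id, cast amount to float, and set aside rows that fail validation.

    Field names ("id", "amount") are assumptions made for this sketch.
    Returns (cleaned, rejected); an empty input yields two empty lists.
    """
    seen = set()
    cleaned, rejected = [], []
    for row in rows:
        row_id = row.get("id")
        if row_id is None:
            # A row without a key cannot be deduplicated, so reject it.
            rejected.append((row, "missing id"))
            continue
        if row_id in seen:
            # Duplicate key: keep the first accepted occurrence.
            continue
        raw_amount = row.get("amount")
        try:
            # Treat None and empty string as a missing value, not an error.
            amount = None if raw_amount in (None, "") else float(raw_amount)
        except (TypeError, ValueError):
            rejected.append((row, "amount is not numeric"))
            continue
        seen.add(row_id)
        cleaned.append({"id": row_id, "amount": amount})
    return cleaned, rejected
```

Returning rejected rows with a reason, rather than silently dropping them, keeps the transformation debuggable and makes it easy to unit test each edge case.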

Easy, Technical
An alert fires: 'daily-sales ETL produced 0 rows' for yesterday's batch. As the on-call SRE, outline your 15-minute triage checklist: which systems and logs you inspect, what quick mitigations you run (e.g., re-run vs rollback), and how you communicate status to stakeholders.
Medium, Technical
A downstream job fails intermittently because an upstream service emits malformed JSON. How would you implement error handling that prevents cascading failures while preserving the malformed records for debugging? Consider dead-letter queues, schema validation at ingress, circuit breakers, and alerting thresholds in your design.
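One possible starting point for an answer is sketched below: validate each record at ingress, route failures to a dead-letter list with the raw payload preserved, and trip a simple breaker flag when the error rate crosses a threshold. This is a sketch under stated assumptions, not a reference design. The `REQUIRED_FIELDS` schema, the `route` function, and the 50% threshold are all hypothetical, and in-memory lists stand in for a real dead-letter queue.

```python
import json

# Hypothetical schema for this sketch: field name -> accepted Python type(s).
REQUIRED_FIELDS = {"event_id": str, "amount": (int, float)}


def validate(record):
    """Return a list of validation errors; an empty list means the record is valid."""
    if not isinstance(record, dict):
        return ["record is not a JSON object"]
    errors = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif isinstance(record[field], bool) or not isinstance(record[field], expected):
            # bool is a subclass of int in Python, so reject it explicitly.
            errors.append(f"bad type for field: {field}")
    return errors


def route(raw_lines, max_error_rate=0.5):
    """Split raw JSON strings into valid records and dead-letter entries.

    Returns (good, dead, tripped). `tripped` is True when the share of bad
    records exceeds max_error_rate, signalling the caller to pause and alert.
    """
    good, dead = [], []
    for raw in raw_lines:
        try:
            record = json.loads(raw)
        except json.JSONDecodeError as exc:
            # Keep the raw payload so the record can be inspected or replayed.
            dead.append({"raw": raw, "errors": [f"invalid JSON: {exc.msg}"]})
            continue
        errors = validate(record)
        if errors:
            dead.append({"raw": raw, "errors": errors})
        else:
            good.append(record)
    total = len(good) + len(dead)
    error_rate = len(dead) / total if total else 0.0
    return good, dead, error_rate > max_error_rate
```

A single bad record never raises, so the downstream job keeps running on the valid subset, while the breaker flag gives the caller a place to stop consuming and page someone when the upstream is broadly unhealthy.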
Hard, System Design
Design a globally distributed streaming ETL pipeline that processes clickstream events at a global peak load of 1M events/sec and delivers them to a central analytics store. Requirements: end-to-end latency <5s, cross-region failover, exactly-once processing per event, schema evolution support, and cost efficiency. Describe components, routing, partitioning strategy, replication, and failure modes.
Medium, Technical
You run transformations as Kubernetes CronJobs that sometimes overlap and cause resource contention and duplicates. Propose architectural and orchestration changes to ensure reliability, idempotency, and predictable scheduling (e.g., leader election, queue-backed workers, concurrency controls, or replacing CronJobs with event-driven triggers). Explain benefits and trade-offs.
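On the idempotency part of this question, one common approach is to make the write itself safe to repeat, so that an overlapping or retried run cannot create duplicates. The sketch below illustrates that idea with SQLite from the Python standard library: rows are keyed on a primary key and written with `INSERT OR REPLACE` inside a single transaction. The `sales` table, its columns, and the `run_batch` function are hypothetical. This addresses duplicates only; preventing the overlap itself is a scheduling concern (for example, the Kubernetes CronJob `concurrencyPolicy: Forbid` setting).

```python
import sqlite3


def run_batch(conn, run_date, rows):
    """Load (order_id, amount) pairs for run_date.

    Safe to call more than once with the same input: rows are keyed on
    order_id, so a repeat run overwrites instead of appending.
    The table and column names are assumptions made for this sketch.
    """
    with conn:  # one transaction: commits on success, rolls back on error
        conn.execute(
            "CREATE TABLE IF NOT EXISTS sales ("
            "order_id TEXT PRIMARY KEY, run_date TEXT, amount REAL)"
        )
        conn.executemany(
            "INSERT OR REPLACE INTO sales (order_id, run_date, amount) "
            "VALUES (?, ?, ?)",
            [(order_id, run_date, amount) for order_id, amount in rows],
        )


def row_count(conn):
    """Return the number of rows currently in the sales table."""
    return conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
```

For example, calling `run_batch` twice on the same connection with the same two rows leaves two rows in the table, not four. The trade-off is that the job needs a stable natural key per record; without one, idempotency has to come from elsewhere, such as replacing a whole partition per run.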
Hard, Technical
Case study: After a product launch, your analytics processing costs increased 4x. Analyze the increase and propose a prioritized plan to reduce cost while preserving essential SLAs. Consider infrastructure rightsizing, job consolidation, caching, format and compression changes, scheduling, use of spot instances, and SLA trade-offs. Provide measurable milestones and a rollback plan.
