InterviewStack.io

Data Manipulation and Transformation Questions

This topic covers techniques and best practices for cleaning, transforming, and preparing data for analysis and production systems. Candidates should be able to handle missing values, duplicates, inconsistency resolution, normalization and denormalization, data typing and casting, and validation checks. Expect discussion of writing robust code that handles edge cases such as empty datasets and null values, defensive data validation, unit and integration testing for transformations, and strategies for performance and memory efficiency. At more senior levels, expect questions on the design of scalable, debuggable, and maintainable data pipelines and transformation architectures, idempotency, schema evolution, batch versus streaming trade-offs, observability and monitoring, versioning and reproducibility, and tool selection such as SQL, pandas, Spark, or dedicated ETL frameworks.

Easy | Technical
76 practiced
An alert fires: 'daily-sales ETL produced 0 rows' for yesterday's batch. As the on-call SRE, outline your 15-minute triage checklist: which systems and logs you inspect, what quick mitigations you run (e.g., re-run vs rollback), and how you communicate status to stakeholders.
Easy | Technical
116 practiced
You must transform and aggregate a 10GB CSV on a machine with 4GB RAM using pandas. Describe a practical approach to perform transformations and aggregations without running out of memory. Mention chunked reading, explicit dtypes, early filtering, disk-backed options, and alternatives if pandas is unsuitable.
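One way to approach this question can be sketched as follows. This is a minimal illustration, not a reference answer: the column names (`region`, `amount`, `status`), the `"completed"` filter, and the function name `aggregate_in_chunks` are all hypothetical. It combines chunked reading, explicit dtypes, early filtering, and per-chunk partial aggregates that are merged at the end, so only one chunk is held in memory at a time.

```python
import io

import pandas as pd


def aggregate_in_chunks(source, chunksize=100_000):
    """Sum `amount` per `region` without loading the whole CSV at once.

    `source` is a file path or file-like object. Column names are
    assumptions made for this sketch.
    """
    reader = pd.read_csv(
        source,
        usecols=["region", "amount", "status"],  # skip columns we never use
        dtype={"region": str, "status": str, "amount": "float32"},
        chunksize=chunksize,  # returns an iterator of DataFrames
    )
    partials = []
    for chunk in reader:
        # Filter early so later steps touch as few rows as possible.
        chunk = chunk[chunk["status"] == "completed"]
        # Keep only a small per-chunk aggregate; upcast so the running
        # totals do not accumulate float32 rounding error.
        partials.append(chunk.groupby("region")["amount"].sum().astype("float64"))
    if not partials:
        # Empty input: return an empty result rather than failing in concat.
        return pd.Series(dtype="float64", name="amount")
    # Merge the per-chunk aggregates into one total per region.
    return pd.concat(partials).groupby(level=0).sum()


# Tiny in-memory stand-in for the 10GB file, read two rows at a time.
SAMPLE = (
    "region,amount,status,note\n"
    "east,10.0,completed,a\n"
    "west,7.5,completed,b\n"
    "east,5.0,completed,c\n"
    "west,100.0,refunded,d\n"
)
demo = aggregate_in_chunks(io.StringIO(SAMPLE), chunksize=2)
```

This pattern only works directly for aggregations that can be merged from partial results, such as sums, counts, minima, and maxima. Exact medians or distinct counts over high-cardinality keys usually need a disk-backed or out-of-core tool instead.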
Hard | System Design
64 practiced
Design a globally distributed streaming ETL pipeline that processes clickstream events into a central analytics store at a worldwide peak load of 1M events/sec. Requirements: end-to-end latency <5s, cross-region failover, exactly-once processing per event, schema evolution support, and cost efficiency. Describe components, routing, partitioning strategy, replication, and failure modes.
Medium | Technical
77 practiced
You notice a growing number of late-arriving events that cause daily reports to miss their SLAs. Propose specific monitoring metrics (e.g., lateness distribution, percent on-time), alerting thresholds, and a runbook to triage and mitigate late data. Include short-term and long-term remediation steps.
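The two metrics named in the question, lateness distribution and percent on-time, can be computed along the following lines. This is a rough sketch under stated assumptions: each event is represented as a hypothetical `(event_time, arrival_time)` pair, the one-hour on-time window is an arbitrary example value, and percentiles use the simple nearest-rank method.

```python
import math
from datetime import datetime, timedelta


def lateness_report(events, on_time_within=timedelta(hours=1)):
    """Summarize arrival delay for (event_time, arrival_time) pairs.

    Negative delays (clock skew) are clamped to zero.
    """
    delays = sorted(
        max((arrived - occurred).total_seconds(), 0.0)
        for occurred, arrived in events
    )
    if not delays:
        # Empty input: report explicitly instead of dividing by zero.
        return {"count": 0, "pct_on_time": None, "p50_s": None,
                "p95_s": None, "max_s": None}

    def nearest_rank(percent):
        # Nearest-rank percentile over the sorted delays.
        rank = math.ceil(percent / 100 * len(delays))
        return delays[max(rank - 1, 0)]

    limit = on_time_within.total_seconds()
    on_time = sum(1 for d in delays if d <= limit)
    return {
        "count": len(delays),
        "pct_on_time": 100.0 * on_time / len(delays),
        "p50_s": nearest_rank(50),
        "p95_s": nearest_rank(95),
        "max_s": delays[-1],
    }


# Four made-up events arriving 1 minute, 10 minutes, 2 hours, and 1 day late.
base = datetime(2024, 1, 1, 12, 0, 0)
sample_events = [
    (base, base + timedelta(seconds=60)),
    (base, base + timedelta(seconds=600)),
    (base, base + timedelta(hours=2)),
    (base, base + timedelta(days=1)),
]
report = lateness_report(sample_events)
```

An alert could then be keyed on `pct_on_time` falling below a target, or on `p95_s` exceeding the reporting cutoff. The right thresholds depend on the SLA and are not something this sketch can decide.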
Medium | Technical
76 practiced
As an SRE, how would you influence data engineering teams to adopt defensive data validation and observability practices? Provide a concrete plan including documentation, onboarding checklists, shared libraries, automated pre-commit checks, example dashboards, and incentives or gating for production deployments.
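To make the "shared libraries" and "gating" parts of such a plan concrete, a shared validation helper might look roughly like the sketch below. Everything here is hypothetical: the names `ColumnRule`, `validate_rows`, and `gate`, the dict-per-row representation, and the bad-row threshold are illustrative choices, not an existing library's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnRule:
    """Expected Python type and nullability for one column."""
    name: str
    type_: type
    nullable: bool = False


def validate_rows(rows, rules):
    """Return (row_index, column, problem) tuples; an empty list means clean."""
    problems = []
    for index, row in enumerate(rows):
        for rule in rules:
            if rule.name not in row:
                problems.append((index, rule.name, "missing column"))
                continue
            value = row[rule.name]
            if value is None:
                if not rule.nullable:
                    problems.append((index, rule.name, "unexpected null"))
            elif not isinstance(value, rule.type_):
                problems.append((
                    index,
                    rule.name,
                    f"expected {rule.type_.__name__}, got {type(value).__name__}",
                ))
    return problems


def gate(rows, rules, max_bad_fraction=0.0):
    """Raise ValueError if the batch is empty or has too many bad rows."""
    rows = list(rows)
    if not rows:
        raise ValueError("empty dataset")
    problems = validate_rows(rows, rules)
    bad_rows = {index for index, _, _ in problems}
    if len(bad_rows) / len(rows) > max_bad_fraction:
        raise ValueError(f"{len(bad_rows)} of {len(rows)} rows failed validation")
    return rows


# Made-up batch: one clean row, one with a null and a wrong type, one missing a column.
sample_rules = [ColumnRule("order_id", int), ColumnRule("amount", float)]
sample_rows = [
    {"order_id": 1, "amount": 9.5},
    {"order_id": None, "amount": "x"},
    {"amount": 1.0},
]
found = validate_rows(sample_rows, sample_rules)
```

A helper like this can run in a pre-commit check against fixture data and again as a deployment gate, so teams get the same failure messages in both places.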
