
Data Cleaning and Business Logic Edge Cases Questions

Covers data-centric edge cases and complex business-rule interactions in queries and data pipelines. Topics include cleaning and normalizing data, handling nulls and type mismatches, deduplication strategies, treating inconsistent or malformed records, validating results and detecting anomalies, using conditional logic for data transformation, understanding SQL null semantics, and designing queries that correctly implement date boundaries and domain-specific business rules. The emphasis is on producing robust results in the presence of imperfect data and complex requirements.

Easy · Technical (16 practiced)
Design a deterministic transformation to canonicalize phone numbers into an E.164-like format for storage given inconsistent source formats: missing country code, leading zeros, extensions, spaces, and punctuation. Describe a Spark or Python implementation, how you would validate numbers, and how you'd handle ambiguous or invalid entries in ETL.
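One way to approach this question is a pure, deterministic cleaning function. The sketch below is a minimal Python version, assuming a hypothetical US-centric default country code and a simple length-based validity check (a real pipeline would use full E.164 metadata per country); ambiguous or invalid entries return None so ETL can route them to a quarantine table.

```python
import re
from typing import Optional

DEFAULT_COUNTRY_CODE = "1"  # assumption: US-centric source data

def canonicalize_phone(raw: str, default_cc: str = DEFAULT_COUNTRY_CODE) -> Optional[str]:
    """Best-effort E.164-like canonicalization; None signals invalid/ambiguous."""
    if not raw or not raw.strip():
        return None
    had_plus = raw.strip().startswith("+")
    # Strip a trailing extension such as "x99", "ext 42", or "extension 7".
    raw = re.split(r"(?i)\s*(?:x|ext\.?|extension)\s*\d+\s*$", raw)[0]
    digits = re.sub(r"\D", "", raw)      # drop spaces and punctuation
    digits = digits.lstrip("0")          # drop trunk-prefix leading zeros
    if had_plus:
        candidate = digits               # country code already present
    elif len(digits) == 10:
        candidate = default_cc + digits  # assume the default country code
    else:
        candidate = digits               # ambiguous: validate by length only
    # E.164 caps numbers at 15 digits; also require a plausible minimum.
    if not (8 <= len(candidate) <= 15):
        return None
    return "+" + candidate
```

Because the function is deterministic and side-effect free, it can be wrapped directly in a Spark UDF, and invalid rows can be filtered into a dead-letter output for manual review.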
Easy · Technical (12 practiced)
Explain idempotency for ETL/ELT jobs. List at least three concrete patterns to make batch and streaming jobs idempotent (describe upserts, de-duplication keys, atomic commits, and checkpointing). For each, discuss advantages, implementation complexity, and performance trade-offs.
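A compact way to demonstrate one of these patterns is the upsert-by-natural-key approach. This sketch uses an in-memory dict as a stand-in for a target table; because each record is keyed, replaying the same batch (a common failure-recovery scenario) leaves the target unchanged:

```python
def idempotent_load(target: dict, batch: list, key: str = "id") -> dict:
    """Upsert each record by its natural key; replaying the same batch
    produces the same final state (idempotent by construction)."""
    for record in batch:
        target[record[key]] = record  # last-writer-wins upsert
    return target

target = {}
batch = [{"id": 1, "amount": 10}, {"id": 2, "amount": 5}]
idempotent_load(target, batch)
state_after_first_run = dict(target)
idempotent_load(target, batch)  # simulated retry / replay
assert target == state_after_first_run  # no duplicates, identical state
```

The same principle carries over to MERGE statements in warehouses or keyed upserts in stream processors; the trade-off is that the target must support keyed writes, which plain append-only sinks do not.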
Hard · Technical (12 practiced)
You're tasked with rolling out data contracts across multiple producer teams. Propose an enforcement architecture that catches contract violations early, provides clear feedback to producers, and supports gradual migration. Include a schema registry, producer-side validation, CI integration, enforcement policies (warn, quarantine, reject), SLOs for contracts, and operational tooling for rollback and metrics.
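The warn/quarantine/reject policy tiers can be illustrated with a small producer-side validator. The contract dict and field names below are hypothetical; a real deployment would pull schemas from a registry (e.g. Confluent Schema Registry or a JSON Schema store) rather than hard-coding them:

```python
from enum import Enum

class Policy(Enum):
    WARN = "warn"            # pass through, surface feedback to the producer
    QUARANTINE = "quarantine"  # sideline the record for review
    REJECT = "reject"        # fail fast, block the write

# Hypothetical contract: required fields and their expected types.
CONTRACT = {"user_id": int, "event_type": str, "ts": str}

def validate(record: dict, contract: dict) -> list:
    errors = []
    for field, expected in contract.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            errors.append(f"bad type for {field}: {type(record[field]).__name__}")
    return errors

def enforce(record: dict, policy: Policy, quarantine: list, contract: dict = CONTRACT):
    """Apply the contract under the given policy; returns (record_or_None, errors)."""
    errors = validate(record, contract)
    if not errors:
        return record, []
    if policy is Policy.WARN:
        return record, errors
    if policy is Policy.QUARANTINE:
        quarantine.append((record, errors))
        return None, errors
    raise ValueError(f"contract violation: {errors}")  # Policy.REJECT
```

Starting teams on WARN, moving to QUARANTINE, and only then REJECT is one way to support the gradual migration the question asks for.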
Medium · Technical (16 practiced)
Propose a testing strategy for a job that maps raw clickstream events into session records. Include examples of unit tests, integration tests, property tests, and end-to-end validations. List the critical edge cases (out-of-order events, missing timestamps, duplicate events, very long sessions) and describe automation and CI integration for these tests.
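A unit-testable answer benefits from a small reference implementation to test against. This sketch (gap threshold and tuple layout are assumptions) sorts input so out-of-order events are handled, drops duplicate event IDs, and splits sessions on an inactivity gap, with an inline test covering two of the edge cases the question lists:

```python
def sessionize(events, gap_seconds=1800):
    """Group (user_id, ts, event_id) tuples into sessions split on
    inactivity gaps > gap_seconds; tolerates out-of-order and
    duplicate events."""
    seen, sessions, current = set(), [], []
    for user, ts, eid in sorted(events, key=lambda e: (e[0], e[1])):
        if eid in seen:
            continue  # duplicate event: keep first occurrence only
        seen.add(eid)
        if current and (user != current[-1][0] or ts - current[-1][1] > gap_seconds):
            sessions.append(current)  # gap or user change closes the session
            current = []
        current.append((user, ts, eid))
    if current:
        sessions.append(current)
    return sessions

# Out-of-order input plus a duplicate still yields two sessions for user "a".
events = [("a", 100, "e2"), ("a", 90, "e1"), ("a", 90, "e1"), ("a", 5000, "e3")]
assert len(sessionize(events)) == 2
```

Each listed edge case (missing timestamps, very long sessions, etc.) maps naturally to one such unit test, and a property test can assert invariants like "every input event appears in exactly one session".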
Hard · Technical (12 practiced)
Architect a streaming enrichment solution: a high-throughput event stream (millions of events per hour) must be enriched with slowly-changing user profile data that itself updates frequently. Requirements: per-event latency under 200 ms, bounded memory, and eventual consistency for late updates. Explain the use of compacted Kafka topics, local caches with TTL, asynchronous enrichment fallback, state stores, and strategies for handling out-of-order profile updates.
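The local-cache-with-TTL piece of this architecture can be sketched in a few lines. Here a dict stands in for a state store, a synchronous callable stands in for the asynchronous profile lookup, and the TTL and eviction policy are illustrative assumptions (a production system would use a real state store or a library like a compacted-topic-backed KTable):

```python
import time

class TTLCache:
    """Bounded-memory profile cache: fresh entries are served locally;
    expired or missing entries fall back to a lookup function."""
    def __init__(self, ttl_seconds=60.0, max_size=10_000):
        self.ttl, self.max_size, self.store = ttl_seconds, max_size, {}

    def get(self, key, fallback):
        entry = self.store.get(key)
        now = time.monotonic()
        if entry is not None and now - entry[1] < self.ttl:
            return entry[0]                          # fresh cache hit
        value = fallback(key)                        # stand-in for async lookup
        if len(self.store) >= self.max_size:
            self.store.pop(next(iter(self.store)))   # crude FIFO eviction
        self.store[key] = (value, now)
        return value

def enrich(event, cache, profile_lookup):
    """Join an event with its user profile via the cache."""
    return {**event, "profile": cache.get(event["user_id"], profile_lookup)}

profiles = {"u1": {"tier": "gold"}}
cache = TTLCache(ttl_seconds=60)
assert enrich({"user_id": "u1"}, cache, profiles.get)["profile"]["tier"] == "gold"
```

The TTL bounds staleness (giving eventual consistency when profiles update), max_size bounds memory, and the fallback path is where an async fetch or a state-store read would plug in.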
