InterviewStack.io

Data Cleaning and Business Logic Edge Cases Questions

Covers handling data-centric edge cases and complex business-rule interactions in queries and data pipelines. Topics include cleaning and normalizing data, handling nulls and type mismatches, deduplication strategies, treating inconsistent or malformed records, validating results and detecting anomalies, using conditional logic for data transformation, understanding null semantics in SQL, and designing queries that correctly implement date boundaries and domain-specific business rules. The emphasis is on producing robust results in the presence of imperfect data and complex requirements.

Easy · Technical
Your source system provides dates in free-form strings across rows: '2024-03-01', '03/01/2024', '1 Mar 2024', '20240301'. As a data analyst writing SQL transformations, describe a robust approach to parse and normalize these into an ISO date column, marking unparseable rows for review. Include how you'd prioritize patterns, avoid false parses, and validate the result set.
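One possible way to approach this is sketched below in Python rather than SQL, so the logic can be run and tested directly. Patterns are tried in priority order, with the unambiguous ISO-style shapes first. Each pattern has a strict regex guard, so a string is only handed to the parser whose shape it fully matches. The function name `normalize_date`, the status labels, and the choice to read `03/01/2024` as month-first (US style) are assumptions for illustration, and the source system's convention should be confirmed before adopting them. The `%b` month abbreviation also assumes an English locale.

```python
import re
from datetime import datetime

# (regex guard, strptime format), tried in priority order: unambiguous first.
PATTERNS = [
    (re.compile(r"\d{4}-\d{2}-\d{2}"), "%Y-%m-%d"),
    (re.compile(r"\d{8}"), "%Y%m%d"),
    (re.compile(r"\d{1,2} [A-Za-z]{3} \d{4}"), "%d %b %Y"),
    # ASSUMPTION: slash dates are month-first (US). Confirm with the source owner.
    (re.compile(r"\d{2}/\d{2}/\d{4}"), "%m/%d/%Y"),
]


def normalize_date(raw):
    """Return (iso_date or None, status), where status is 'ok' or 'unparseable'."""
    if raw is None:
        return None, "unparseable"
    text = raw.strip()
    for guard, fmt in PATTERNS:
        # The guard must match the whole string, so partial shapes never parse.
        if guard.fullmatch(text):
            try:
                return datetime.strptime(text, fmt).date().isoformat(), "ok"
            except ValueError:
                # Right shape but impossible date, e.g. 2024-02-30.
                return None, "unparseable"
    return None, "unparseable"


samples = ["2024-03-01", "03/01/2024", "1 Mar 2024", "20240301", "2024-02-30", "n/a"]
results = {s: normalize_date(s) for s in samples}
```

Rows with status `unparseable` would go to a review queue. A reasonable validation step is to compare counts per matched pattern against expectations and to check that the parsed dates fall within a plausible range for the business.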
Medium · Technical
Design an approach to validate and standardize international postal addresses in a customer table. Discuss trade-offs between using third-party address verification APIs versus in-house normalization, caching strategies to reduce API costs, batching modes (real-time vs scheduled), and how to handle addresses that fail validation without blocking user flows.
Hard · Technical
Implement a SQL-based tiered deduplication: given customers(customer_id, source_system, name, email, created_at, completeness_score) where duplicate groups are defined by a fuzzy_key, produce a canonical customer per fuzzy_key by applying business precedence rules: prefer source_system 'crm' over 'import', higher completeness_score, then most recent created_at. Show the SQL pattern and explain how to merge remaining attributes from lower-ranked records.
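A minimal sketch of one such pattern follows, run through Python's built-in `sqlite3` module so it is self-contained. It assumes `fuzzy_key` already exists as a column on `customers` (the question's column list omits it), that SQLite 3.25 or later is available for window functions, and that the sample rows are purely illustrative. Precedence is encoded in the `ORDER BY` of `ROW_NUMBER()`. Attributes missing on the winner are filled from the best-ranked record that has a non-null value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER, fuzzy_key TEXT, source_system TEXT, name TEXT,
    email TEXT, created_at TEXT, completeness_score REAL
);
-- Illustrative rows only.
INSERT INTO customers VALUES
 (1, 'k1', 'import', 'Ann Lee', 'ann@example.com', '2024-05-01', 0.9),
 (2, 'k1', 'crm',    'Ann Lee', NULL,              '2024-01-01', 0.6),
 (3, 'k2', 'import', 'Bo Chan', NULL,              '2024-02-01', 0.5),
 (4, 'k2', 'import', NULL,      'bo@example.com',  '2024-03-01', 0.5);
""")

CANONICAL_SQL = """
WITH ranked AS (
    SELECT c.*,
           ROW_NUMBER() OVER (
               PARTITION BY fuzzy_key
               ORDER BY CASE source_system WHEN 'crm' THEN 0
                                           WHEN 'import' THEN 1
                                           ELSE 2 END,      -- source precedence
                        completeness_score DESC,            -- then completeness
                        created_at DESC,                    -- then recency
                        customer_id                         -- deterministic tiebreak
           ) AS rn
    FROM customers AS c
)
SELECT r.fuzzy_key,
       r.customer_id,
       r.source_system,
       -- Merge: first non-null value in rank order within the group.
       (SELECT x.name  FROM ranked AS x
         WHERE x.fuzzy_key = r.fuzzy_key AND x.name IS NOT NULL
         ORDER BY x.rn LIMIT 1) AS name,
       (SELECT x.email FROM ranked AS x
         WHERE x.fuzzy_key = r.fuzzy_key AND x.email IS NOT NULL
         ORDER BY x.rn LIMIT 1) AS email
FROM ranked AS r
WHERE r.rn = 1
ORDER BY r.fuzzy_key
"""

rows = conn.execute(CANONICAL_SQL).fetchall()
```

In this sample, the `crm` record wins group `k1` despite its lower completeness score, and its missing email is filled from the `import` record. Engines that support `FIRST_VALUE(...) IGNORE NULLS` can express the merge without correlated subqueries. SQLite does not, which is why the subquery form is used here.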
Medium · Technical
You have a table users_raw with many duplicate entries. Write an SQL transformation using window functions that deduplicates by normalized email (lowercase/trimmed) and keeps the most complete record by counting non-null columns and then by latest created_at as a tiebreaker. Describe how you would merge non-null fields from secondary records into the chosen canonical one.
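The following is a hedged sketch of one way to answer, again executed through Python's `sqlite3` module. The columns `id`, `name`, and `phone` are hypothetical, since the question does not list the schema of `users_raw`, and window functions require SQLite 3.25 or later. Completeness is counted by summing `IS NOT NULL` flags, and the merge takes the canonical row's value where present, otherwise any non-null value from the group.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical schema and rows for illustration.
CREATE TABLE users_raw (id INTEGER, email TEXT, name TEXT, phone TEXT, created_at TEXT);
INSERT INTO users_raw VALUES
 (1, ' Ann@Example.com ', 'Ann', NULL,       '2024-01-01'),
 (2, 'ann@example.com',   'Ann', '555-0100', '2024-01-05'),
 (3, 'ANN@example.com',   NULL,  NULL,       '2024-02-01'),
 (4, 'bo@example.com',    NULL,  '555-0199', '2024-01-01'),
 (5, 'bo@example.com',    'Bo',  NULL,       '2024-03-01');
""")

DEDUP_SQL = """
WITH normalized AS (
    SELECT *,
           LOWER(TRIM(email)) AS email_norm,
           -- Each comparison yields 0 or 1, so the sum counts filled columns.
           (name IS NOT NULL) + (phone IS NOT NULL) AS filled
    FROM users_raw
    WHERE email IS NOT NULL AND TRIM(email) <> ''
),
ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY email_norm
               ORDER BY filled DESC, created_at DESC, id DESC
           ) AS rn
    FROM normalized
)
SELECT email_norm,
       MAX(CASE WHEN rn = 1 THEN id END) AS canonical_id,
       -- Prefer the canonical row's value; otherwise fall back to the group.
       COALESCE(MAX(CASE WHEN rn = 1 THEN name  END), MAX(name))  AS name,
       COALESCE(MAX(CASE WHEN rn = 1 THEN phone END), MAX(phone)) AS phone
FROM ranked
GROUP BY email_norm
ORDER BY email_norm
"""

deduped = conn.execute(DEDUP_SQL).fetchall()
```

The `MAX(...)` fallback is a simplification: when several secondary records disagree, it picks the largest value by sort order rather than the best-ranked one. Rows with a null or empty email are excluded here and would need their own handling, since they cannot be grouped by this key.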
Easy · Technical
When deduplicating by a unique user identifier and created_at timestamp, you must choose whether to keep the earliest or latest record per user. Describe the business and technical factors that should influence that decision, and outline a SQL approach to implement either choice while preserving an audit trail of removed records.
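As a rough illustration of the SQL side, the sketch below copies the losing rows into an audit table before deleting them, all within one transaction. The table names `users` and `users_removed`, the `plan` column, and the helper `dedupe_with_audit` are hypothetical. It uses SQLite's implicit `rowid` to identify physical rows, so another engine would need a surrogate key instead.

```python
import sqlite3


def dedupe_with_audit(conn, keep="latest"):
    """Keep one row per user_id; move the rest to users_removed."""
    if keep not in ("earliest", "latest"):
        raise ValueError("keep must be 'earliest' or 'latest'")
    # Whitelisted above, so interpolating the direction is safe.
    direction = "ASC" if keep == "earliest" else "DESC"
    ranked = f"""
        SELECT rowid AS rid,
               ROW_NUMBER() OVER (
                   PARTITION BY user_id
                   ORDER BY created_at {direction}, rowid {direction}
               ) AS rn
        FROM users
    """
    with conn:  # single transaction: audit insert and delete succeed or fail together
        conn.execute(f"""
            INSERT INTO users_removed (user_id, created_at, plan, removed_reason)
            SELECT u.user_id, u.created_at, u.plan, 'duplicate: kept {keep}'
            FROM users AS u
            JOIN ({ranked}) AS r ON r.rid = u.rowid
            WHERE r.rn > 1
        """)
        conn.execute(f"""
            DELETE FROM users
            WHERE rowid IN (SELECT rid FROM ({ranked}) WHERE rn > 1)
        """)


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (user_id INTEGER, created_at TEXT, plan TEXT);
CREATE TABLE users_removed (user_id INTEGER, created_at TEXT, plan TEXT, removed_reason TEXT);
INSERT INTO users VALUES
 (1, '2024-01-01', 'free'),
 (1, '2024-06-01', 'pro'),
 (2, '2024-02-01', 'free');
""")

dedupe_with_audit(conn, keep="latest")
kept = conn.execute(
    "SELECT user_id, created_at, plan FROM users ORDER BY user_id").fetchall()
removed = conn.execute(
    "SELECT user_id, created_at, plan, removed_reason FROM users_removed").fetchall()
```

Keeping the latest row tends to suit current-state attributes such as plan or contact details, while keeping the earliest suits first-touch facts such as signup date. Either way, the audit table lets the decision be reviewed or reversed.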
