Data Cleaning and Business Logic Edge Cases Questions

Covers handling data-centric edge cases and complex business rule interactions in queries and data pipelines. Topics include cleaning and normalizing data, handling nulls and type mismatches, deduplication strategies, treating inconsistent or malformed records, validating results and detecting anomalies, using conditional logic for data transformation, understanding null semantics in SQL, and designing queries that correctly implement date boundaries and domain-specific business rules. Emphasis is on producing robust results in the presence of imperfect data and complex requirements.

Hard | Technical

An ETL job silently dropped 0.5% of rows after a schema change where a free-text column became numeric; downstream dashboards show unexpected drops in counts. As the on-call data analyst, describe the forensic steps you would take to identify the scope of data loss, recover missing rows if possible, and implement safeguards to prevent silent drops in the future.
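
One way to start scoping the loss is an anti-join between the retained raw input and the loaded table, followed by a check of which missing rows fail numeric parsing. The sketch below is illustrative only: the table names `staging_raw` and `warehouse`, the column `qty_text`, and the helper `parses_as_number` are hypothetical, and it assumes the raw input is still available and that rows share a stable key (`row_id`).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical tables: raw input kept as text, and the loaded numeric table.
    CREATE TABLE staging_raw (row_id INTEGER PRIMARY KEY, qty_text TEXT);
    CREATE TABLE warehouse   (row_id INTEGER PRIMARY KEY, qty REAL);
    INSERT INTO staging_raw VALUES
        (1, '10'), (2, '3.5'), (3, 'N/A'), (4, ' 7 '), (5, NULL), (6, '1,200');
    -- Rows 3 and 6 never reached the warehouse.
    INSERT INTO warehouse VALUES (1, 10), (2, 3.5), (4, 7), (5, NULL);
""")


def parses_as_number(text):
    """Return True if the raw text would survive a strict numeric cast."""
    if text is None:
        return True  # NULL is representable in a numeric column
    try:
        float(text.strip())
        return True
    except ValueError:
        return False


# Anti-join: source rows with no matching row in the target.
missing = conn.execute("""
    SELECT s.row_id, s.qty_text
    FROM staging_raw AS s
    LEFT JOIN warehouse AS w ON w.row_id = s.row_id
    WHERE w.row_id IS NULL
    ORDER BY s.row_id
""").fetchall()

source_count = conn.execute("SELECT COUNT(*) FROM staging_raw").fetchone()[0]
loss_rate = len(missing) / source_count

# Missing rows whose raw value cannot be cast point at the schema change.
unparseable = [row for row in missing if not parses_as_number(row[1])]

print(f"missing rows: {missing}")
print(f"loss rate: {loss_rate:.1%}")
print(f"missing rows explained by failed casts: {len(unparseable)} of {len(missing)}")
```

If every missing row is also unparseable, the cast is the likely cause, and those rows could be routed to a quarantine table instead of being discarded. A source-versus-target row count assertion after each load is one possible safeguard against a repeat.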

Hard | Technical

You operate an event stream where events sometimes arrive late and out-of-order. Design how to compute daily unique active users and daily revenue in both a batch and streaming architecture such that late arrivals up to 48 hours are accounted for properly. Explain windowing, watermarking, allowed-lateness, retractions, and how you'd validate final daily totals.
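
The streaming side of such a design can be sketched in plain Python, as a toy model rather than any particular engine's API. Events are bucketed by event time into daily windows, a day stays open until the watermark passes its end plus 48 hours, and each accepted event returns a revised total that supersedes the previous one (a simple stand-in for retractions). The class `DailyAggregator`, its fields, and the use of the latest arrival time as the watermark are all simplifying assumptions; timestamps are assumed to be timezone-aware UTC and revenue is held in integer cents.

```python
from collections import defaultdict
from datetime import datetime, time, timedelta, timezone

ALLOWED_LATENESS = timedelta(hours=48)


class DailyAggregator:
    """Toy event-time aggregator with daily windows and allowed lateness."""

    def __init__(self):
        self.users = defaultdict(set)          # day -> distinct user ids
        self.revenue_cents = defaultdict(int)  # day -> revenue in cents
        self.seen_event_ids = set()            # for idempotent processing
        self.watermark = datetime.min.replace(tzinfo=timezone.utc)
        self.too_late = []                     # event ids past allowed lateness

    def process(self, event_id, user_id, event_time, revenue_cents, arrival_time):
        # Simplified watermark: the latest arrival time observed so far.
        self.watermark = max(self.watermark, arrival_time)

        day = event_time.date()
        window_end = datetime.combine(
            day + timedelta(days=1), time.min, tzinfo=timezone.utc
        )

        # The window is closed once the watermark passes end + allowed lateness.
        if self.watermark > window_end + ALLOWED_LATENESS:
            self.too_late.append(event_id)
            return None

        # Ignore redelivered events so totals are not double counted.
        if event_id in self.seen_event_ids:
            return None
        self.seen_event_ids.add(event_id)

        self.users[day].add(user_id)
        self.revenue_cents[day] += revenue_cents

        # Revised result for the day; downstream should upsert on `day`.
        return (day, len(self.users[day]), self.revenue_cents[day])
```

A batch job that recomputes each day from the full event log after the 48-hour window has closed gives an independent total to compare against the streaming result, and the `too_late` list gives a count of events that neither path should silently ignore.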

Hard | System Design

Design an automated testing framework for data transformations that includes unit-level SQL tests, property-based tests, golden-file comparisons, and statistical checks that detect edge cases such as schema drift, null explosion, or cardinality anomalies. Describe storage of test fixtures, CI integration, and how to surface failing tests to both engineers and analysts.
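
A minimal sketch of what a single test in such a framework might look like, using an in-memory SQLite fixture. It combines a golden-result comparison, a schema-drift check, a null-rate check, and a cardinality check. The transformation, the table names `raw_orders` and `user_totals`, and the helpers `null_rate` and `column_names` are hypothetical; a real framework would load fixtures from versioned files and run under a test runner in CI.

```python
import sqlite3


def build_fixture():
    """Create a small in-memory input table (hypothetical fixture)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE raw_orders (user_id INTEGER, amount REAL)")
    conn.executemany(
        "INSERT INTO raw_orders VALUES (?, ?)",
        [(1, 10.0), (1, None), (2, 5.5)],
    )
    return conn


def run_transformation(conn):
    """The transformation under test: per-user totals with NULL treated as 0."""
    conn.execute("""
        CREATE TABLE user_totals AS
        SELECT user_id, SUM(COALESCE(amount, 0)) AS total_amount
        FROM raw_orders
        GROUP BY user_id
    """)


def null_rate(conn, table, column):
    """Fraction of rows where `column` is NULL (identifiers are trusted here)."""
    total, non_null = conn.execute(
        f"SELECT COUNT(*), COUNT({column}) FROM {table}"
    ).fetchone()
    return 0.0 if total == 0 else 1 - non_null / total


def column_names(conn, table):
    """Column names in declaration order, used to detect schema drift."""
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]


def test_user_totals():
    conn = build_fixture()
    run_transformation(conn)

    # Golden comparison: output must match the expected rows exactly.
    actual = conn.execute(
        "SELECT user_id, total_amount FROM user_totals ORDER BY user_id"
    ).fetchall()
    assert actual == [(1, 10.0), (2, 5.5)]

    # Schema drift: column set and order must not change unnoticed.
    assert column_names(conn, "user_totals") == ["user_id", "total_amount"]

    # Null explosion: the output column must never be NULL.
    assert null_rate(conn, "user_totals", "total_amount") == 0.0

    # Cardinality: exactly one output row per distinct input user.
    distinct_users = conn.execute(
        "SELECT COUNT(DISTINCT user_id) FROM raw_orders"
    ).fetchone()[0]
    assert len(actual) == distinct_users
```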

Easy | Technical

Given a relational table orders(order_id INT, user_id INT, amount DECIMAL(10,2), status VARCHAR, shipped_at TIMESTAMP), write an ANSI SQL query that returns total_amount_per_user and total_orders_per_user for orders in 2024 (dated by shipped_at, the only timestamp in the table) while treating NULL amount as 0 and excluding orders where status = 'canceled' or status IS NULL. Explain the difference between COUNT(*) and COUNT(amount) when amount contains NULLs and describe how your query handles null semantics so aggregated results are robust.
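
One possible answer, run against SQLite from Python so the behavior can be checked. It assumes the 2024 filter applies to shipped_at and uses a half-open range so the year boundary is exact; the sample rows are made up for illustration, and SQLite stores the timestamps as ISO-8601 text here. The extra column orders_with_amount is included only to contrast COUNT(amount) with COUNT(*).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        order_id   INTEGER,
        user_id    INTEGER,
        amount     DECIMAL(10,2),
        status     VARCHAR,
        shipped_at TIMESTAMP
    )
""")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?, ?)",
    [
        (1, 10, 25.00, "shipped",   "2024-01-15 09:00:00"),
        (2, 10, None,  "shipped",   "2024-03-02 12:30:00"),  # NULL amount -> 0
        (3, 10, 40.00, "canceled",  "2024-04-01 08:00:00"),  # excluded: canceled
        (4, 11, 15.50, None,        "2024-05-20 10:00:00"),  # excluded: NULL status
        (5, 11, 60.00, "shipped",   "2023-12-31 23:59:59"),  # excluded: not in 2024
        (6, 11, 12.25, "delivered", "2024-12-31 23:59:59"),
    ],
)

QUERY = """
    SELECT user_id,
           SUM(COALESCE(amount, 0)) AS total_amount_per_user,
           COUNT(*)                 AS total_orders_per_user,
           COUNT(amount)            AS orders_with_amount
    FROM orders
    WHERE status IS NOT NULL          -- explicit, although <> already drops NULLs
      AND status <> 'canceled'
      AND shipped_at >= '2024-01-01 00:00:00'
      AND shipped_at <  '2025-01-01 00:00:00'
    GROUP BY user_id
    ORDER BY user_id
"""

result = conn.execute(QUERY).fetchall()
for row in result:
    print(row)
```

For user 10 the query counts two orders with COUNT(*) but only one with COUNT(amount), because COUNT(column) skips NULLs while COUNT(*) counts rows. The comparison status <> 'canceled' evaluates to unknown for a NULL status, so those rows are filtered out even without the IS NOT NULL predicate; spelling it out documents the intent. Rows with a NULL shipped_at also fail the range predicates and are excluded.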

Medium | System Design

You're building a wide analytics table produced by ETL and need to add lineage and audit columns so analysts can debug upstream issues. List the fields you would add (e.g., source_system, source_row_id, ingestion_timestamp, transformation_id, checksum) and describe how each field helps during investigations and reprocessing.
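
As an illustration of how these fields might be attached during ETL, the sketch below adds the five example columns to a row represented as a dictionary. The function name `add_audit_columns` and the choice of SHA-256 over a canonical JSON encoding are assumptions, not a prescribed design; the key point is that the checksum is computed over the business columns only, in a key-order-independent form.

```python
import hashlib
import json
from datetime import datetime, timezone


def add_audit_columns(row, source_system, source_row_id, transformation_id,
                      ingestion_timestamp=None):
    """Return a copy of `row` with lineage and audit columns appended."""
    # Canonical encoding: sorted keys and fixed separators, so the same
    # content always hashes to the same value regardless of key order.
    canonical = json.dumps(row, sort_keys=True, separators=(",", ":"), default=str)
    checksum = hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    if ingestion_timestamp is None:
        ingestion_timestamp = datetime.now(timezone.utc)

    return {
        **row,
        "source_system": source_system,          # where the row came from
        "source_row_id": source_row_id,          # key to look it up upstream
        "ingestion_timestamp": ingestion_timestamp.isoformat(),
        "transformation_id": transformation_id,  # job or code version that built it
        "checksum": checksum,                    # detects content changes on reload
    }


audited = add_audit_columns(
    {"order_id": 1, "amount": "9.99"},
    source_system="crm",
    source_row_id="crm-1",
    transformation_id="job-42",
)
print(audited["checksum"])
```

Comparing checksums between two loads of the same source_row_id shows whether the upstream content changed, which helps decide whether a reprocessing run needs to rewrite the row.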
