InterviewStack.io

Data Preprocessing and Handling for AI Questions

Covers the end-to-end preparation of raw data for analysis and modeling in machine learning and artificial intelligence. Topics include data collection and ingestion, data quality assessment, detecting and handling missing values through deletion or imputation, identifying and treating outliers, removing duplicates, and standardizing formats such as dates and categorical labels. Includes data type conversions, categorical variable encoding, feature scaling and normalization, standardization to zero mean and unit variance, and guidance on when each is appropriate for a given model. Covers feature engineering and selection, addressing class imbalance with sampling and weighting methods, and domain-specific preprocessing such as data augmentation for computer vision and text preprocessing for natural language processing. Emphasizes the correct order of operations, reproducible pipelines, splitting data into training, validation, and test sets, cross-validation practices, and documenting preprocessing decisions and their impact on model performance. Also explains which models are sensitive to feature scale, common pitfalls, and evaluation strategies that ensure preprocessing does not leak information from validation or test data into training.
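As a minimal sketch of the leakage point above, the example below standardizes a feature to zero mean and unit variance using statistics learned from the training split only, then reuses those statistics on the test split. The function names and the sample values are hypothetical and chosen only for illustration; a real pipeline would more likely use a library transformer.

```python
from statistics import mean, pstdev


def fit_standardizer(train_values):
    """Learn the mean and standard deviation from the training split only."""
    mu = mean(train_values)
    sigma = pstdev(train_values)
    if sigma == 0:
        # A constant feature would otherwise cause a division by zero.
        sigma = 1.0
    return mu, sigma


def standardize(values, mu, sigma):
    """Apply previously fitted statistics; never refit on validation or test data."""
    return [(v - mu) / sigma for v in values]


# Hypothetical feature values, split BEFORE any statistics are computed.
train = [10.0, 12.0, 14.0, 16.0, 18.0]
test = [20.0, 8.0]

mu, sigma = fit_standardizer(train)
train_scaled = standardize(train, mu, sigma)
test_scaled = standardize(test, mu, sigma)  # uses training statistics only
```

Fitting on the full dataset before splitting would let test values influence the mean and standard deviation, which is the kind of leak the description warns about.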

Easy / Technical
83 practiced
Compare one-hot encoding and ordinal (label) encoding for categorical variables. Explain the implications of each encoding choice on different model families (linear/logistic regression, tree ensembles, distance-based models) and discuss practical rules for BI analysts when cardinality is small, moderate, or very large (>1000 unique values).
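To make the comparison concrete, here is a small sketch of both encodings in plain Python. The helper names and the sample columns are hypothetical. Ordinal encoding assumes the caller supplies a meaningful order; one-hot encoding produces one 0/1 column per distinct value, which is why it becomes impractical at very high cardinality.

```python
def ordinal_encode(values, order):
    """Map each category to its position in an explicit, meaningful order."""
    index = {category: i for i, category in enumerate(order)}
    return [index[v] for v in values]


def one_hot_encode(values):
    """Return (column_names, rows) with one 0/1 column per distinct category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


# Hypothetical columns: "size" has a natural order, "region" does not.
sizes = ["small", "large", "medium", "small"]
regions = ["east", "west", "north", "east"]

size_codes = ordinal_encode(sizes, order=["small", "medium", "large"])
region_columns, region_rows = one_hot_encode(regions)
```

Applying the ordinal helper to an unordered column such as region would impose an arbitrary ranking, which linear and distance-based models would treat as real magnitude; tree ensembles are usually less affected because they split on thresholds.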
Medium / Technical
89 practiced
Multiple upstream systems provide timestamped events in their local timezones. As a BI analyst designing ingestion and storage, describe a robust strategy: which timezone to store in (UTC vs local), whether to keep the original timezone metadata, how to implement day/week/month aggregations by each user's local time, and how to handle daylight saving changes and historical timezone rule changes.
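One possible approach, sketched below, is to store each event in UTC alongside the IANA zone name it came from, and to convert to the user's zone only at aggregation time. The sketch uses Python's standard zoneinfo module, which relies on the system or tzdata timezone database being installed. The record layout and function names are assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def to_storage_record(local_naive, tz_name):
    """Convert a source-local wall-clock time to UTC, keeping the zone name."""
    aware = local_naive.replace(tzinfo=ZoneInfo(tz_name))
    return {"occurred_at_utc": aware.astimezone(timezone.utc), "source_tz": tz_name}


def local_date(utc_dt, user_tz):
    """Calendar date of a UTC instant as seen in the user's timezone."""
    return utc_dt.astimezone(ZoneInfo(user_tz)).date()


# Hypothetical events around the 2024-03-10 US daylight saving transition.
events_utc = [
    datetime(2024, 3, 10, 4, 30, tzinfo=timezone.utc),  # 23:30 on Mar 9 (UTC-5)
    datetime(2024, 3, 10, 5, 30, tzinfo=timezone.utc),  # 00:30 on Mar 10 (UTC-5)
    datetime(2024, 3, 11, 3, 30, tzinfo=timezone.utc),  # 23:30 on Mar 10 (UTC-4)
    datetime(2024, 3, 11, 4, 30, tzinfo=timezone.utc),  # 00:30 on Mar 11 (UTC-4)
]
daily_counts = Counter(local_date(e, "America/New_York") for e in events_utc)

record = to_storage_record(datetime(2024, 3, 10, 1, 30), "America/New_York")
```

Storing a zone name rather than a fixed offset matters here: with a fixed UTC-5 offset the last event would be counted on March 10 instead of March 11. Keeping the zone name also allows local times to be recomputed if timezone rules are later revised.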
Medium / System Design
67 practiced
Design an ETL pipeline to transform raw clickstream JSON logs into analytics-ready tables for BI: sessionized_events(session_id, user_id, session_start, session_end), daily_active_users, and event_counts. Describe ingestion (Kafka or file-based), processing framework (Spark, Beam), sessionization logic (e.g., 30-minute inactivity), idempotency design, partitioning strategy, schema evolution handling, and how to expose materialized tables to BI tools (e.g., BigQuery, Redshift, data lakehouse).
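The sessionization step can be sketched in plain Python, independent of Spark or Beam. This is an illustration under stated assumptions, not a production design: events are (user_id, timestamp) tuples, a gap of more than 30 minutes starts a new session, and session_id is derived deterministically from the user and session start so that reprocessing the same input yields the same identifiers. The function name and hashing scheme are hypothetical.

```python
import hashlib
from datetime import datetime, timedelta

INACTIVITY_GAP = timedelta(minutes=30)


def sessionize(events):
    """Group (user_id, timestamp) events into sessions split by inactivity gaps."""
    sessions = []
    open_sessions = {}  # user_id -> that user's most recent session
    for user_id, ts in sorted(events):
        session = open_sessions.get(user_id)
        if session is not None and ts - session["session_end"] <= INACTIVITY_GAP:
            session["session_end"] = ts  # still active: extend the session
            continue
        # Deterministic id: the same input always produces the same session_id.
        key = f"{user_id}|{ts.isoformat()}".encode()
        session = {
            "session_id": hashlib.sha256(key).hexdigest()[:16],
            "user_id": user_id,
            "session_start": ts,
            "session_end": ts,
        }
        open_sessions[user_id] = session
        sessions.append(session)
    return sessions


# Hypothetical clickstream events.
events = [
    ("u1", datetime(2024, 1, 1, 9, 0)),
    ("u1", datetime(2024, 1, 1, 9, 20)),
    ("u1", datetime(2024, 1, 1, 10, 5)),  # 45-minute gap: new session
    ("u2", datetime(2024, 1, 1, 9, 10)),
]
sessions = sessionize(events)
```

One caveat of this scheme is that a late-arriving event earlier than the recorded session_start would change the derived id, so a real pipeline would need a policy for late data, for example reprocessing whole partitions.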
Medium / Technical
64 practiced
Given a PostgreSQL table 'transactions(transaction_id PK, user_id INT, amount NUMERIC, occurred_at TIMESTAMP)', write a SQL query that flags each transaction as an outlier when amount > mean + 3 * stddev for that user over the prior 365 days. Assume the table is large: explain performance considerations, indexes and partitions, and how to handle users with fewer than 5 historical transactions (choose a reasonable behavior).
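The question asks for SQL, but the flagging rule itself can be checked against a small reference implementation. The Python sketch below is that reference, not the requested query. It assumes the sample standard deviation, excludes the current transaction from its own history, and returns None (undetermined) when a user has fewer than 5 prior transactions in the window; all three are assumptions, as are the function and constant names. It is quadratic per user and is meant only for validating results on small samples.

```python
from datetime import datetime, timedelta
from statistics import mean, stdev

WINDOW = timedelta(days=365)
MIN_HISTORY = 5  # assumed threshold below which no flag is produced


def flag_outliers(transactions):
    """transactions: iterable of (transaction_id, user_id, amount, occurred_at)."""
    by_user = {}
    for tx in sorted(transactions, key=lambda t: (t[1], t[3])):
        by_user.setdefault(tx[1], []).append(tx)

    flags = {}
    for rows in by_user.values():
        for i, (tx_id, _, amount, occurred_at) in enumerate(rows):
            # Strictly earlier transactions within the prior 365 days.
            history = [
                r[2] for r in rows[:i]
                if occurred_at - WINDOW <= r[3] < occurred_at
            ]
            if len(history) < MIN_HISTORY:
                flags[tx_id] = None  # not enough history to judge
            else:
                flags[tx_id] = amount > mean(history) + 3 * stdev(history)
    return flags


# Hypothetical data: five ordinary transactions, one spike, one ordinary.
amounts = [10.0, 12.0, 11.0, 9.0, 13.0, 50.0, 12.0]
txs = [
    (i + 1, 1, amount, datetime(2024, 1, i + 1))
    for i, amount in enumerate(amounts)
]
flags = flag_outliers(txs)
```

Note that once the spike enters the history it inflates both the mean and the standard deviation for later transactions, which is a known weakness of mean-based thresholds.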
Medium / Technical
72 practiced
For categorical variables with missing values, compare the common imputation options: filling with the mode, creating a dedicated 'MISSING' category, model-based imputation, and using explicit missingness indicator columns. Discuss pros/cons for predictive modeling and for BI reporting, and state when missingness itself may be informative and should be modeled explicitly.
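Three of the options named above can be sketched in a few lines of plain Python. Missing values are represented as None, and the helper names and sample column are hypothetical. Model-based imputation is omitted because it depends on the choice of model.

```python
from collections import Counter


def fill_with_mode(values):
    """Replace missing entries with the most frequent observed category."""
    observed = [v for v in values if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]


def fill_with_missing_category(values, label="MISSING"):
    """Treat missingness as its own category."""
    return [label if v is None else v for v in values]


def missing_indicator(values):
    """Separate 0/1 column recording where the value was missing."""
    return [1 if v is None else 0 for v in values]


# Hypothetical subscription-plan column with two missing entries.
plan = ["basic", None, "pro", "basic", None]

plan_mode = fill_with_mode(plan)
plan_missing = fill_with_missing_category(plan)
plan_was_missing = missing_indicator(plan)
```

Mode filling hides the fact that a value was absent and inflates the dominant category in reports, whereas the dedicated category and the indicator column both preserve that signal for cases where missingness itself is informative.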
