InterviewStack.io

Data Preprocessing and Handling for AI Questions

Covers the end-to-end preparation of raw data for analysis and modeling in machine learning and artificial intelligence. Topics include data collection and ingestion; data quality assessment; detecting and handling missing values via deletion or various imputation strategies; identifying and treating outliers; removing duplicates; and standardizing formats such as dates and categorical labels. Also includes data type conversions, categorical variable encoding, feature scaling and normalization, standardization to zero mean and unit variance, and guidance on when each is appropriate given the model choice. Covers feature engineering and selection, addressing class imbalance with sampling and weighting methods, and domain-specific preprocessing such as data augmentation for computer vision and text preprocessing for natural language processing. Emphasizes the correct order of operations, reproducible pipelines, splitting data into training, validation, and test sets, cross-validation practices, and documenting preprocessing decisions and their impact on model performance. Also explains which models are sensitive to feature scale, common pitfalls, and evaluation strategies to ensure preprocessing does not leak information.

Medium · Technical
108 practiced
Describe strategies to handle high-cardinality categorical variables (thousands of unique product SKUs) differently for tree-based models and linear/logistic models. Cover frequency encoding, target (mean) encoding with CV and smoothing, hashing trick, and learned embeddings. Discuss leakage risks and when each strategy is preferable in a BI setting.
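Two of the strategies this question references can be sketched in a few lines. The following is a minimal illustration (function names and data are my own, not from the source): frequency encoding replaces each category with its relative frequency, and the hashing trick maps a raw string to one of a fixed number of buckets without any fitted vocabulary.

```python
import zlib

def frequency_encode(values):
    """Replace each category with its relative frequency in the data.

    Compact single column; works well for tree models, which can split
    on the frequency value, but discards the category identity."""
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    n = len(values)
    return [counts[v] / n for v in values]

def hash_encode(value, n_buckets=1024):
    """Hashing trick: deterministic bucket index for any string.

    No vocabulary to store or leak, handles unseen SKUs at inference,
    but distinct categories can collide into the same bucket."""
    return zlib.crc32(value.encode("utf-8")) % n_buckets

# Example: four rows of a categorical column
freqs = frequency_encode(["a", "a", "b", "c"])   # [0.5, 0.5, 0.25, 0.25]
bucket = hash_encode("SKU-12345")                # stable index in [0, 1024)
```

Note that frequency counts should be computed on the training split only; computing them over the full dataset before splitting is a mild form of leakage.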
Medium · Technical
76 practiced
Compare common outlier detection techniques useful in BI: IQR-based capping/trimming, z-score and robust z-score, Winsorization, Isolation Forest, Local Outlier Factor (LOF), and DBSCAN-based anomaly detection. For each method describe assumptions about distribution, sensitivity to sample size, computational cost, and scenarios where it is preferable for reporting vs modeling.
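Two of the lighter-weight techniques from this question, sketched with the standard library only (function names are illustrative): IQR-based capping assumes nothing about distribution shape beyond ordered quantiles, while the robust z-score uses the median and MAD so a few extreme points cannot inflate the scale estimate the way they inflate a classical z-score.

```python
import statistics

def iqr_cap(values, k=1.5):
    """Cap values to [Q1 - k*IQR, Q3 + k*IQR] (Winsorization-style).

    Distribution-free; suitable for reporting, where dropping rows
    would distort totals."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, lo), hi) for v in values]

def robust_z(values):
    """Robust z-score: (x - median) / (1.4826 * MAD).

    The 1.4826 factor makes MAD consistent with the standard deviation
    under normality; flag |z| > 3 as a candidate outlier."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    return [(v - med) / (1.4826 * mad) for v in values]

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]
capped = iqr_cap(data)       # the 1000 is pulled down to the upper fence
scores = robust_z(data)      # only the last score exceeds 3
```

Model-based detectors (Isolation Forest, LOF, DBSCAN) need a fitted estimator and scale with sample size, which is the cost that buys them multivariate sensitivity.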
Hard · System Design
80 practiced
You must preprocess a 1TB table in a cloud data warehouse (Snowflake/BigQuery) to perform imputations, encodings and aggregate joins before training models in Python. Propose strategies to minimize data egress and compute cost: SQL pushdown operations, materialized views, partitioning, sampling, and integrating warehouse compute with Python training environments (e.g., using BigQuery ML, dbt, or query-export pipelines). Discuss pros/cons and reproducibility implications.
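One pushdown idea from this question, as a hedged sketch: build the imputation and encoding into a single SQL statement executed inside the warehouse, so only the finished training frame is exported. The table and column names below are invented for illustration; `COALESCE` plus a window `AVG` and `CASE`-based one-hot columns are standard SQL supported by both Snowflake and BigQuery.

```python
def pushdown_preprocess_sql(table, numeric_col, cat_col, top_k_cats):
    """Build a warehouse-side preprocessing query: mean-impute a numeric
    column and one-hot encode the top-k categories, so the heavy work
    (and the 1TB scan) never leaves the warehouse."""
    onehots = ",\n  ".join(
        f"CASE WHEN {cat_col} = '{c}' THEN 1 ELSE 0 END AS {cat_col}_{c}"
        for c in top_k_cats
    )
    return (
        f"SELECT\n"
        f"  COALESCE({numeric_col}, AVG({numeric_col}) OVER ()) "
        f"AS {numeric_col}_imputed,\n"
        f"  {onehots}\n"
        f"FROM {table}"
    )

sql = pushdown_preprocess_sql("sales", "revenue", "region", ["EU", "US"])
```

For reproducibility, the generated SQL should be version-controlled (e.g. as a dbt model) rather than assembled ad hoc, so the exact preprocessing applied to any training run can be replayed.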
Easy · Technical
83 practiced
Compare one-hot encoding and ordinal (label) encoding for categorical variables. Explain the implications of each encoding choice on different model families (linear/logistic regression, tree ensembles, distance-based models) and discuss practical rules for BI analysts when cardinality is small, moderate, or very large (>1000 unique values).
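The two encodings being compared, in a minimal dependency-free sketch (function names are illustrative). Ordinal encoding yields one integer column but imposes an ordering that linear and distance-based models will treat as meaningful; one-hot encoding avoids that at the cost of one column per category, which is why it stops being practical at very high cardinality.

```python
def ordinal_encode(values):
    """Map categories to integers in first-seen order.

    Compact, but the integers imply an ordering ('red' < 'blue') that
    linear and distance-based models will exploit; tree ensembles are
    largely indifferent to the arbitrary order."""
    mapping = {}
    return [mapping.setdefault(v, len(mapping)) for v in values]

def one_hot_encode(values):
    """One 0/1 column per category, in sorted order.

    No implied ordering, but width grows with cardinality, which is
    why it breaks down past roughly ~1000 unique values."""
    cats = sorted(set(values))
    rows = [[1 if v == c else 0 for c in cats] for v in values]
    return rows, cats

ords = ordinal_encode(["red", "blue", "red"])       # [0, 1, 0]
rows, cats = one_hot_encode(["b", "a", "b"])        # cats = ["a", "b"]
```

In practice the category-to-index mapping must be fitted on the training split and reused unchanged on validation and test data, with an explicit policy for unseen categories.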
Hard · Technical
76 practiced
For a high-cardinality categorical field used in logistic regression, present a concrete, leakage-safe target encoding approach using K-fold out-of-fold encoding with smoothing. Provide pseudocode or Python outline that computes encodings only from training folds, applies them to validation within CV, and describes how to encode the held-out test set. Explain smoothing and prior selection.
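The question asks for a Python outline; here is one possible sketch (all names are mine, and the simple striped folds stand in for a proper shuffled K-fold). Each row is encoded using target statistics computed only from the other folds, shrunk toward the global prior with `enc = (count * mean + smoothing * prior) / (count + smoothing)`, so rare categories fall back toward the prior rather than memorizing their own labels.

```python
def oof_target_encode(categories, targets, n_folds=5, smoothing=10.0):
    """Out-of-fold target encoding with additive smoothing.

    For each fold, category means are computed from the OTHER folds only,
    so no row's encoding ever sees its own target (leakage-safe within CV).
    Folds here are deterministic stripes for clarity; in practice, shuffle
    indices (or reuse your CV splitter) before striping."""
    n = len(categories)
    folds = [list(range(i, n, n_folds)) for i in range(n_folds)]
    prior = sum(targets) / n  # global target mean = the smoothing prior
    encoded = [0.0] * n
    for fold in folds:
        holdout = set(fold)
        sums, counts = {}, {}
        for i in range(n):
            if i not in holdout:  # fit statistics on training folds only
                sums[categories[i]] = sums.get(categories[i], 0.0) + targets[i]
                counts[categories[i]] = counts.get(categories[i], 0) + 1
        for i in fold:
            c = counts.get(categories[i], 0)
            mean = sums.get(categories[i], 0.0) / c if c else prior
            # shrink the fold-level mean toward the prior
            encoded[i] = (c * mean + smoothing * prior) / (c + smoothing)
    return encoded, prior

cats = ["a"] * 10 + ["b"] * 10
y = [1] * 10 + [0] * 10
enc, prior = oof_target_encode(cats, y, n_folds=2, smoothing=1.0)
```

To encode a held-out test set, recompute the smoothed statistics once from the *entire* training set (no folds needed, since test targets are never touched) and apply that single mapping; larger `smoothing` pulls harder toward the prior, which helps when many categories are rare.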
