InterviewStack.io

Data Preprocessing and Handling for AI Questions

Covers the end-to-end preparation of raw data for analysis and modeling in machine learning and artificial intelligence. Topics include data collection and ingestion, data quality assessment, detecting and handling missing values with deletion or various imputation strategies, identifying and treating outliers, removing duplicates, and standardizing formats such as dates and categorical labels. Includes data type conversions, categorical variable encoding, feature scaling and normalization, standardization to zero mean and unit variance, and guidance on when each is appropriate given the choice of model. Covers feature engineering and selection, addressing class imbalance with sampling and weighting methods, and domain-specific preprocessing such as data augmentation for computer vision and text preprocessing for natural language processing. Emphasizes the correct order of operations, reproducible pipelines, splitting data into training, validation, and test sets, cross-validation practices, and documenting preprocessing decisions and their impact on model performance. Also explains which models are sensitive to feature scale, common pitfalls, and evaluation strategies to ensure preprocessing does not leak information.
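One leakage pitfall mentioned above deserves a concrete illustration: standardization statistics must be computed on the training split only and then applied unchanged to the test split. A minimal NumPy sketch (function name and data are illustrative, not from any question):

```python
import numpy as np

def standardize_train_test(X_train, X_test):
    """Standardize to zero mean and unit variance using statistics
    computed on the training split ONLY, so no information from the
    test set leaks into preprocessing."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma = np.where(sigma == 0, 1.0, sigma)  # guard constant columns
    return (X_train - mu) / sigma, (X_test - mu) / sigma

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X_tr, X_te = X[:80], X[80:]
Z_tr, Z_te = standardize_train_test(X_tr, X_te)
```

Fitting the scaler on the full dataset before splitting would let test-set statistics influence the transform, which is exactly the kind of leak the evaluation strategies above are meant to catch.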

Easy · Technical
Compare one-hot encoding and ordinal (label) encoding for categorical variables. Explain the implications of each encoding choice on different model families (linear/logistic regression, tree ensembles, distance-based models) and discuss practical rules for BI analysts when cardinality is small, moderate, or very large (>1000 unique values).
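A short pandas sketch of the two encodings the question contrasts (the `colors` data is illustrative):

```python
import pandas as pd

colors = pd.Series(["red", "green", "blue", "green"], name="color")

# One-hot: one binary column per category. No artificial ordering,
# so it is safe for linear and distance-based models, but column
# count grows with cardinality.
one_hot = pd.get_dummies(colors, prefix="color")

# Ordinal/label: a single integer column. Compact even at high
# cardinality, but it imposes an order (blue < green < red here)
# that linear and distance-based models will treat as meaningful;
# tree ensembles, which only split on thresholds, are less affected.
ordinal = colors.astype("category").cat.codes
```

At very high cardinality (>1000 values), one-hot becomes impractically wide, which is why the question asks for cardinality-dependent rules.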
Medium · Technical
Write a Python function using pandas: impute_group_median(df, group_col, target_col) that fills missing values in target_col with the median computed within each group of group_col. The function should leave rows with missing group_col untouched and fill groups with no median using the global median. Provide the implementation and explain edge cases and complexity.
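One possible implementation sketch of the function this question asks for, satisfying the stated edge cases (NaN groups untouched, all-NaN groups falling back to the global median); overall cost is dominated by the groupby, roughly O(n log n):

```python
import pandas as pd

def impute_group_median(df, group_col, target_col):
    """Fill NaNs in target_col with the median computed within each
    group of group_col. Rows where group_col itself is NaN are left
    untouched; groups whose median is undefined (all values NaN)
    fall back to the global median of target_col."""
    out = df.copy()
    global_median = out[target_col].median()
    group_medians = out.groupby(group_col)[target_col].median()
    # Map each row's group to its median; an all-NaN group maps to
    # NaN and is replaced by the global median.
    fill = out[group_col].map(group_medians).fillna(global_median)
    mask = out[target_col].isna() & out[group_col].notna()
    out.loc[mask, target_col] = fill[mask]
    return out
```

Note that `df.groupby` drops NaN group keys by default, which is what lets rows with a missing group pass through unchanged.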
Hard · Technical
You are coordinating a computer-vision defect-detection project with limited images per defect class. Propose augmentation strategies that preserve defect labels and improve generalization: geometric transforms (rotations, flips), photometric changes (brightness, contrast), synthetic oversampling, and class-aware augmentations. Explain how to evaluate that augmentations are not introducing label noise and how to integrate augmented data into BI dashboards (e.g., defect rate over time).
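A minimal NumPy sketch of two of the label-preserving augmentations named above, a geometric flip and a photometric brightness shift (the `augment` helper and its parameters are illustrative assumptions, not a prescribed pipeline):

```python
import numpy as np

rng = np.random.default_rng(42)

def augment(image, rng):
    """Apply simple label-preserving augmentations to an HxWxC uint8
    image: a random horizontal flip (geometric) and a random brightness
    shift (photometric). Both leave the defect label unchanged as long
    as the defect is orientation- and illumination-invariant."""
    out = image
    if rng.random() < 0.5:
        out = out[:, ::-1, :]              # horizontal flip
    shift = int(rng.integers(-30, 31))     # brightness shift in [-30, 30]
    out = np.clip(out.astype(np.int16) + shift, 0, 255).astype(np.uint8)
    return out

img = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
aug = augment(img, rng)
```

The clip-and-cast keeps pixel values valid after the brightness shift; checking that augmented images still receive the same human label is one simple test for the label-noise concern the question raises.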
Medium · Technical
Compare common outlier detection techniques useful in BI: IQR-based capping/trimming, z-score and robust z-score, Winsorization, Isolation Forest, Local Outlier Factor (LOF), and DBSCAN-based anomaly detection. For each method describe assumptions about distribution, sensitivity to sample size, computational cost, and scenarios where it is preferable for reporting vs modeling.
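A sketch of the two cheapest methods from the list, IQR fences and the MAD-based robust z-score; both are distribution-free and, unlike the classic mean/std z-score, not distorted by the outliers they are trying to find (the example data is illustrative):

```python
import numpy as np

def iqr_outliers(x, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR]; the usual basis
    for box-plot style capping/trimming in BI reports."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)

def robust_zscore_outliers(x, threshold=3.5):
    """Flag points whose modified z-score (median/MAD based) exceeds
    the threshold; far less sensitive to the outliers themselves than
    the classic mean/std z-score."""
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    if mad == 0:
        return np.zeros_like(x, dtype=bool)
    modified_z = 0.6745 * (x - med) / mad
    return np.abs(modified_z) > threshold

x = np.array([10.0, 11.0, 10.5, 9.8, 10.2, 10.1, 50.0])
```

Isolation Forest, LOF, and DBSCAN handle multivariate structure these univariate rules miss, at correspondingly higher computational cost.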
Medium · System Design
Explain how tools and practices like dbt for SQL transformations, MLflow/DVC for artifact/version tracking, and feature stores help ensure reproducible preprocessing pipelines in a BI context. Propose a minimal reproducibility stack (tools + processes) for a small team producing weekly model-backed dashboards and justify each component's role.
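At a tiny scale, the core idea behind DVC/MLflow-style tracking can be sketched as a manifest that pins both the preprocessing configuration and a content hash of the input data, so a weekly dashboard run can be traced back to (and re-run against) the exact same inputs and settings. The function and field names below are illustrative assumptions, not any tool's API:

```python
import hashlib
import json

def preprocessing_manifest(params, data_bytes):
    """Return a record tying a preprocessing run to its exact inputs:
    a SHA-256 hash of the raw data bytes plus a hash of the
    canonicalized parameter dict. Equal hashes mean a reproducible run."""
    return {
        "params": params,
        "data_sha256": hashlib.sha256(data_bytes).hexdigest(),
        "params_sha256": hashlib.sha256(
            json.dumps(params, sort_keys=True).encode()
        ).hexdigest(),
    }

manifest = preprocessing_manifest(
    {"impute": "group_median", "scale": "standard", "seed": 7},
    b"raw,csv,bytes\n1,2,3\n",
)
```

Storing such a manifest alongside each dashboard refresh is the process half of the stack; the tools named in the question automate the same bookkeeping at scale.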
