InterviewStack.io

Data Preprocessing and Handling for AI Questions

Covers the end-to-end preparation of raw data for analysis and modeling in machine learning and artificial intelligence. Topics include data collection and ingestion, data quality assessment, detecting and handling missing values through deletion or imputation, identifying and treating outliers, removing duplicates, and standardizing formats such as dates and categorical labels. Includes data type conversions, categorical variable encoding, feature scaling and normalization, standardization to zero mean and unit variance, and guidance on when each is appropriate for a given model. Covers feature engineering and selection, addressing class imbalance with sampling and weighting methods, and domain-specific preprocessing such as data augmentation for computer vision and text preprocessing for natural language processing. Emphasizes the correct order of operations, reproducible pipelines, splitting data into training, validation, and test sets, cross-validation practices, and documenting preprocessing decisions and their impact on model performance. Also explains which models are sensitive to feature scale, common pitfalls, and evaluation strategies that ensure preprocessing does not leak information.

Easy | Technical
Describe a step-by-step approach to standardize and normalize date and time data coming from multiple sources in Excel or Power BI. Cover parsing ambiguous formats (MM/DD/YYYY vs DD/MM/YYYY), timezone normalization to UTC, daylight saving considerations, and how you'd implement these steps in Power Query (or Excel functions).
Easy | Technical
Describe One-Hot Encoding, Ordinal Encoding, and Hash/Binary encodings for categorical variables. For each encoding, explain when it's appropriate, potential pitfalls (high-cardinality, implied order), how to handle unseen categories at inference, and give a short example of implementation in pandas or SQL.
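
As a starting point for an answer, the three encodings can be sketched in pandas as follows. This is a minimal illustration, not a reference solution: the column names, the `size_order` mapping, the `-1` sentinel for unseen values, and the `hash_bucket` helper are all assumptions made for the example.

```python
import hashlib

import pandas as pd

train = pd.DataFrame({"color": ["red", "green", "blue", "green"],
                      "size": ["S", "M", "L", "M"]})
# Inference data containing values that never appeared in training.
new = pd.DataFrame({"color": ["green", "purple"],
                    "size": ["L", "XL"]})

# One-hot: learn the column set on training data only, then align
# inference data to it. An unseen category becomes an all-zero row.
onehot_train = pd.get_dummies(train["color"], prefix="color", dtype=int)
onehot_new = (pd.get_dummies(new["color"], prefix="color", dtype=int)
              .reindex(columns=onehot_train.columns, fill_value=0))

# Ordinal: only for categories with a real order. The mapping is explicit,
# and unseen values get a sentinel (-1 here, an arbitrary choice).
size_order = {"S": 0, "M": 1, "L": 2}
ordinal_new = new["size"].map(size_order).fillna(-1).astype(int)

# Hash encoding: a fixed number of buckets regardless of cardinality.
# hashlib is used because Python's built-in hash() of strings varies
# between processes. Collisions between categories are possible.
def hash_bucket(value, n_buckets=8):
    digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

hashed_new = new["color"].map(hash_bucket)
```

The design point worth calling out in an answer is that the one-hot columns and the ordinal mapping are fixed from training data, so inference-time surprises are handled by an explicit rule rather than by changing the feature space.
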
Hard | Technical
Describe an algorithm or provide a code outline in Python to create stratified group cross-validation folds: groups (e.g., users) must not be split across folds, while each fold keeps an approximately equal label distribution. Mention the limitations of sklearn's GroupKFold and any libraries or heuristics you might use for better stratification.
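
One possible heuristic is a greedy assignment: place the largest groups first, and put each group into the fold where it least increases the deviation from the ideal per-class counts. The sketch below uses only the standard library; the function name `stratified_group_folds` and the toy data are hypothetical. scikit-learn 1.0 and later also ships `StratifiedGroupKFold`, which is usually the first thing to try in practice.

```python
from collections import Counter, defaultdict


def stratified_group_folds(labels, groups, n_folds=3):
    """Greedy heuristic: return one fold index per sample.

    Every sample of a group lands in the same fold; label balance
    is approximate, not guaranteed.
    """
    classes = sorted(set(labels))
    # Per-group label counts, e.g. {"u1": Counter({1: 3, 0: 1})}.
    group_counts = defaultdict(Counter)
    for y, g in zip(labels, groups):
        group_counts[g][y] += 1

    # Ideal number of samples of each class in each fold.
    total = Counter(labels)
    target = {c: total[c] / n_folds for c in classes}

    fold_counts = [Counter() for _ in range(n_folds)]
    group_to_fold = {}

    # Large groups first: they are the hardest to balance later.
    ordered = sorted(group_counts,
                     key=lambda g: (-sum(group_counts[g].values()), str(g)))
    for g in ordered:
        def delta(f):
            # Change in squared deviation from the target if g joins fold f,
            # scaled per class so rare classes are not ignored.
            return sum(
                ((fold_counts[f][c] + group_counts[g][c] - target[c]) ** 2
                 - (fold_counts[f][c] - target[c]) ** 2) / target[c]
                for c in classes
            )
        best = min(range(n_folds), key=lambda f: (delta(f), f))
        group_to_fold[g] = best
        fold_counts[best].update(group_counts[g])

    return [group_to_fold[g] for g in groups]


# Toy data: six users with four samples each and differing label mixes.
per_user = {"u1": [1, 1, 1, 0], "u2": [0, 0, 0, 1], "u3": [1, 1, 0, 0],
            "u4": [1, 1, 0, 0], "u5": [1, 0, 0, 0], "u6": [1, 1, 1, 0]}
groups = [u for u, ys in per_user.items() for _ in ys]
labels = [y for ys in per_user.values() for y in ys]
folds = stratified_group_folds(labels, groups, n_folds=3)
```

Because the assignment is greedy, the result depends on group order and can be suboptimal when a few groups dominate a class; that limitation is worth stating alongside the fact that plain `GroupKFold` ignores labels entirely.
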
Medium | Technical
Describe how preprocessing steps (imputation, scaling, encoding) should be integrated into cross-validation to avoid data leakage. Provide an example design using scikit-learn's Pipeline and ColumnTransformer and explain why computing means or encodings on the whole dataset before CV is a mistake.
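
A design along these lines could look like the sketch below. The synthetic data, the column names (`age`, `income`, `city`), and the choice of logistic regression are assumptions for illustration only. The key property is that the imputer, scaler, and encoder live inside the pipeline, so `cross_val_score` refits them on the training portion of each fold and the held-out portion never influences their statistics.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data (hypothetical columns) with some missing ages.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "age": rng.normal(40, 10, n),
    "income": rng.normal(50_000, 15_000, n),
    "city": rng.choice(["A", "B", "C"], n),
})
X.loc[rng.choice(n, 20, replace=False), "age"] = np.nan
y = (X["income"] + rng.normal(0, 10_000, n) > 50_000).astype(int)

# Numeric columns: impute with the median, then standardize.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Route each column group to its own preprocessing.
preprocess = ColumnTransformer([
    ("num", numeric, ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Every fold refits the whole pipeline on its own training split.
scores = cross_val_score(model, X, y, cv=5)
```

Computing the median, mean, or category set on the full dataset before splitting would let validation rows shape the transformation applied to training rows, which tends to make cross-validated scores optimistic.
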
Medium | Technical
Explain target (mean) encoding for categorical variables and why it can leak information. Describe at least three strategies to apply target encoding safely (e.g., out-of-fold encoding, smoothing with prior, adding noise) and outline how to implement out-of-fold target encoding in a cross-validation loop.
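
Out-of-fold encoding combined with smoothing toward a prior can be sketched as below. The function name `oof_target_encode`, the random fold assignment, and the default `smoothing` value are assumptions for the example, not a canonical recipe. Each row is encoded using statistics computed only from rows in the other folds, so a row's own target never contributes to its encoding.

```python
import numpy as np
import pandas as pd


def oof_target_encode(categories, target, n_folds=5, smoothing=10.0, seed=0):
    """Return an out-of-fold, smoothed target encoding for each row."""
    cat = pd.Series(categories).reset_index(drop=True)
    y = pd.Series(target, dtype=float).reset_index(drop=True)

    # Random but reproducible fold assignment with near-equal fold sizes.
    rng = np.random.default_rng(seed)
    fold_of_row = rng.permutation(len(cat)) % n_folds

    encoded = pd.Series(np.nan, index=cat.index, dtype=float)
    for fold in range(n_folds):
        held_out = fold_of_row == fold
        fit_cat, fit_y = cat[~held_out], y[~held_out]

        # Prior: the global target mean of the fitting rows only.
        prior = fit_y.mean()
        stats = fit_y.groupby(fit_cat).agg(["sum", "count"])
        # Shrink small categories toward the prior.
        smoothed = ((stats["sum"] + smoothing * prior)
                    / (stats["count"] + smoothing))

        # Categories absent from the fitting rows fall back to the prior.
        encoded[held_out] = (cat[held_out].map(smoothed)
                             .fillna(prior).to_numpy())
    return encoded


# A category that occurs once cannot "memorize" its own label.
cats = ["a"] * 10 + ["b"] * 10 + ["rare"]
ys = [1] * 10 + [0] * 10 + [1]
enc = oof_target_encode(cats, ys, n_folds=3, smoothing=1.0)
```

At inference time the usual companion step is to fit the same smoothed encoding once on the full training set and apply it to new data, again falling back to the prior for unseen categories.
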
