InterviewStack.io

Scikit Learn, Pandas, and NumPy Usage Questions

Practical proficiency with these core libraries. Pandas: DataFrames, data manipulation, handling missing values. NumPy: arrays, vectorized operations, mathematical functions. Scikit-learn: preprocessing, model fitting, evaluation metrics, pipelines. Knowing standard patterns and APIs. Writing efficient, readable code using these libraries.

Hard, Technical
112 practiced
Write pytest unit tests for a pandas transformation function transform(df) that imputes missing numeric values with median, encodes a categorical column with one-hot encoding, and scales numerics. Provide sample fixture DataFrames covering normal cases and edge cases (all-NaN column, unseen category) and examples of assertions you would make (shapes, column names, non-null numeric columns).
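One possible shape for these tests is sketched below. Because the question does not supply `transform`, the block includes a hypothetical pandas-only reference implementation so the tests can run. The column names (`age`, `income`, `plan`), the known category list, and the edge-case behavior (an all-NaN column is filled with 0.0, an unseen category encodes as all zeros) are assumptions made for illustration, not requirements from the question.

```python
import numpy as np
import pandas as pd
import pytest

# Assumed schema; none of these names come from the question.
NUMERIC_COLS = ["age", "income"]
CATEGORY_COL = "plan"
KNOWN_CATEGORIES = ["basic", "pro"]


def transform(df):
    """Hypothetical reference implementation so the tests have a target."""
    out = df.copy()
    for col in NUMERIC_COLS:
        values = out[col].astype(float)
        median = values.median()
        # Assumption: an all-NaN column has no median, so fill with 0.0.
        values = values.fillna(0.0 if pd.isna(median) else median)
        std = values.std(ddof=0)
        # Guard against zero variance to avoid dividing by zero.
        out[col] = (values - values.mean()) / (std if std > 0 else 1.0)
    # Fixing the category list keeps output columns stable; unseen
    # categories become missing and therefore encode as all zeros.
    plan = pd.Categorical(out.pop(CATEGORY_COL), categories=KNOWN_CATEGORIES)
    dummies = pd.get_dummies(plan, prefix=CATEGORY_COL, dtype=float)
    dummies.index = out.index
    return pd.concat([out, dummies], axis=1)


@pytest.fixture
def normal_df():
    return pd.DataFrame({
        "age": [20.0, np.nan, 40.0],
        "income": [100.0, 200.0, np.nan],
        "plan": ["basic", "pro", "basic"],
    })


@pytest.fixture
def all_nan_df(normal_df):
    return normal_df.assign(income=np.nan)


@pytest.fixture
def unseen_category_df(normal_df):
    return normal_df.assign(plan=["basic", "enterprise", "pro"])


def test_shape_and_columns(normal_df):
    result = transform(normal_df)
    assert result.shape == (3, 4)
    assert list(result.columns) == ["age", "income", "plan_basic", "plan_pro"]


def test_numeric_columns_non_null_and_centered(normal_df):
    result = transform(normal_df)
    assert result[NUMERIC_COLS].notna().all().all()
    assert np.allclose(result[NUMERIC_COLS].mean(), 0.0)


def test_all_nan_column_is_filled(all_nan_df):
    result = transform(all_nan_df)
    assert (result["income"] == 0.0).all()


def test_unseen_category_encodes_as_zeros(unseen_category_df):
    result = transform(unseen_category_df)
    assert result.loc[1, ["plan_basic", "plan_pro"]].sum() == 0.0


def test_input_is_not_mutated(normal_df):
    before = normal_df.copy()
    transform(normal_df)
    pd.testing.assert_frame_equal(normal_df, before)
```

Deriving the edge-case fixtures from `normal_df` keeps each case to a single changed column, so a failing test points directly at the behavior that broke.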
Easy, Technical
65 practiced
Given a transactions DataFrame with columns ['user_id', 'transaction_id', 'amount', 'transaction_date'], write pandas code to compute for each user:
- total_amount
- transaction_count
- average_amount
- most_recent_transaction_date
Return a DataFrame indexed by user_id with these columns. Provide an explanation of any aggregation choices.
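A minimal sketch of one way to answer this uses named aggregation on a `groupby`. The sample rows are invented for illustration; only the column names come from the question.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
transactions = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2],
    "transaction_id": [101, 102, 103, 104, 105],
    "amount": [10.0, 30.0, 5.0, 15.0, 40.0],
    "transaction_date": pd.to_datetime(
        ["2024-01-05", "2024-02-01", "2024-01-10", "2024-01-20", "2024-03-02"]
    ),
})

# Named aggregation: each keyword is an output column, each value is a
# (source column, aggregation function) pair.
user_summary = transactions.groupby("user_id").agg(
    total_amount=("amount", "sum"),
    transaction_count=("transaction_id", "count"),
    average_amount=("amount", "mean"),
    most_recent_transaction_date=("transaction_date", "max"),
)

print(user_summary)
```

Here `count` tallies non-null `transaction_id` values per user. If transaction IDs could repeat, `nunique` may be the safer choice, and `mean` skips missing amounts by default, which is worth stating as part of the explanation.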
Medium, Technical
67 practiced
Discuss trade-offs between one-hot encoding and target encoding for a categorical feature with high cardinality (~10,000 categories). In what situations is target encoding appropriate? How do you avoid target leakage when applying target encoding in cross-validation or production?
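One common way to limit leakage is out-of-fold encoding: each row is encoded using target statistics computed only from the other folds. The sketch below illustrates that idea with a smoothed mean. The helper name `oof_target_encode`, its parameters, and the sample data are hypothetical. Recent scikit-learn releases also include a `TargetEncoder` in `sklearn.preprocessing`, which may be preferable in practice.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


def oof_target_encode(cat, y, n_splits=5, smoothing=10.0, seed=0):
    """Out-of-fold smoothed mean encoding (hypothetical helper)."""
    cat = pd.Series(cat).reset_index(drop=True)
    y = pd.Series(y, dtype=float).reset_index(drop=True)
    encoded = pd.Series(np.nan, index=cat.index, dtype=float)

    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kfold.split(cat):
        # Statistics come only from rows outside the fold being encoded.
        prior = y.iloc[fit_idx].mean()
        stats = y.iloc[fit_idx].groupby(cat.iloc[fit_idx]).agg(["sum", "count"])
        # Shrink each category mean toward the prior; rare categories
        # end up close to the overall mean.
        smoothed = (stats["sum"] + smoothing * prior) / (stats["count"] + smoothing)
        # Categories unseen in the fitting folds fall back to the prior.
        fold_values = cat.iloc[enc_idx].map(smoothed).fillna(prior)
        encoded.iloc[enc_idx] = fold_values.to_numpy()
    return encoded


# Invented sample data for illustration.
categories = ["a", "a", "b", "b", "c", "c", "a", "b", "c", "rare"]
target = [1, 0, 1, 1, 0, 0, 1, 1, 0, 1]
encoded = oof_target_encode(categories, target, n_splits=5)
print(encoded)
```

In this sketch the single `rare` row never sees its own label, so it receives the prior from the other folds. For production scoring, the usual approach is to fit the category statistics once on the full training set and apply that fixed mapping to new data, with the same fallback for unseen categories.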
Medium, Technical
117 practiced
Design a scikit-learn pipeline using ColumnTransformer to handle a dataset with mixed features: numeric_cols = ['age','income'], categorical_cols = ['region','plan'], with missing values. Pipeline should:
- Impute numeric with median and scale
- Impute categorical with 'missing' and OneHotEncode (handle_unknown='ignore')
- Fit a RandomForestClassifier on processed features
Provide complete Python code for the pipeline.
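A minimal sketch of such a pipeline follows. The column lists come from the question; the toy rows, the choice of `StandardScaler` for scaling, and the forest settings are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["age", "income"]
categorical_cols = ["region", "plan"]

# Numeric branch: median imputation, then standardization (assumed scaler).
numeric_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical branch: fill gaps with the literal 'missing', then one-hot
# encode; categories unseen at fit time encode as all zeros.
categorical_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="missing")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocessor = ColumnTransformer([
    ("num", numeric_pipeline, numeric_cols),
    ("cat", categorical_pipeline, categorical_cols),
])

model = Pipeline([
    ("preprocess", preprocessor),
    ("classifier", RandomForestClassifier(n_estimators=50, random_state=0)),
])

# Invented toy data with missing values in every column.
X = pd.DataFrame({
    "age": [25, np.nan, 47, 33, 52, 41],
    "income": [40000, 52000, np.nan, 61000, 58000, 45000],
    "region": pd.Series(["north", "south", np.nan, "north", "east", "south"], dtype=object),
    "plan": pd.Series(["basic", "pro", "basic", np.nan, "pro", "basic"], dtype=object),
})
y = np.array([0, 1, 0, 1, 1, 0])

model.fit(X, y)

# A new row with a missing income and a region never seen during fit.
X_new = pd.DataFrame({
    "age": [30],
    "income": [np.nan],
    "region": pd.Series(["west"], dtype=object),
    "plan": pd.Series(["pro"], dtype=object),
})
predictions = model.predict(X_new)
print(predictions)
```

Keeping imputation, scaling, and encoding inside the pipeline means their statistics are learned only from the data passed to `fit`, so the same object can be used in cross-validation without leaking information from held-out rows.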
Medium, Technical
64 practiced
Implement a custom scikit-learn Transformer named InteractionFeatures that takes a list of numeric columns and returns their pairwise product interactions (e.g., x1*x2, x1*x3...). The transformer should implement fit/transform, be compatible with pandas DataFrames preserving column names, and integrate into a Pipeline. Provide Python code.
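One way this transformer could look is sketched below. It assumes the input is a pandas DataFrame and that the interaction columns are appended to the original columns rather than returned alone; the question leaves that choice open. The `left*right` naming scheme and the sample data are also assumptions.

```python
from itertools import combinations

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline


class InteractionFeatures(BaseEstimator, TransformerMixin):
    """Append pairwise products of the given numeric columns (sketch)."""

    def __init__(self, columns):
        # Store the argument unchanged so scikit-learn can clone the object.
        self.columns = columns

    def fit(self, X, y=None):
        missing = [col for col in self.columns if col not in X.columns]
        if missing:
            raise ValueError(f"Columns not found in X: {missing}")
        # Learned attribute: every unordered pair of the requested columns.
        self.pairs_ = list(combinations(self.columns, 2))
        return self

    def transform(self, X):
        out = X.copy()  # leave the caller's DataFrame untouched
        for left, right in self.pairs_:
            out[f"{left}*{right}"] = X[left] * X[right]
        return out


# Invented sample data for illustration.
df = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0],
    "x2": [4.0, 5.0, 6.0],
    "x3": [7.0, 8.0, 9.0],
})
features = InteractionFeatures(columns=["x1", "x2", "x3"]).fit_transform(df)
print(list(features.columns))

# The transformer also works as a Pipeline step.
pipeline = Pipeline([
    ("interactions", InteractionFeatures(columns=["x1", "x2", "x3"])),
    ("model", LinearRegression()),
])
pipeline.fit(df, [1.0, 2.0, 3.0])
fitted_predictions = pipeline.predict(df)
```

Inheriting from `BaseEstimator` provides `get_params` and `set_params`, and `TransformerMixin` provides `fit_transform`, which is what lets the class work in a `Pipeline` and in parameter searches.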
