InterviewStack.io

Scikit-learn, Pandas, and NumPy Usage Questions

Practical proficiency with these core libraries. Pandas: DataFrames, data manipulation, handling missing values. NumPy: arrays, vectorized operations, mathematical functions. Scikit-learn: preprocessing, model fitting, evaluation metrics, pipelines. Knowing standard patterns and APIs. Writing efficient, readable code using these libraries.

Medium · Technical
73 practiced
You are training a classifier on an imbalanced dataset. Explain how to implement stratified k-fold cross-validation with scikit-learn and integrate oversampling (e.g., SMOTE) properly into the pipeline so that oversampling is only applied to training folds and does not leak into validation. Provide a short code sketch using imblearn (if you know it) or describe alternative approaches.
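One of the "alternative approaches" can be sketched without imblearn: resample inside the cross-validation loop, so that only the training indices of each fold are ever oversampled. The sketch below swaps SMOTE for plain random oversampling (duplicating minority rows), and the helper names `oversample_minority` and `cross_validate_with_oversampling` are hypothetical, as is the synthetic dataset. With imblearn installed, the usual route is to place `SMOTE` as a step in `imblearn.pipeline.Pipeline`, which applies samplers during fitting only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold


def oversample_minority(X, y, rng):
    """Duplicate rows at random until every class matches the majority count.

    Assumption: random oversampling stands in for SMOTE here.
    """
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    parts = []
    for cls, count in zip(classes, counts):
        cls_idx = np.flatnonzero(y == cls)
        extra = rng.choice(cls_idx, size=target - count, replace=True)
        parts.append(np.concatenate([cls_idx, extra]))
    idx = np.concatenate(parts)
    return X[idx], y[idx]


def cross_validate_with_oversampling(X, y, n_splits=5, seed=0):
    """Stratified k-fold where resampling touches the training fold only."""
    rng = np.random.default_rng(seed)
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, val_idx in skf.split(X, y):
        # Resample AFTER the split, so no synthetic or duplicated row
        # derived from a validation sample can reach the training data.
        X_res, y_res = oversample_minority(X[train_idx], y[train_idx], rng)
        model = LogisticRegression(max_iter=1000).fit(X_res, y_res)
        # The validation fold keeps its original, imbalanced distribution.
        scores.append(f1_score(y[val_idx], model.predict(X[val_idx])))
    return np.array(scores)


# Synthetic 90/10 imbalanced data, for illustration only.
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.9, 0.1], random_state=0
)
scores = cross_validate_with_oversampling(X, y)
print(scores.mean())
```

Oversampling the full dataset before splitting would let copies (or SMOTE neighbours) of validation rows leak into training, which tends to inflate the validation score.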
Medium · Technical
101 practiced
Explain how to use joblib.Memory to cache expensive transformation steps in scikit-learn pipelines during iterative development. Provide an example where a custom feature engineering step is cached to avoid recomputing on parameter search iterations and describe limitations in distributed environments.
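A minimal sketch of the caching setup, assuming a toy transformer: `ExpensiveFeatures` is a hypothetical stand-in for a slow feature engineering step, and the data and parameter grid are illustrative. Passing a `joblib.Memory` object as the `memory` argument of `Pipeline` caches the fitting of the transformer steps, so a grid search that varies only the final estimator can reuse the transformed training fold instead of recomputing it.

```python
import os
import tempfile

import numpy as np
from joblib import Memory
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


class ExpensiveFeatures(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for a slow feature engineering step."""

    def __init__(self, power=2):
        self.power = power

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = np.asarray(X)
        # Append |x|^power columns to the original features.
        return np.hstack([X, np.abs(X) ** self.power])


# Cache lives in a temporary local directory for this sketch.
cache_dir = tempfile.mkdtemp(prefix="sk_cache_")
memory = Memory(location=cache_dir, verbose=0)

pipe = Pipeline(
    steps=[("features", ExpensiveFeatures()), ("ridge", Ridge())],
    memory=memory,  # transformer fits are cached; the final estimator is not
)

X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)

# Only the Ridge parameter varies, so the cached "features" step can be
# reused across candidates that share the same training fold.
search = GridSearchCV(pipe, {"ridge__alpha": [0.1, 1.0, 10.0]}, cv=3, n_jobs=1)
search.fit(X, y)

cached_entries = os.listdir(cache_dir)  # joblib writes its store here
print(search.best_params_, cached_entries)
```

On limitations: the cache is a directory on a filesystem, so workers on different machines share it only if that path is on shared storage. Hashing large input arrays to look up a cache entry has its own cost, and changing a parameter of the cached step itself produces a cache miss.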
Hard · Technical
95 practiced
Explain when and why you would use probability calibration (e.g., sklearn.calibration.CalibratedClassifierCV). Give an example scenario where model scores are not well calibrated and how calibration affects downstream decision thresholds or business metrics.
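As one illustrative scenario, Gaussian naive Bayes on data with many redundant features often pushes scores toward 0 or 1, because its independence assumption is violated. The sketch below wraps it in `CalibratedClassifierCV` and compares Brier scores and the number of cases crossing a fixed threshold. The synthetic dataset, the isotonic method, and the 0.8 threshold are assumptions chosen for the example, not recommendations.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Redundant features break the independence assumption of naive Bayes.
X, y = make_classification(
    n_samples=2000,
    n_features=20,
    n_informative=2,
    n_redundant=10,
    random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

raw = GaussianNB().fit(X_train, y_train)

# Calibration maps are fitted on held-out folds of the training data (cv=5).
calibrated = CalibratedClassifierCV(GaussianNB(), method="isotonic", cv=5)
calibrated.fit(X_train, y_train)

p_raw = raw.predict_proba(X_test)[:, 1]
p_cal = calibrated.predict_proba(X_test)[:, 1]

# Brier score: mean squared error of predicted probabilities (lower is better).
brier_raw = brier_score_loss(y_test, p_raw)
brier_cal = brier_score_loss(y_test, p_cal)

# A fixed business threshold can flag a different number of cases
# once the scores are rescaled by calibration.
threshold = 0.8
flagged_raw = int((p_raw >= threshold).sum())
flagged_cal = int((p_cal >= threshold).sum())

print(brier_raw, brier_cal, flagged_raw, flagged_cal)
```

Calibration matters most when the score is consumed as a probability, for example in an expected-cost rule such as "act when p times benefit exceeds cost". If only the ranking of cases matters, isotonic or sigmoid calibration changes little, since both are monotone maps.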
Hard · Technical
70 practiced
Write Python code to efficiently compute rolling window features on a time-series DataFrame with columns ['user_id','timestamp','value'], computing a 7-day rolling mean and count per user. The dataset is large, so describe how you would implement chunked or group-wise processing to keep memory usage reasonable and ensure the rolling windows respect user boundaries.
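A minimal sketch of one way to do this: a time-based `rolling("7D")` applied within `groupby("user_id")` keeps windows inside each user, and processing users in batches bounds peak memory. The function names, the `users_per_chunk` parameter, and the tiny sample frame are hypothetical. A real pipeline would more likely read each batch of users from partitioned storage than filter an in-memory frame.

```python
import pandas as pd


def rolling_features(df):
    """7-day rolling mean and count of `value`, computed per user."""
    # Time-based windows need timestamps sorted within each user.
    df = df.sort_values(["user_id", "timestamp"])
    rolled = df.set_index("timestamp").groupby("user_id")["value"].rolling("7D")
    out = pd.DataFrame({"mean_7d": rolled.mean(), "count_7d": rolled.count()})
    # The index is (user_id, timestamp); turn it back into columns.
    return out.reset_index()


def rolling_features_chunked(df, users_per_chunk=1000):
    """Process a batch of users at a time.

    Chunks are split by user, never by row, so no window is cut in half
    and no window can span two users.
    """
    users = df["user_id"].unique()
    parts = []
    for start in range(0, len(users), users_per_chunk):
        batch = users[start:start + users_per_chunk]
        parts.append(rolling_features(df[df["user_id"].isin(batch)]))
    return pd.concat(parts, ignore_index=True)


# Tiny illustrative frame.
df = pd.DataFrame(
    {
        "user_id": ["a", "b", "a", "a"],
        "timestamp": pd.to_datetime(
            ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-10"]
        ),
        "value": [1.0, 10.0, 3.0, 5.0],
    }
)
features = rolling_features_chunked(df, users_per_chunk=1)
print(features)
```

With the default `closed="right"`, the window at time t covers (t - 7 days, t], so for user "a" the row on 2024-01-10 excludes the 2024-01-03 observation, which lies exactly seven days earlier.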
Medium · Technical
54 practiced
Provide a code example showing how to use sklearn.pipeline.Pipeline and FeatureUnion (or ColumnTransformer) to combine engineered numeric polynomial features (degree 2) with original features, and then perform L2-regularized linear regression. Explain interaction terms and how to avoid feature explosion for high-cardinality inputs.
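One possible shape for such a pipeline, sketched under assumptions: the data is synthetic, `poly_cols` is a hypothetical choice of columns to expand, and `Ridge` supplies the L2 penalty. `PolynomialFeatures(degree=2, include_bias=False)` already emits the degree-1 terms, so the original features stay alongside the squares and pairwise products.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data whose target depends on an interaction and a square.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] * X[:, 1] + 0.5 * X[:, 2] ** 2 + rng.normal(scale=0.1, size=200)

# Expand only a chosen subset of columns to limit feature growth.
poly_cols = [0, 1, 2]

poly_branch = Pipeline(
    [
        ("scale", StandardScaler()),
        # Degree 2 on 3 inputs: 3 linear + 3 squared + 3 pairwise = 9 columns.
        ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ]
)

features = ColumnTransformer(
    transformers=[("poly", poly_branch, poly_cols)],
    remainder="passthrough",  # columns 3 and 4 pass through unchanged
)

model = Pipeline([("features", features), ("ridge", Ridge(alpha=1.0))])
model.fit(X, y)

n_features_out = model.named_steps["features"].transform(X).shape[1]
r2 = model.score(X, y)
print(n_features_out, r2)
```

An interaction term is a product such as x0 * x1, which lets the linear model represent an effect of one input that depends on another. A degree-d expansion of n inputs yields on the order of C(n + d, d) columns, so expanding one-hot encoded high-cardinality categoricals grows very quickly. Restricting the expansion to selected numeric columns, or setting `interaction_only=True` to drop pure powers, keeps the count down.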
