InterviewStack.io

Scikit Learn, Pandas, and NumPy Usage Questions

Practical proficiency with these core libraries. Pandas: DataFrames, data manipulation, handling missing values. NumPy: arrays, vectorized operations, mathematical functions. Scikit-learn: preprocessing, model fitting, evaluation metrics, pipelines. Knowing standard patterns and APIs. Writing efficient, readable code using these libraries.

Easy | Technical
Given two small pandas DataFrames:
left = pd.DataFrame({'id': [1, 2, 3], 'left_val': [10, 20, 30]})
right = pd.DataFrame({'id': [2, 3, 4], 'right_val': [200, 300, 400]})
1) Show the results of inner, left, right, and outer merges on 'id'.
2) In code, perform a left merge and include an indicator column that shows merge origin.
Explain when each merge type is appropriate.
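One possible answer sketch (not an official solution) for the merge question, using only the two DataFrames given above. The variable names `inner`, `left_m`, `right_m`, `outer`, and `left_ind` are chosen here for illustration.

```python
import pandas as pd

left = pd.DataFrame({'id': [1, 2, 3], 'left_val': [10, 20, 30]})
right = pd.DataFrame({'id': [2, 3, 4], 'right_val': [200, 300, 400]})

# Inner: only ids present in both frames (2, 3).
inner = left.merge(right, on='id', how='inner')

# Left: every row of `left` (1, 2, 3); right_val is NaN for id 1.
left_m = left.merge(right, on='id', how='left')

# Right: every row of `right` (2, 3, 4); left_val is NaN for id 4.
right_m = left.merge(right, on='id', how='right')

# Outer: union of ids (1, 2, 3, 4); NaN wherever one side has no match.
outer = left.merge(right, on='id', how='outer')

# Left merge with an indicator column. pandas adds a categorical `_merge`
# column whose values are 'left_only', 'right_only', or 'both'.
left_ind = left.merge(right, on='id', how='left', indicator=True)
print(left_ind)
```

As a rough guide: an inner merge fits when only matched records matter, a left merge when every row of the primary table must be kept, a right merge in the mirrored case, and an outer merge when reconciling two sources and unmatched rows on either side need to be inspected.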
Medium | Technical
Explain differences between KFold, StratifiedKFold, and TimeSeriesSplit in scikit-learn. For a forecasting problem where data is ordered by time, provide sample code showing how to use TimeSeriesSplit for model validation and explain why shuffling would be inappropriate.
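A possible answer sketch for the validation question. KFold partitions the rows into k folds without looking at the labels, StratifiedKFold does the same while preserving class proportions in each fold, and TimeSeriesSplit yields expanding training windows whose test fold always comes later in the row order. The synthetic trend series, the Ridge model, and the use of a plain time index as the only feature are assumptions made here for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import TimeSeriesSplit

# Synthetic, time-ordered data (assumption): a linear trend plus noise.
rng = np.random.default_rng(0)
n = 120
t = np.arange(n)
X = t.reshape(-1, 1).astype(float)
y = 0.5 * t + rng.normal(scale=2.0, size=n)

tscv = TimeSeriesSplit(n_splits=5)
fold_mae = []
ordered = []
for train_idx, test_idx in tscv.split(X):
    # Every training row precedes every test row in time.
    ordered.append(train_idx.max() < test_idx.min())
    model = Ridge(alpha=1.0).fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    fold_mae.append(mean_absolute_error(y[test_idx], pred))

print("MAE per fold:", np.round(fold_mae, 2))
```

Shuffling would be inappropriate here because it places later observations in the training set while earlier ones are held out, so the model is scored on data it could only predict with knowledge of the future, and the validation score overstates what it can do on genuinely unseen periods.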
Easy | Technical
You have a pandas DataFrame 'df' with columns: user_id, event_time (datetime), metric (float) and some NaNs in metric. Write Python code to:
1) Impute missing metric values with the median metric for that user's historical values.
2) If a user has no historical non-null values, fill with the global median.
3) After imputation, forward-fill within each user when ordering by event_time for any remaining NaNs.
Explain edge-case handling.
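A possible answer sketch for the imputation question. The toy `df` is invented for illustration, and "historical values" is read here as all of a user's observed values; a strictly past-only reading would need an expanding median per user instead, which this sketch does not implement.

```python
import numpy as np
import pandas as pd

# Toy data (assumption): user 3 has no observed metric at all.
df = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2, 3],
    'event_time': pd.to_datetime([
        '2024-01-01', '2024-01-02', '2024-01-03',
        '2024-01-01', '2024-01-02', '2024-01-01',
    ]),
    'metric': [1.0, np.nan, 3.0, np.nan, 10.0, np.nan],
})

df = df.sort_values(['user_id', 'event_time']).reset_index(drop=True)

# Global median from observed values only, taken before any imputation
# so that filled-in values do not influence it.
global_median = df['metric'].median()

# 1) Per-user median, broadcast back to each row; NaN for all-NaN users.
user_median = df.groupby('user_id')['metric'].transform('median')
df['metric'] = df['metric'].fillna(user_median)

# 2) Users with no observed values fall back to the global median.
df['metric'] = df['metric'].fillna(global_median)

# 3) Forward-fill within each user in event_time order. This only has an
#    effect if NaNs survive steps 1 and 2, e.g. when the entire column is NaN.
df['metric'] = df.groupby('user_id')['metric'].ffill()

print(df)
```

Edge cases worth mentioning in an answer: if every metric value is NaN, the global median is itself NaN and nothing can be filled; a forward-fill cannot fill a user's first row; and duplicate event_time values within a user make the fill order depend on the sort's tie-breaking.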
Medium | Technical
Design a scikit-learn pipeline using ColumnTransformer to handle a dataset with mixed features: numeric_cols = ['age','income'], categorical_cols = ['region','plan'], with missing values. Pipeline should:
- Impute numeric with median and scale
- Impute categorical with 'missing' and OneHotEncode (handle_unknown='ignore')
- Fit a RandomForestClassifier on processed features
Provide complete Python code for the pipeline.
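A possible answer sketch for the pipeline question. The column lists come from the question; the toy training frame, the labels, the choice of StandardScaler for the scaling step, and the RandomForestClassifier settings are assumptions made so the example runs end to end.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ['age', 'income']
categorical_cols = ['region', 'plan']

# Numeric branch: median imputation, then standardization.
numeric_pipe = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
])

# Categorical branch: fill with the literal 'missing', then one-hot encode.
# handle_unknown='ignore' encodes unseen categories as all zeros at predict time.
categorical_pipe = Pipeline([
    ('impute', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])

preprocess = ColumnTransformer([
    ('num', numeric_pipe, numeric_cols),
    ('cat', categorical_pipe, categorical_cols),
])

clf = Pipeline([
    ('preprocess', preprocess),
    ('model', RandomForestClassifier(n_estimators=50, random_state=0)),
])

# Toy data (assumption) with missing values in every column.
X = pd.DataFrame({
    'age': [25, np.nan, 40, 31, 52, 46],
    'income': [40000, 52000, np.nan, 61000, 80000, 75000],
    'region': ['north', 'south', np.nan, 'north', 'east', 'south'],
    'plan': ['basic', 'pro', 'basic', np.nan, 'pro', 'pro'],
})
y = np.array([0, 1, 0, 0, 1, 1])

clf.fit(X, y)

# 'west' was never seen during fit; the encoder ignores it instead of raising.
X_new = pd.DataFrame({'age': [30], 'income': [np.nan],
                      'region': ['west'], 'plan': ['basic']})
pred = clf.predict(X_new)
print(pred)
```

Keeping the imputers and encoder inside the pipeline means they are fitted only on the training portion of each fold when the whole object is passed to a cross-validation routine, which avoids preprocessing statistics leaking from validation rows.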
Hard | Technical
A model had 98% cross-validation accuracy during development but performs poorly in production. You suspect target leakage via a timestamp-derived feature that uses future information. Describe a systematic approach to diagnose leakage using pandas and scikit-learn, including code to detect features highly correlated with future target values using time-based splits.
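One way to sketch the detection step, under stated assumptions: the data is synthetic, `honest_feat` and `leaky_feat` are hypothetical column names, the "future" horizon is a single row ahead, and the 0.5 threshold is arbitrary. The idea is to sort by time, then within each training window of a TimeSeriesSplit measure how strongly each feature correlates with the next row's target; a feature that tracks the future target closely is a leakage suspect.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# Synthetic data (assumption): an i.i.d. binary target ordered by time.
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    'event_time': pd.date_range('2024-01-01', periods=n, freq='D'),
    'target': rng.integers(0, 2, size=n).astype(float),
})
df['honest_feat'] = rng.normal(size=n)
# Deliberately leaky: built from the NEXT row's target plus small noise.
df['leaky_feat'] = df['target'].shift(-1).fillna(0) + rng.normal(scale=0.1, size=n)

df = df.sort_values('event_time').reset_index(drop=True)
future_target = df['target'].shift(-1)  # target one step ahead of each row

features = ['honest_feat', 'leaky_feat']
rows = []
for fold, (train_idx, _) in enumerate(TimeSeriesSplit(n_splits=4).split(df)):
    train = df.iloc[train_idx]
    fut = future_target.iloc[train_idx]
    for col in features:
        # Series.corr aligns on index and skips NaN pairs.
        rows.append({'fold': fold, 'feature': col,
                     'abs_corr_future': abs(train[col].corr(fut))})

report = pd.DataFrame(rows).groupby('feature')['abs_corr_future'].mean()
suspects = report[report > 0.5].index.tolist()  # arbitrary threshold
print(report)
print("Suspect features:", suspects)
```

On real data a target with strong autocorrelation can make legitimate features correlate with future values too, so flagged features are candidates for manual review of how they were computed rather than proof of leakage. A complementary check is to compare a shuffled KFold score with a TimeSeriesSplit score for the same model; a large gap points in the same direction.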
