InterviewStack.io

Scikit-learn, Pandas, and NumPy Usage Questions

These questions test practical proficiency with three core libraries. Pandas: DataFrames, data manipulation, handling missing values. NumPy: arrays, vectorized operations, mathematical functions. Scikit-learn: preprocessing, model fitting, evaluation metrics, pipelines. They also cover knowledge of standard patterns and APIs, and the ability to write efficient, readable code with these libraries.

Medium | Technical
Implement a custom scikit-learn Transformer named InteractionFeatures that takes a list of numeric columns and returns their pairwise product interactions (e.g., x1*x2, x1*x3...). The transformer should implement fit/transform, be compatible with pandas DataFrames preserving column names, and integrate into a Pipeline. Provide Python code.
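One possible shape for an answer is sketched below. It assumes the input is always a pandas DataFrame, and it keeps the original columns and appends the products, which is a design choice rather than something the question specifies. The `columns_in_` and `pairs_` attribute names and the `a*b` naming scheme are illustrative assumptions.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline


class InteractionFeatures(TransformerMixin, BaseEstimator):
    """Append pairwise products of the selected numeric columns."""

    def __init__(self, columns=None):
        # Store constructor arguments unchanged so that clone() works.
        self.columns = columns

    def fit(self, X, y=None):
        # Assumption: X is a DataFrame; default to all of its columns.
        cols = list(X.columns) if self.columns is None else list(self.columns)
        missing = [c for c in cols if c not in X.columns]
        if missing:
            raise ValueError(f"Columns not found in X: {missing}")
        self.columns_in_ = list(X.columns)
        self.pairs_ = list(combinations(cols, 2))
        return self

    def transform(self, X):
        out = X.copy()  # never mutate the caller's frame
        for a, b in self.pairs_:
            out[f"{a}*{b}"] = X[a] * X[b]
        return out

    def get_feature_names_out(self, input_features=None):
        names = self.columns_in_ + [f"{a}*{b}" for a, b in self.pairs_]
        return np.asarray(names, dtype=object)


# Toy data, made up for illustration only.
df = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0, 4.0],
    "x2": [2.0, 0.0, 1.0, 3.0],
    "x3": [0.5, 1.5, 2.5, 3.5],
})
y = [1.0, 2.0, 3.0, 4.0]

features = InteractionFeatures(["x1", "x2", "x3"]).fit_transform(df)

pipe = Pipeline([
    ("interactions", InteractionFeatures(["x1", "x2", "x3"])),
    ("model", LinearRegression()),
])
pipe.fit(df, y)
predictions = pipe.predict(df)
```

Computing the pairs in `fit` rather than `transform` keeps the output schema fixed after fitting, so the transformer produces the same columns at training and prediction time.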
Hard | Technical
Two tables of 50M rows each must be joined in pandas on a key. Describe and justify strategies to perform this efficiently: dtype optimization, index usage, categorical conversions, chunked or external-merge approaches, use of Parquet/Arrow, using a database, or migrating to Dask/Spark. Provide pseudocode for a chunked merge approach in pandas.
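One way to chunk the join when neither table fits comfortably in memory is to partition both tables by key, so that matching rows always land in the same partition, and then merge one partition at a time. The sketch below shows the idea on tiny in-memory frames. It assumes an integer key and uses a simple modulo as the partitioning function; `partitioned_merge` and `n_partitions` are hypothetical names, and in a real job each partition would more likely be written to and read back from Parquet files than sliced from a DataFrame.

```python
import numpy as np
import pandas as pd


def partitioned_merge(left, right, key, n_partitions=4, how="inner"):
    """Yield the merge result one key partition at a time.

    Rows with equal keys always share a bucket, so merging bucket by
    bucket gives the same rows as a single large merge.
    Assumption: the key column holds non-negative integers.
    """
    left_bucket = left[key].to_numpy() % n_partitions
    right_bucket = right[key].to_numpy() % n_partitions
    for p in range(n_partitions):
        left_part = left[left_bucket == p]
        right_part = right[right_bucket == p]
        yield left_part.merge(right_part, on=key, how=how)


# Toy data, made up for illustration; int32 keys show the dtype downcast.
left = pd.DataFrame({
    "key": np.arange(10, dtype="int32"),
    "a": np.arange(10) * 1.0,
})
right = pd.DataFrame({
    "key": np.array([1, 3, 3, 7, 12], dtype="int32"),
    "b": list("vwxyz"),
})

parts = list(partitioned_merge(left, right, "key", n_partitions=3))
result = (
    pd.concat(parts, ignore_index=True)
    .sort_values(["key", "b"])
    .reset_index(drop=True)
)
```

Because the function is a generator, each partition's result can be written to disk and released before the next one is computed, which bounds peak memory by the largest partition rather than by the full join.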
Medium | Technical
You inherit code that calls df.apply row-wise (axis=1) with a Python function that performs multiple conditional checks and lookups; profiling shows it is the bottleneck. Describe how you would identify the slow parts and refactor them into vectorized or otherwise faster implementations. Provide an example refactor that replaces apply with vectorized np.where or .map.
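A minimal sketch of such a refactor follows. The data, the `TAX_RATE` lookup table, and the business rule are all invented for illustration; the point is only the pattern of replacing a dictionary lookup with `.map` and an if/else branch with `np.where`.

```python
import numpy as np
import pandas as pd

# Hypothetical data and lookup table.
df = pd.DataFrame({
    "country": ["US", "DE", "US", "FR"],
    "amount": [120.0, 80.0, 40.0, 300.0],
})
TAX_RATE = {"US": 0.07, "DE": 0.19}


def total_row(row):
    """Original style: one Python call per row."""
    rate = TAX_RATE.get(row["country"], 0.0)
    if row["amount"] >= 100:
        return row["amount"] * (1 + rate)
    return row["amount"]


slow = df.apply(total_row, axis=1)

# Vectorized: the lookup becomes .map, the branch becomes np.where.
rate = df["country"].map(TAX_RATE).fillna(0.0)
fast = pd.Series(
    np.where(df["amount"] >= 100, df["amount"] * (1 + rate), df["amount"]),
    index=df.index,
)
```

Keeping the original function around during the refactor makes it easy to check the two versions against each other on a sample before deleting the slow path.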
Medium | Technical
A categorical feature has 500 categories and you plan to one-hot encode it. Discuss memory and performance implications and show how to use OneHotEncoder in scikit-learn to return a sparse matrix (scikit-learn >=1.2 uses sparse_output). Provide code that integrates this into a pipeline and explain trade-offs.
Hard | Technical
Implement nested cross-validation for model selection using scikit-learn, where the inner loop performs GridSearchCV on a pipeline (preprocessing + classifier) and the outer loop measures generalization. Provide code using cross_val_score or manual loops. Discuss computational cost and practical shortcuts.
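A compact version using cross_val_score might look like the sketch below. The dataset, the choice of classifier, the parameter grid, and the fold counts are illustrative assumptions, kept small so the example runs quickly.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Preprocessing lives inside the pipeline so it is refit on every
# training fold and never sees the corresponding held-out data.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Inner loop: hyperparameter selection on each outer training fold.
search = GridSearchCV(pipe, param_grid, cv=inner_cv, scoring="accuracy")

# Outer loop: score the whole "search, then refit" procedure.
outer_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="accuracy")
mean_score = outer_scores.mean()
```

With these settings each outer fold runs 4 candidates times 3 inner folds plus 1 refit, so the total is 5 x 13 = 65 model fits. The cost grows multiplicatively with grid size and fold counts, which is why smaller grids, randomized search, or parallel execution through the `n_jobs` parameter are common shortcuts.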
