InterviewStack.io

Scikit-learn, Pandas, and NumPy Usage Questions

These questions test practical proficiency with three core libraries. Pandas: DataFrames, data manipulation, handling missing values. NumPy: arrays, vectorized operations, mathematical functions. Scikit-learn: preprocessing, model fitting, evaluation metrics, pipelines. Candidates are expected to know the standard patterns and APIs and to write efficient, readable code with these libraries.

Hard | Technical
You have a slow pipeline where pandas groupby + apply dominates runtime on a 100M-row dataset. Outline a profiling and optimization plan: tools to profile CPU and memory, strategies to rewrite custom aggregations using vectorized operations, using numba or Cython for hotspots, and moving aggregation to a database. Demonstrate one concrete optimization with code and measured improvement.
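One possible concrete optimization, sketched under assumptions: the custom aggregation is taken to be a per-group weighted mean, and the column names `key`, `x`, `w` and the synthetic data sizes are invented for illustration. Profilers such as `cProfile`, `line_profiler`, or `memory_profiler` would first confirm that the `apply` call is the hotspot. The sketch then replaces the Python-level `apply` with built-in `sum` aggregations and times both versions; no timings are quoted here because they depend on hardware and data shape.

```python
import time

import numpy as np
import pandas as pd

# Synthetic stand-in for the real dataset (sizes are assumptions).
rng = np.random.default_rng(0)
n_rows, n_keys = 50_000, 1_000
df = pd.DataFrame({
    "key": rng.integers(0, n_keys, size=n_rows),
    "x": rng.random(n_rows),
    "w": rng.random(n_rows) + 0.1,  # strictly positive weights
})


def weighted_mean_apply(frame):
    # Baseline: one Python function call per group.
    return frame.groupby("key")[["x", "w"]].apply(
        lambda g: np.average(g["x"], weights=g["w"])
    )


def weighted_mean_vectorized(frame):
    # Same result from two built-in sums: sum(w * x) / sum(w) per group.
    parts = pd.DataFrame({"wx": frame["x"] * frame["w"], "w": frame["w"]})
    sums = parts.groupby(frame["key"]).sum()
    return sums["wx"] / sums["w"]


def timed(fn, frame):
    start = time.perf_counter()
    result = fn(frame)
    return result, time.perf_counter() - start


slow, slow_seconds = timed(weighted_mean_apply, df)
fast, fast_seconds = timed(weighted_mean_vectorized, df)
print(f"apply: {slow_seconds:.3f}s  vectorized: {fast_seconds:.3f}s")
```

The same decomposition idea (express the aggregation as a combination of built-in reductions) is what makes it straightforward to push the work into a database as `SUM(w * x) / SUM(w) ... GROUP BY key`.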
Hard | Technical
Given irregular timestamps per entity, compute for each timestamp the exponentially weighted moving average (EWMA) for that entity with a half-life of 7 days. Implement this in pandas using groupby and ewm, or a manual approach that handles irregular sampling and preserves alignment to the original timestamps. Explain numerical stability considerations.
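A minimal sketch of the manual approach, under several assumptions: the column names `entity`, `ts`, and `value` are hypothetical, the DataFrame index is unique, there are no missing values, and the weighting is the normalized form in which an observation's weight halves every 7 days of elapsed time. The running numerator and denominator are each decayed by `0.5 ** (gap / halflife)` before adding the new point, so intermediate values stay bounded and a very long gap simply underflows the decay to zero.

```python
import numpy as np
import pandas as pd

HALFLIFE = pd.Timedelta(days=7)


def ewma_irregular(values, times, halflife=HALFLIFE):
    """EWMA of `values` observed at increasing `times` (one entity)."""
    values = np.asarray(values, dtype=float)
    t = pd.DatetimeIndex(times)
    # Gaps between consecutive observations, measured in half-lives.
    gaps = np.asarray((t[1:] - t[:-1]) / halflife, dtype=float)
    out = np.empty(len(values))
    num = den = 0.0
    for i, v in enumerate(values):
        decay = 1.0 if i == 0 else 0.5 ** gaps[i - 1]
        num = decay * num + v      # decayed weighted sum of values
        den = decay * den + 1.0    # decayed sum of weights, always >= 1
        out[i] = num / den
    return out


def add_ewma(df):
    """Return df with an `ewma` column aligned to the original rows."""
    ordered = df.sort_values(["entity", "ts"])
    result = pd.Series(np.nan, index=df.index, dtype=float)
    for _, g in ordered.groupby("entity", sort=False):
        # Assign by index label so the original row order is preserved.
        result.loc[g.index] = ewma_irregular(g["value"].to_numpy(), g["ts"])
    return df.assign(ewma=result)


# Rows are deliberately out of time order to show that alignment survives.
df = pd.DataFrame({
    "entity": ["A", "B", "A"],
    "ts": pd.to_datetime(["2024-01-08", "2024-01-03", "2024-01-01"]),
    "value": [3.0, 10.0, 1.0],
})
out = add_ewma(df)
print(out)
```

For entity A the two points are exactly one half-life apart, so the second EWMA is (0.5 * 1.0 + 3.0) / (0.5 + 1.0). pandas also provides `Series.ewm(halflife=..., times=...)`; it is worth checking its output against a manual version like this one before relying on either.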
Hard | Technical
Write pytest unit tests for a scikit-learn Pipeline that validate: a) pipeline.fit_transform on a small dataset produces expected output shape and no NaNs, b) saving and loading the pipeline via joblib.dump/load yields identical predictions, and c) a custom transformer raises a ValueError for invalid input types. Provide example test code using temporary files and assertions.
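A sketch of what such tests might look like. `ClipTransformer`, `make_data`, `make_preprocessor`, and `make_model` are hypothetical helpers invented for this example, not part of scikit-learn. Because `fit_transform` requires the final step to be a transformer, check (a) uses a preprocessing-only pipeline, while check (b) uses a pipeline ending in a classifier. The `tmp_path` argument is pytest's built-in temporary directory fixture.

```python
import joblib
import numpy as np
import pytest
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class ClipTransformer(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: clips values and rejects non-array input."""

    def __init__(self, lower=-3.0, upper=3.0):
        self.lower = lower
        self.upper = upper

    def _check(self, X):
        if not isinstance(X, np.ndarray) or X.ndim != 2:
            raise ValueError("X must be a 2D numpy array")

    def fit(self, X, y=None):
        self._check(X)
        self.n_features_in_ = X.shape[1]  # marks the estimator as fitted
        return self

    def transform(self, X):
        self._check(X)
        return np.clip(X, self.lower, self.upper)


def make_data():
    rng = np.random.default_rng(0)
    X = rng.normal(size=(40, 3))
    y = (X[:, 0] > np.median(X[:, 0])).astype(int)  # balanced labels
    return X, y


def make_preprocessor():
    return Pipeline([("clip", ClipTransformer()), ("scale", StandardScaler())])


def make_model():
    return Pipeline([
        ("clip", ClipTransformer()),
        ("scale", StandardScaler()),
        ("clf", LogisticRegression()),
    ])


def test_fit_transform_shape_and_no_nans():
    X, _ = make_data()
    Xt = make_preprocessor().fit_transform(X)
    assert Xt.shape == X.shape
    assert not np.isnan(Xt).any()


def test_joblib_round_trip(tmp_path):
    X, y = make_data()
    model = make_model().fit(X, y)
    path = tmp_path / "pipeline.joblib"
    joblib.dump(model, path)
    restored = joblib.load(path)
    np.testing.assert_array_equal(model.predict(X), restored.predict(X))


def test_invalid_input_type_raises():
    with pytest.raises(ValueError):
        ClipTransformer().fit([[1.0, 2.0]])  # a list, not an ndarray
```

Saved as a file such as `test_pipeline.py` (the name is arbitrary), this would be run with `pytest test_pipeline.py`.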
Hard | Technical
Design and implement a nested cross-validation workflow with scikit-learn for a time-dependent regression problem that avoids leakage. Show code where the outer loop is TimeSeriesSplit for estimation of generalization error and the inner loop is another TimeSeriesSplit or GridSearchCV used for hyperparameter tuning. Explain how to include preprocessing steps safely inside the pipeline.
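One way this could be sketched, assuming rows are already sorted by time and using synthetic data, a `Ridge` model, and an `alpha` grid chosen purely for illustration. The scaler lives inside the `Pipeline`, so it is refit on each training fold only. The `GridSearchCV` object is itself the estimator passed to the outer loop, so tuning happens only on each outer training window, and the inner `TimeSeriesSplit` splits that window chronologically.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, time-ordered regression data (an assumption for the sketch).
rng = np.random.default_rng(0)
n_samples = 300
X = rng.normal(size=(n_samples, 4))
trend = 0.5 * np.arange(n_samples) / n_samples
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + trend + rng.normal(scale=0.3, size=n_samples)

# Preprocessing inside the pipeline: fitted on training folds only.
pipe = Pipeline([("scale", StandardScaler()), ("model", Ridge())])

# Inner loop: hyperparameter search with chronological splits.
inner_search = GridSearchCV(
    estimator=pipe,
    param_grid={"model__alpha": [0.1, 1.0, 10.0]},
    cv=TimeSeriesSplit(n_splits=3),
    scoring="neg_mean_absolute_error",
)

# Outer loop: each fold trains on the past and scores on the next block.
outer_cv = TimeSeriesSplit(n_splits=5)
outer_scores = cross_val_score(
    inner_search, X, y, cv=outer_cv, scoring="neg_mean_absolute_error"
)

print("MAE per outer fold:", -outer_scores)
print("Mean MAE:", -outer_scores.mean())
```

If features are built from lagged targets, the `gap` parameter of `TimeSeriesSplit` can leave a buffer between training and test indices; whether that is needed depends on how the features were constructed.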
Easy | Technical
Explain the performance differences between DataFrame.iterrows(), DataFrame.itertuples(), and vectorized operations. For a task that computes a new column as a simple arithmetic function of existing numeric columns, recommend the fastest approach and show example code.
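A small sketch comparing the three approaches on the same computation. The column names `a`, `b`, `c` and the formula `2 * a + b` are made up for illustration. `iterrows()` builds a Series for every row, `itertuples()` yields lighter namedtuples, and the vectorized expression performs the arithmetic on whole columns at once, which is generally the fastest of the three for simple arithmetic.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.random(1_000), "b": rng.random(1_000)})

# Slowest: each iteration constructs a pandas Series for the row.
via_iterrows = [row["a"] * 2 + row["b"] for _, row in df.iterrows()]

# Faster: rows arrive as namedtuples, but the loop still runs in Python.
via_itertuples = [t.a * 2 + t.b for t in df.itertuples(index=False)]

# Recommended: one vectorized expression over entire columns.
df["c"] = df["a"] * 2 + df["b"]
```

All three produce the same values; only the vectorized form avoids per-row Python overhead.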
