InterviewStack.io

Scikit-learn, Pandas, and NumPy Usage Questions

Practical proficiency with these core libraries. Pandas: DataFrames, data manipulation, handling missing values. NumPy: arrays, vectorized operations, mathematical functions. Scikit-learn: preprocessing, model fitting, evaluation metrics, pipelines. Candidates should know the standard patterns and APIs and be able to write efficient, readable code with these libraries.

Hard | System Design
59 practiced
You must train a model on 100 million rows that do not fit into memory. Describe concrete strategies using pandas and numpy to preprocess and train: out-of-core learning with partial_fit, chunked preprocessing and feature extraction, using numpy.memmap for large arrays, and when to adopt Dask or Spark. Provide example code demonstrating chunked training with sklearn.linear_model.SGDClassifier.partial_fit.
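A minimal sketch of the chunked-training pattern the question asks for. It is not a definitive answer: the small temporary CSV, the column names `f0`, `f1`, `f2`, `label`, and the chunk size are all hypothetical stand-ins for the real 100-million-row source. It makes two passes over the file, one to accumulate scaling statistics with `StandardScaler.partial_fit` and one to train with `SGDClassifier.partial_fit`.

```python
import os
import tempfile

import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the large file: a small synthetic CSV on disk.
rng = np.random.default_rng(0)
n_rows = 20_000
X_demo = rng.normal(size=(n_rows, 3))
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] > 0).astype(int)
demo = pd.DataFrame(X_demo, columns=["f0", "f1", "f2"])
demo["label"] = y_demo

csv_path = os.path.join(tempfile.mkdtemp(), "demo.csv")
demo.to_csv(csv_path, index=False)

feature_cols = ["f0", "f1", "f2"]
chunk_size = 2_000  # assumed; in practice chosen so one chunk fits in memory

# Pass 1: accumulate mean and variance incrementally, one chunk at a time.
scaler = StandardScaler()
for chunk in pd.read_csv(csv_path, chunksize=chunk_size):
    scaler.partial_fit(chunk[feature_cols].to_numpy())

# Pass 2: train incrementally. partial_fit needs the full set of class
# labels up front, because a single chunk may not contain every class.
clf = SGDClassifier(random_state=0)
classes = np.array([0, 1])
for chunk in pd.read_csv(csv_path, chunksize=chunk_size):
    X_chunk = scaler.transform(chunk[feature_cols].to_numpy())
    y_chunk = chunk["label"].to_numpy()
    clf.partial_fit(X_chunk, y_chunk, classes=classes)

# Only feasible here because the demo data is tiny; on real data,
# evaluation would use a held-out set that is also read in chunks.
train_accuracy = clf.score(scaler.transform(X_demo), y_demo)
print(f"training accuracy: {train_accuracy:.3f}")
```

One design note: SGD is sensitive to row order, so if the file on disk is sorted by label or by time, shuffling it beforehand (or running several passes over shuffled chunks) is usually worth considering.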
Hard | Technical
55 practiced
Discuss numerical stability and floating point issues in numpy computations (overflow, underflow, catastrophic cancellation) and how they can affect scikit-learn algorithms such as log-loss calculation or SVD/PCA. Provide practical mitigation techniques including scaling, using the log-sum-exp trick for stable softmax/log-sum computations, choosing float32 vs float64, and show a stable log-sum-exp code example.
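One possible shape for the stable log-sum-exp example the question requests, offered as a sketch. The function names `logsumexp` and `log_softmax` are illustrative choices, and the logits are made-up values picked to trigger overflow in the naive formula.

```python
import numpy as np


def logsumexp(x, axis=-1):
    """Compute log(sum(exp(x))) along `axis` without overflowing."""
    x = np.asarray(x, dtype=np.float64)
    # Subtracting the maximum makes the largest exponent exactly 0,
    # so exp() never exceeds 1 and cannot overflow.
    m = np.max(x, axis=axis, keepdims=True)
    # If a slice is all -inf, m is -inf and x - m would be NaN; shift by 0.
    m = np.where(np.isfinite(m), m, 0.0)
    out = m + np.log(np.sum(np.exp(x - m), axis=axis, keepdims=True))
    return np.squeeze(out, axis=axis)


def log_softmax(x, axis=-1):
    """Log-probabilities computed through the stable log-sum-exp."""
    x = np.asarray(x, dtype=np.float64)
    return x - np.expand_dims(logsumexp(x, axis=axis), axis)


logits = np.array([1000.0, 1000.0])

# Naive formula: exp(1000) overflows float64, so the result is inf.
with np.errstate(over="ignore"):
    naive = np.log(np.sum(np.exp(logits)))

stable = logsumexp(logits)  # mathematically 1000 + log(2)
print(naive, stable)
```

SciPy provides `scipy.special.logsumexp`, which is usually preferable to a hand-written version in production code; the sketch above is only meant to show why the shift by the maximum works.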
Medium | Technical
58 practiced
You need to compute column z = sqrt(x**2 + y**2) for 20 million rows stored in a pandas DataFrame with columns 'x' and 'y'. Compare three approaches: Python loop, df.apply(row-wise), and NumPy vectorized computation. Provide code for each approach, report expected relative timings, and implement the fastest approach using numpy for best speed and minimal memory overhead.
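A sketch of the three approaches on a deliberately small synthetic frame (20,000 rows rather than 20 million, an assumption made so the slow variants finish quickly). It prints timings but does not assert them, since they depend on hardware and library versions. Typically the row-wise `apply` is the slowest, the explicit loop over NumPy arrays is somewhat faster, and the vectorized call is faster than both by orders of magnitude. The vectorized version uses `np.hypot`, which computes the same quantity in a single pass without materializing `x**2` and `y**2` as temporaries.

```python
import math
import time

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 20_000  # assumed demo size; the question's real size is 20 million
df = pd.DataFrame({"x": rng.normal(size=n), "y": rng.normal(size=n)})
timings = {}

# 1) Plain Python loop over the underlying arrays.
start = time.perf_counter()
xs, ys = df["x"].to_numpy(), df["y"].to_numpy()
z_loop = np.empty(n)
for i in range(n):
    z_loop[i] = math.sqrt(xs[i] ** 2 + ys[i] ** 2)
timings["loop"] = time.perf_counter() - start

# 2) Row-wise apply: pandas builds a Series object for every row.
start = time.perf_counter()
z_apply = df.apply(lambda row: math.sqrt(row["x"] ** 2 + row["y"] ** 2), axis=1)
timings["apply"] = time.perf_counter() - start

# 3) Vectorized: one compiled pass, no intermediate squared arrays.
start = time.perf_counter()
df["z"] = np.hypot(df["x"].to_numpy(), df["y"].to_numpy())
timings["vectorized"] = time.perf_counter() - start

for name, seconds in timings.items():
    print(f"{name:>10}: {seconds:.4f} s")
```

If memory is the binding constraint at full scale, storing `x` and `y` as `float32` halves the footprint, at the cost of reduced precision in `z`.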
Medium | Technical
57 practiced
Explain data leakage introduced by scaling or feature selection before cross-validation. Provide code showing an incorrect approach where StandardScaler is fit on the whole dataset before cross_val_score, then fix it by moving scaling into a Pipeline so scaling occurs inside each fold. Explain the impact on reported metrics.
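The contrast the question describes could be sketched as follows. The synthetic dataset from `make_classification` and the choice of `LogisticRegression` are assumptions for illustration. In the leaky variant the scaler sees every row, including rows that later serve as validation folds; in the pipeline variant `cross_val_score` refits the scaler on the training portion of each fold only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical dataset, used only to make the example runnable.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Incorrect: the scaler's mean and variance are computed on ALL rows,
# so each validation fold has already influenced the preprocessing.
X_scaled = StandardScaler().fit_transform(X)
leaky_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_scaled, y, cv=5
)

# Correct: the pipeline is cloned and refit inside every fold, so the
# scaler only ever sees that fold's training rows.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clean_scores = cross_val_score(pipeline, X, y, cv=5)

print(f"leaky mean accuracy: {leaky_scores.mean():.3f}")
print(f"clean mean accuracy: {clean_scores.mean():.3f}")
```

With scaling alone the gap between the two numbers is often small, so this example should not be read as proof of a large effect. The optimistic bias tends to be far more visible when the leaked step uses the labels, for example supervised feature selection performed before the split.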
Easy | Technical
55 practiced
Explain differences between NumPy slicing (views) and advanced integer indexing (copies). Given a = np.arange(12).reshape(3,4), demonstrate a[1:3,:], a[[0,2],[1,3]], and show how assignment to these views or copies affects the original array. Provide code and explain when a copy is created.
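A short sketch of the behavior in question, using the array from the prompt. The variable names are illustrative. `np.shares_memory` is used to check for a view instead of comparing `.base`, because `a` is itself a reshaped view of the `arange` buffer.

```python
import numpy as np

a = np.arange(12).reshape(3, 4)

# Basic slicing returns a view onto the same buffer.
view = a[1:3, :]

# Advanced (integer-array) indexing returns a new array. The index lists
# are paired element-wise, so this picks a[0, 1] and a[2, 3].
picked = a[[0, 2], [1, 3]]

shares_view = np.shares_memory(a, view)      # True: same memory
shares_picked = np.shares_memory(a, picked)  # False: independent copy

view[0, 0] = -1   # writes through: a[1, 0] is now -1
picked[0] = 99    # changes only the copy: a[0, 1] is still 1

a_after_writes = a.copy()  # snapshot for inspection

# Assigning THROUGH an advanced index on the left-hand side is different:
# it modifies `a` in place, because no intermediate copy is created.
a[[0, 2], [1, 3]] = 0

print(a_after_writes)
print(a)
```

The rule of thumb this illustrates: reading with slices gives a view, reading with integer or boolean arrays gives a copy, and assignment written directly as `a[index] = value` always targets the original array regardless of the index type.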
