
Scikit Learn, Pandas, and NumPy Usage Questions

These questions test practical proficiency with three core libraries. Pandas: DataFrames, data manipulation, and handling missing values. NumPy: arrays, vectorized operations, and mathematical functions. Scikit-learn: preprocessing, model fitting, evaluation metrics, and pipelines. Candidates are expected to know the standard patterns and APIs and to write efficient, readable code with them.

Medium · Technical
Implement a scikit-learn compatible transformer class named DateFeatureExtractor that accepts a datetime column name(s) and on transform returns numeric columns year, month, dayofweek, is_weekend, and days_since_first for each row. The transformer should implement fit and transform and work inside a Pipeline with other scikit-learn transformers. Provide the class skeleton and a short usage example with a pandas DataFrame.
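
One possible shape for such a transformer is sketched below. It assumes the input is a pandas DataFrame, that `days_since_first` is measured from the earliest date seen during `fit`, and that output columns are prefixed with the source column name; the `signup` column and its sample dates are made up for illustration.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class DateFeatureExtractor(BaseEstimator, TransformerMixin):
    """Expand one or more datetime columns into numeric calendar features."""

    def __init__(self, columns):
        # Store the argument unmodified so get_params() and clone() work.
        self.columns = columns

    def fit(self, X, y=None):
        cols = [self.columns] if isinstance(self.columns, str) else list(self.columns)
        # Assumption: days_since_first is relative to the earliest training date.
        self.first_dates_ = {c: pd.to_datetime(X[c]).min() for c in cols}
        return self

    def transform(self, X):
        out = {}
        for col, first in self.first_dates_.items():
            dt = pd.to_datetime(X[col])
            out[f"{col}_year"] = dt.dt.year
            out[f"{col}_month"] = dt.dt.month
            out[f"{col}_dayofweek"] = dt.dt.dayofweek  # Monday=0 ... Sunday=6
            out[f"{col}_is_weekend"] = (dt.dt.dayofweek >= 5).astype(int)
            out[f"{col}_days_since_first"] = (dt - first).dt.days
        return pd.DataFrame(out, index=X.index)


# Hypothetical sample data: a Friday, a Saturday, and a Monday.
df = pd.DataFrame({"signup": ["2024-01-05", "2024-01-06", "2024-01-15"]})
features = DateFeatureExtractor("signup").fit_transform(df)

# The same transformer chained with another scikit-learn step.
pipe = Pipeline([
    ("dates", DateFeatureExtractor("signup")),
    ("scale", StandardScaler()),
])
scaled = pipe.fit_transform(df)
```

This sketch returns only the derived date features. To keep other columns alongside them, wrapping the transformer in a `ColumnTransformer` is one option.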
Easy · Technical
Explain differences between NumPy slicing (views) and advanced integer indexing (copies). Given a = np.arange(12).reshape(3,4), demonstrate a[1:3,:], a[[0,2],[1,3]], and show how assignment to these views or copies affects the original array. Provide code and explain when a copy is created.
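
A minimal sketch of the behaviour the question asks about. The variable names (`view`, `picked`, and the two `shares_*` flags) are chosen here for illustration; `np.shares_memory` is used to check whether two arrays overlap in memory.

```python
import numpy as np

a = np.arange(12).reshape(3, 4)

# Basic slicing returns a view: no data is copied.
view = a[1:3, :]

# Integer-array (advanced) indexing returns a new array.
# Here it pairs the indices element-wise: a[0, 1] and a[2, 3].
picked = a[[0, 2], [1, 3]]

shares_view = np.shares_memory(a, view)
shares_picked = np.shares_memory(a, picked)

# Writing through the view changes the original array (a[1, 0]).
view[0, 0] = 100

# Writing into the copy leaves the original untouched.
picked[0] = -1

# Advanced indexing on the left-hand side of an assignment
# writes directly into the original array.
a[[0, 2], [1, 3]] = 99
```

The rule of thumb: indexing with only slices and integers yields a view, while indexing with integer or boolean arrays yields a copy. When in doubt, `np.shares_memory` settles it.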
Easy · Technical
Given a DataFrame df with columns ['age','country','score'], write a pandas expression to filter rows where age is between 25 and 40, country is in ['US','CA'], and score is in the top 10% for that country. Provide code that avoids Python-level loops and explain how to compute per-country quantiles efficiently.
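
One way to approach this without Python-level loops is sketched below. It assumes "top 10%" means a score at or above the 90th percentile computed over all rows of that country (before the age filter), which is an interpretation rather than something the question fixes. The sample data is invented.

```python
import pandas as pd

# Hypothetical data: 10 US rows, 5 CA rows, 2 MX rows.
df = pd.DataFrame({
    "age": [30] * 8 + [22, 30] + [35] * 5 + [30, 30],
    "country": ["US"] * 10 + ["CA"] * 5 + ["MX"] * 2,
    "score": [10, 20, 30, 40, 50, 60, 70, 80, 100, 100]
             + [5, 15, 25, 35, 45]
             + [99, 1],
})

# One quantile per country, computed in a single grouped pass.
p90 = df.groupby("country")["score"].quantile(0.9)

# Broadcast each country's threshold back onto its rows.
threshold = df["country"].map(p90)

mask = (
    df["age"].between(25, 40)            # inclusive on both ends
    & df["country"].isin(["US", "CA"])
    & (df["score"] >= threshold)
)
result = df[mask]
```

Mapping the per-country Series onto the `country` column keeps everything vectorized: the quantiles are computed once per group, and the comparison runs over whole columns.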
Hard · System Design
You must train a model on 100 million rows that do not fit into memory. Describe concrete strategies using pandas and numpy to preprocess and train: out-of-core learning with partial_fit, chunked preprocessing and feature extraction, using numpy.memmap for large arrays, and when to adopt Dask or Spark. Provide example code demonstrating chunked training with sklearn.linear_model.SGDClassifier.partial_fit.
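
The chunked-training part of this question can be sketched as follows. To stay self-contained, the example stands in a small in-memory CSV (`io.StringIO`) for the real on-disk file. The column names `f0`, `f1`, `f2`, `label`, the synthetic data, and the chunk size are all assumptions made for the demo.

```python
import io

import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier

# Synthetic stand-in for a large file: 2000 rows, 3 features, binary label.
rng = np.random.default_rng(0)
X_all = rng.normal(size=(2000, 3))
y_all = (X_all[:, 0] + X_all[:, 1] > 0).astype(int)
demo = pd.DataFrame(X_all, columns=["f0", "f1", "f2"]).assign(label=y_all)
csv_buffer = io.StringIO(demo.to_csv(index=False))

clf = SGDClassifier(random_state=0)
classes = np.array([0, 1])  # partial_fit needs the full label set up front

n_chunks = 0
# read_csv with chunksize yields DataFrames lazily, so only one
# chunk is held in memory at a time.
for chunk in pd.read_csv(csv_buffer, chunksize=500):
    X = chunk[["f0", "f1", "f2"]].to_numpy()
    y = chunk["label"].to_numpy()
    clf.partial_fit(X, y, classes=classes)
    n_chunks += 1

# Evaluated on the training data only to confirm the loop learned something.
accuracy = clf.score(X_all, y_all)
```

With real data, any preprocessing has to be incremental as well. `StandardScaler`, for example, also exposes `partial_fit`, so scaling statistics can be accumulated in a first pass over the chunks before training in a second.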
Medium · Technical
You need to compute column z = sqrt(x**2 + y**2) for 20 million rows stored in a pandas DataFrame with columns 'x' and 'y'. Compare three approaches: Python loop, df.apply(row-wise), and NumPy vectorized computation. Provide code for each approach, report expected relative timings, and implement the fastest approach using numpy for best speed and minimal memory overhead.
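
The three approaches can be sketched side by side on a small sample. The example uses 1,000 random rows instead of 20 million so it runs quickly, and it makes no timing claims. As a general expectation, the loop and `apply` versions run Python code once per row and are typically orders of magnitude slower than the vectorized call.

```python
import math

import numpy as np
import pandas as pd

# Small synthetic sample standing in for the 20-million-row frame.
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.normal(size=1000), "y": rng.normal(size=1000)})

# 1) Pure Python loop: one interpreter round trip per row.
z_loop = [math.sqrt(x * x + y * y) for x, y in zip(df["x"], df["y"])]

# 2) Row-wise apply: builds a Series object for every row, usually the slowest.
z_apply = df.apply(lambda row: math.sqrt(row["x"] ** 2 + row["y"] ** 2), axis=1)

# 3) Vectorized: np.hypot computes sqrt(x**2 + y**2) in compiled code
#    and allocates only the output array, with no x**2 / y**2 temporaries.
df["z"] = np.hypot(df["x"].to_numpy(), df["y"].to_numpy())
```

Writing `np.sqrt(df["x"]**2 + df["y"]**2)` is also vectorized, but it materializes intermediate arrays for the squares and their sum, which matters at 20 million rows. `np.hypot` avoids those temporaries.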
