Scikit-learn, Pandas, and NumPy Usage Questions

Practical proficiency with these core libraries. Pandas: DataFrames, data manipulation, handling missing values. NumPy: arrays, vectorized operations, mathematical functions. Scikit-learn: preprocessing, model fitting, evaluation metrics, pipelines. The questions test knowledge of standard patterns and APIs and the ability to write efficient, readable code with these libraries.

Easy, Technical

Compare pandas.get_dummies and sklearn.preprocessing.OneHotEncoder. Discuss differences in how they handle unseen categories at inference time, whether they preserve column order/names, and performance trade-offs. Show code snippets illustrating consistent one-hot encoding between training and test sets.

Easy, Technical

You have a DataFrame `df` with columns `['user_id', 'age', 'country', 'income']` where `age` and `income` contain missing values. In Python/pandas, demonstrate three different ways to handle missing values appropriately: dropping rows, filling with global statistics, and filling with group-based statistics (per `country`). Show code snippets and discuss trade-offs.
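
A minimal sketch of the three strategies follows. The sample values are invented for illustration, and the median is an assumed choice of statistic (the mean would work the same way).

```python
import numpy as np
import pandas as pd

# Hypothetical sample data matching the columns in the question.
df = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5, 6],
    "age": [25.0, np.nan, 40.0, 30.0, np.nan, 50.0],
    "country": ["US", "US", "US", "DE", "DE", "DE"],
    "income": [50000.0, 60000.0, np.nan, 40000.0, 45000.0, np.nan],
})

# 1) Drop rows where either column is missing (simple, but loses data).
dropped = df.dropna(subset=["age", "income"])

# 2) Fill with a global statistic (keeps all rows, ignores group structure).
global_filled = df.fillna(
    {"age": df["age"].median(), "income": df["income"].median()}
)

# 3) Fill with a per-country statistic (respects group differences).
group_filled = df.copy()
for col in ["age", "income"]:
    group_median = group_filled.groupby("country")[col].transform("median")
    group_filled[col] = group_filled[col].fillna(group_median)
```

Note that a group whose values are all missing would still be NaN after step 3, so a global fallback may be needed in practice.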

Hard, Technical

Describe how to convert a pandas DataFrame with mixed types into a SciPy sparse matrix suitable for scikit-learn estimators that accept sparse input. Provide code that converts one-hot encoded categorical features and scaled numeric features into a single csr_matrix while preserving column order.

Easy, Technical

Using Python and pandas, write code to perform the following tasks on a CSV file named `data.csv` (comma-separated, header in first row):

1) Load the CSV into a DataFrame, parsing a column named `event_time` as datetime.
2) Print the first 5 rows and a concise summary (`info`) including non-null counts and dtypes.
3) Select only the columns `['user_id', 'event_time', 'value']` and filter rows where `value > 0` and `event_time` is in calendar year 2021.
4) Show how you would handle rows with malformed dates during parsing.

Provide Python/pandas code and state any assumptions.
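
A possible answer is sketched below. To keep it runnable without a file on disk, an in-memory string stands in for `data.csv`; the sample rows are invented. Dates are parsed after loading with `pd.to_datetime(..., errors="coerce")`, which is one assumed way to handle malformed values: they become `NaT` instead of raising or leaving the column as strings.

```python
import io
import pandas as pd

# Hypothetical file contents; in practice use pd.read_csv("data.csv").
csv_text = """user_id,event_time,value
1,2021-03-01 10:00:00,5.0
2,2020-12-31 23:59:59,3.0
3,not-a-date,7.0
4,2021-07-15 08:30:00,-1.0
"""
df = pd.read_csv(io.StringIO(csv_text))

# Tasks 1 and 4: parse dates, turning malformed entries into NaT.
df["event_time"] = pd.to_datetime(df["event_time"], errors="coerce")
n_malformed = int(df["event_time"].isna().sum())  # malformed or missing

# Task 2: first rows and a summary with non-null counts and dtypes.
print(df.head())
df.info()

# Task 3: keep three columns, positive values, calendar year 2021.
# NaT rows fail the year comparison, so they are filtered out here.
mask = (df["value"] > 0) & (df["event_time"].dt.year == 2021)
result = df.loc[mask, ["user_id", "event_time", "value"]]
```

Whether to drop, log, or repair the `NaT` rows depends on the use case; counting them first, as above, at least makes the data loss visible.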

Hard, Technical

Write Python code to efficiently compute rolling window features on a time-series DataFrame with columns ['user_id','timestamp','value'], computing a 7-day rolling mean and count per user. The dataset is large—describe how you would implement chunked or group-wise processing to keep memory usage reasonable and ensure the rolling windows respect user boundaries.
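
A minimal sketch of group-wise processing is shown below, using invented sample data and a hypothetical helper named `add_rolling_features`. Each user's rows are handled independently, so windows cannot cross user boundaries. For data that does not fit in memory, the same helper could in principle be applied to one batch of users at a time, provided all rows for a given user land in the same batch.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "timestamp": pd.to_datetime(
        ["2021-01-01", "2021-01-03", "2021-01-10", "2021-01-01", "2021-01-02"]
    ),
    "value": [1.0, 3.0, 5.0, 10.0, 20.0],
})

def add_rolling_features(group: pd.DataFrame) -> pd.DataFrame:
    """Add 7-day rolling mean and count for a single user's rows."""
    group = group.sort_values("timestamp")  # time windows need sorted time
    window = group.set_index("timestamp")["value"].rolling("7D")
    # Results are in the same (sorted) row order, so assign positionally.
    return group.assign(
        mean_7d=window.mean().to_numpy(),
        count_7d=window.count().to_numpy(),
    )

# Process one user at a time, then restore the original row order.
parts = [add_rolling_features(g) for _, g in df.groupby("user_id", sort=False)]
features = pd.concat(parts).sort_index()
```

With the default settings, a `"7D"` window includes the current row and excludes a row exactly seven days earlier, which is worth stating as an assumption when answering.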
