Pandas Data Manipulation and Analysis Questions
Data manipulation and analysis using the Pandas library: reading data from CSV or SQL sources, selecting and filtering rows and columns, boolean indexing, iloc and loc usage, groupby aggregations, merging and concatenating DataFrames, handling missing values with dropna and fillna, applying transformations via apply and vectorized operations, reshaping with pivot and melt, and performance considerations for large DataFrames. Includes converting SQL-style logic into Pandas workflows for exploratory data analysis and feature engineering.
Easy · Technical
Describe common strategies in pandas for handling missing values during ML preprocessing. Given a DataFrame df with columns ['feature1', 'feature2', 'category', 'timestamp'], provide code to: (a) drop rows where 'feature1' is NaN; (b) fill 'feature2' with its median; (c) fill 'category' with its mode; and (d) briefly explain when forward/backward fill (ffill/bfill) is appropriate for time-series data.
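One possible answer sketch, using a small hypothetical frame in place of the real df (only the column names come from the question):

import pandas as pd
import numpy as np

# Toy stand-in for the df described above.
df = pd.DataFrame({
    'feature1': [1.0, np.nan, 3.0, 4.0],
    'feature2': [10.0, 20.0, np.nan, 40.0],
    'category': ['a', None, 'a', 'b'],
    'timestamp': pd.date_range('2024-01-01', periods=4, freq='D'),
})

# (a) drop rows where 'feature1' is NaN
df = df.dropna(subset=['feature1'])

# (b) fill 'feature2' with its median
df['feature2'] = df['feature2'].fillna(df['feature2'].median())

# (c) fill 'category' with its mode (mode() returns a Series; take the first value)
df['category'] = df['category'].fillna(df['category'].mode().iloc[0])

# (d) ffill/bfill only make sense on data sorted by time: ffill carries the last
# observation forward (reasonable for slowly changing signals such as sensor
# readings), while bfill pulls future values backward and can leak information
# into a forecasting setup.
df = df.sort_values('timestamp')
df['feature2'] = df['feature2'].ffill()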
Hard · Technical
Convert the following SQL into an idiomatic pandas implementation and describe each step for performance and clarity:

WITH ranked AS (
    SELECT user_id, amount, timestamp,
           ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY timestamp DESC) AS rn
    FROM transactions
)
SELECT r.user_id, SUM(t.amount) AS total_recent
FROM ranked r
JOIN transactions t ON r.user_id = t.user_id
WHERE r.rn = 1
  AND t.timestamp >= r.timestamp - INTERVAL '30 days'
GROUP BY r.user_id;

Explain how you'd implement this in pandas and how to optimize it for large datasets.
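A hedged sketch of one possible translation, assuming transactions is a DataFrame with columns user_id, amount, and a datetime64 timestamp (the toy rows below are illustrative only):

import pandas as pd

# Toy stand-in for the transactions table.
transactions = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2],
    'amount': [10.0, 20.0, 5.0, 7.0, 3.0],
    'timestamp': pd.to_datetime(
        ['2024-01-01', '2024-03-01', '2024-03-10', '2024-02-01', '2024-02-20']
    ),
})

# The rn = 1 row of the window function is just each user's latest timestamp;
# transform('max') broadcasts it back to every row without an explicit self-join.
cutoff = transactions.groupby('user_id')['timestamp'].transform('max') - pd.Timedelta(days=30)

# Mirror the WHERE + GROUP BY: keep rows in the trailing 30-day window, sum per user.
total_recent = (
    transactions.loc[transactions['timestamp'] >= cutoff]
    .groupby('user_id', as_index=False)['amount']
    .sum()
    .rename(columns={'amount': 'total_recent'})
)

For large data, sorting by user_id, casting user_id to a categorical dtype, or processing the file in chunks keeps memory bounded; the transform-based cutoff also avoids materializing the self-join implied by the SQL.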
Medium · Technical
For a high-cardinality categorical feature, compare one-hot encoding (pd.get_dummies) vs target (mean) encoding implemented with pandas. Discuss implementation steps, how to use out-of-fold schemes in pandas to avoid leakage, overfitting risks, and operational concerns at inference (unseen categories).
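An illustrative out-of-fold target-encoding sketch using only pandas/NumPy; the toy data, the fold count, and the global-mean fallback are assumptions, not a prescribed recipe:

import numpy as np
import pandas as pd

# Toy training frame with a categorical feature and a binary target.
df = pd.DataFrame({
    'category': list('aabbbcca'),
    'target':   [1, 0, 1, 1, 0, 0, 1, 1],
})

global_mean = df['target'].mean()
df['cat_te'] = np.nan

# Assign each row to one of 4 folds; every row is encoded with category means
# computed on the *other* folds, so its own target never leaks into its feature.
rng = np.random.default_rng(0)
fold_id = pd.Series(rng.integers(0, 4, size=len(df)), index=df.index)
for k in range(4):
    in_fold = fold_id == k
    fold_means = df.loc[~in_fold].groupby('category')['target'].mean()
    df.loc[in_fold, 'cat_te'] = df.loc[in_fold, 'category'].map(fold_means)

# Categories unseen in the training folds (and unseen categories at inference time)
# fall back to the global mean.
df['cat_te'] = df['cat_te'].fillna(global_mean)

# One-hot baseline for comparison; its width grows with cardinality.
one_hot = pd.get_dummies(df['category'], prefix='category')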
Hard · Technical
Design an efficient, memory-conscious approach to compute the top-3 most frequent 'item' per 'user' from a very large CSV that cannot be loaded fully into memory. Describe a chunked strategy: compute chunk-level counts, persist intermediate aggregates, merge the partial counts, and produce the final top-3 per user. Provide a code outline (Python/pandas) and discuss complexity and I/O tradeoffs.
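A rough chunked outline; the file name events.csv, the column names, and the chunk size are hypothetical:

import pandas as pd

partials = []
for chunk in pd.read_csv('events.csv', usecols=['user', 'item'], chunksize=1_000_000):
    # Chunk-level counts: memory is bounded by the distinct (user, item) pairs per
    # chunk, not by the raw row count.
    partials.append(chunk.groupby(['user', 'item']).size().rename('n').reset_index())
    # If even the partial frames grow too large, periodically collapse them here
    # (concat + groupby-sum) or persist them to Parquet and merge afterwards.

# Merge partial counts: identical (user, item) pairs from different chunks are summed.
counts = (
    pd.concat(partials, ignore_index=True)
      .groupby(['user', 'item'], as_index=False)['n']
      .sum()
)

# Final top-3 items per user by frequency.
top3 = (
    counts.sort_values(['user', 'n'], ascending=[True, False])
          .groupby('user')
          .head(3)
)

The final sort costs roughly O(P log P) for P distinct (user, item) pairs, while I/O is a single sequential pass over the CSV; larger chunks mean fewer groupby passes but a higher peak memory footprint.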
Medium · Technical
Given a high-frequency, timezone-aware time-series DataFrame df with a UTC DatetimeIndex and a numeric column 'volume', write pandas code to convert the timestamps to 'America/New_York' and then resample to daily frequency, computing the daily sum. Explain how to handle DST transitions and the difference between converting an already tz-aware index (tz_convert) and localizing tz-naive timestamps (tz_localize).
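A minimal sketch using synthetic hourly data spanning the 2024 US spring-forward transition; the real df is assumed to already carry a UTC DatetimeIndex:

import numpy as np
import pandas as pd

# 72 hourly points starting 2024-03-09 00:00 UTC, covering the DST change on 2024-03-10.
idx = pd.date_range('2024-03-09 00:00', periods=72, freq='h', tz='UTC')
df = pd.DataFrame({'volume': np.ones(72)}, index=idx)

# tz_convert relabels already-aware timestamps in another zone without changing the
# underlying instants; tz_localize is only for attaching a zone to tz-naive data.
local = df.tz_convert('America/New_York')

# Daily bins now follow New York calendar days; because the index stays tz-aware,
# the spring-forward day simply contains 23 hours (and the fall-back day 25) rather
# than raising ambiguous/nonexistent-time errors.
daily = local['volume'].resample('D').sum()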