
Pandas Data Manipulation and Analysis Questions

Data manipulation and analysis using the pandas library: reading data from CSV or SQL sources, selecting and filtering rows and columns, boolean indexing, loc and iloc usage, groupby aggregations, merging and concatenating DataFrames, handling missing values with dropna and fillna, applying transformations via apply and vectorized operations, reshaping with pivot and melt, and performance considerations for large DataFrames. Also covers translating SQL-style logic into pandas workflows for exploratory data analysis and feature engineering.

Hard · Technical
You want to port a pandas pipeline to the GPU using RAPIDS cuDF to accelerate groupby and merge operations. Discuss which pandas APIs cuDF supports, what code changes are typically required, and how to handle unsupported operations (falling back to the CPU). Provide an example that replaces pd.read_csv with cudf.read_csv and then performs a groupby. Also discuss GPU memory limitations and strategies for handling them.
Hard · Technical
Write an efficient, memory-conscious approach to compute the top 3 most frequent 'item' values per 'user' from a very large CSV that cannot be loaded fully into memory. Describe a chunked strategy: compute chunk-level counts, persist intermediate aggregates, merge partial counts, and produce the final top 3 per user. Provide a code outline (Python/pandas) and discuss complexity and I/O tradeoffs.
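The chunked strategy described in this question can be sketched roughly as follows. This is a minimal illustration, not a reference answer: the function name `top3_items_per_user`, the tiny in-memory `demo_csv`, and the chunk size are all assumptions made for the example. It also keeps the partial counts in a Python list, whereas a real pipeline over a very large file would more likely persist each partial aggregate to disk.

```python
import io

import pandas as pd


def top3_items_per_user(csv_source, chunksize=100_000):
    """Return the 3 most frequent items per user, reading the CSV in chunks."""
    partial_counts = []
    for chunk in pd.read_csv(csv_source, usecols=["user", "item"], chunksize=chunksize):
        # Chunk-level aggregate: one entry per (user, item) pair seen in this chunk.
        partial_counts.append(chunk.groupby(["user", "item"]).size())

    # Merge partial counts: the same pair can show up in several chunks.
    totals = (
        pd.concat(partial_counts)
        .groupby(level=["user", "item"])
        .sum()
        .rename("count")
        .reset_index()
    )

    # Sort by count within each user; ties are broken by item name for determinism.
    totals = totals.sort_values(
        ["user", "count", "item"], ascending=[True, False, True]
    )
    return totals.groupby("user").head(3).reset_index(drop=True)


# Hypothetical demo data standing in for the large file.
demo_csv = io.StringIO(
    "user,item\n"
    "u1,a\nu1,a\nu1,b\nu1,c\nu1,c\nu1,c\nu1,d\n"
    "u2,x\nu2,y\nu2,y\n"
)
result = top3_items_per_user(demo_csv, chunksize=4)
print(result)
```

Each chunk is reduced to (user, item) counts before anything is retained, so peak memory depends on the number of distinct pairs rather than the number of rows. The file is read once, sequentially.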
Medium · Technical
You must merge six DataFrames of varying sizes into a single training table. Sizes: A(100M rows), B(5M), C(1M), D(100k), E(10k), F(500k). Keys: A joins to B on user_id; B to C on session_id; other joins are small lookups. Describe an optimal merge order and specific pandas techniques (indexing, dtype downcasting, selecting necessary columns, categorical codes, chunked merges) to minimize peak memory, including example code snippets.
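One possible shape for the column-selection, downcasting, and merge-order ideas in this question is sketched below on toy data. The helper `slim`, the feature column names (`label`, `b_feat`, `c_feat`, `unused`), and the assumption that B has one row per user_id and C one row per session_id are all hypothetical. The sketch joins the smaller B and C tables first so that the largest table, A, is merged only once.

```python
import numpy as np
import pandas as pd


def slim(df, columns):
    """Keep only the needed columns and downcast numeric dtypes to save memory."""
    out = df[columns].copy()
    for col in out.select_dtypes(include=[np.integer]).columns:
        out[col] = pd.to_numeric(out[col], downcast="integer")
    for col in out.select_dtypes(include=[np.floating]).columns:
        out[col] = pd.to_numeric(out[col], downcast="float")
    return out


# Toy stand-ins for A (largest), B, and C.
a = pd.DataFrame({"user_id": [1, 2, 2, 3], "label": [0, 1, 1, 0]})
b = pd.DataFrame(
    {
        "user_id": [1, 2, 3],
        "session_id": [10, 20, 30],
        "b_feat": [0.5, 1.5, 2.5],
        "unused": ["x", "y", "z"],  # dropped before merging
    }
)
c = pd.DataFrame({"session_id": [10, 20], "c_feat": [100, 200]})

# Join the smaller tables first; validate= guards against accidental row blow-up.
bc = slim(b, ["user_id", "session_id", "b_feat"]).merge(
    slim(c, ["session_id", "c_feat"]),
    on="session_id",
    how="left",
    validate="many_to_one",
)

# Touch the largest table only once, at the end.
train = slim(a, ["user_id", "label"]).merge(
    bc, on="user_id", how="left", validate="many_to_one"
)
print(train)
```

At real scale the same pattern would likely be combined with categorical codes for string keys and, if A still does not fit in memory, with merging A in chunks against the pre-joined lookup.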
Medium · Technical
A column df['metadata'] contains JSON strings such as '{"device":"mobile","os":"iOS"}'. Write pandas code to efficiently expand this column into separate columns 'device' and 'os' using pandas.json_normalize or df.apply + pd.Series. Show how to handle missing or malformed JSON rows gracefully without crashing the pipeline.
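A minimal sketch of one way to approach this, assuming flat JSON objects like the one in the question. The helper `parse_metadata` and the sample rows are hypothetical. Missing or malformed rows are mapped to an empty dict so they become NaN in the expanded columns instead of raising.

```python
import json

import pandas as pd


def parse_metadata(raw):
    """Parse one JSON string; return an empty dict for missing or malformed input."""
    if not isinstance(raw, str):
        return {}
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return {}
    # Only JSON objects can be expanded into columns.
    return parsed if isinstance(parsed, dict) else {}


df = pd.DataFrame(
    {
        "id": [1, 2, 3, 4],
        "metadata": [
            '{"device":"mobile","os":"iOS"}',
            None,                    # missing value
            "{not valid json",       # malformed JSON
            '{"device":"desktop"}',  # valid, but no "os" key
        ],
    }
)

# Parse once in a list comprehension, then build all columns in one call.
records = [parse_metadata(raw) for raw in df["metadata"]]
expanded = pd.json_normalize(records).reindex(columns=["device", "os"])
expanded.index = df.index  # align with the original rows before joining

result = df.drop(columns="metadata").join(expanded)
print(result)
```

Parsing into a list of dicts and normalizing once tends to be faster than `df.apply` returning a `pd.Series` per row, because it avoids constructing a Series object for every record.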
Medium · Technical
When should you prefer pivot_table over groupby + unstack? Given df with duplicates for some (store,date,product) combinations, write pandas code to create a matrix of summed sales with pivot_table using aggfunc='sum' and fill_value=0. Explain how pivot_table handles duplicates and compare performance.
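As a rough illustration of the pivot_table call this question asks for, the sketch below uses a small made-up DataFrame in which (S1, 2024-01-01, apple) appears twice. The column name `sales` and the sample values are assumptions. The groupby + unstack version is shown alongside for comparison.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "store": ["S1", "S1", "S1", "S2", "S2"],
        "date": ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-01", "2024-01-01"],
        "product": ["apple", "apple", "pear", "apple", "pear"],
        "sales": [10, 5, 7, 3, 4],
    }
)

# Duplicate (store, date, product) rows are aggregated by aggfunc rather than
# raising, which is what DataFrame.pivot would do on duplicate entries.
matrix = df.pivot_table(
    index=["store", "date"],
    columns="product",
    values="sales",
    aggfunc="sum",
    fill_value=0,
)

# The same matrix via groupby + unstack.
same = (
    df.groupby(["store", "date", "product"])["sales"]
    .sum()
    .unstack("product", fill_value=0)
)

print(matrix)
```

Both routes group the data first and then reshape, so their cost is generally similar. pivot_table is mainly a convenience when options such as margins or several aggregation functions are wanted in a single call.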
