InterviewStack.io

Pandas Data Manipulation and Analysis Questions

Data manipulation and analysis using the Pandas library: reading data from CSV or SQL sources, selecting and filtering rows and columns, boolean indexing, .iloc and .loc usage, groupby aggregations, merging and concatenating DataFrames, handling missing values with dropna and fillna, applying transformations via apply and vectorized operations, reshaping with pivot and melt, and performance considerations for large DataFrames. Also covers converting SQL-style logic into Pandas workflows for exploratory data analysis and feature engineering.

Hard · Technical
A standard df.groupby(['user_id','event_type'])['value'].agg(['sum','count']) on a 200M-row DataFrame is slow and memory-heavy. Describe concrete pandas-based strategies to optimize this aggregation, including dtype tuning, categorical conversion, chunked aggregation, and the potential use of alternative backends. Provide code examples and discuss trade-offs.
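A sketch of how an answer might combine the three in-pandas techniques the question names, run here on a small synthetic stand-in for the 200M-row frame (column names are taken from the question; chunked_agg is an illustrative helper, not a pandas API):

```python
import numpy as np
import pandas as pd

# Small synthetic stand-in for the 200M-row frame.
rng = np.random.default_rng(0)
n = 100_000
df = pd.DataFrame({
    "user_id": rng.integers(0, 1000, n),
    "event_type": rng.choice(["click", "view", "purchase"], n),
    "value": rng.random(n),
})

# 1. Dtype tuning: shrink integer and float columns before grouping.
df["user_id"] = pd.to_numeric(df["user_id"], downcast="unsigned")
df["value"] = df["value"].astype("float32")

# 2. Categorical conversion: low-cardinality string keys group faster
#    and hold one code per row instead of one Python string per row.
df["event_type"] = df["event_type"].astype("category")

# 3. Chunked aggregation: compute partial sums/counts per chunk, then
#    combine. Sum and count are both additive, so re-summing is exact.
def chunked_agg(frame, chunk_size=25_000):
    partials = []
    for start in range(0, len(frame), chunk_size):
        chunk = frame.iloc[start:start + chunk_size]
        partials.append(
            chunk.groupby(["user_id", "event_type"], observed=True)["value"]
                 .agg(["sum", "count"])
        )
    combined = pd.concat(partials)
    return combined.groupby(level=["user_id", "event_type"], observed=True).sum()

result = chunked_agg(df)
```

Note observed=True: without it, a categorical groupby materializes every unobserved (user_id, event_type) combination, which on 200M rows can dominate memory. The same chunked pattern extends to reading the source with pd.read_csv(..., chunksize=...) so the full frame never needs to be resident at once.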
Hard · System Design
Your daily ETL currently processes many CSV files sequentially using pandas. Propose ways to parallelize processing across CPU cores on a single machine and implement an example using concurrent.futures to process files in parallel. Discuss GIL implications, pickling overhead, IO contention, memory limits, and alternatives like Dask (local scheduler).
Easy · Technical
Given a DataFrame df with columns ['user_id', 'event', 'timestamp', 'properties'], demonstrate how to select rows where event == 'purchase' and timestamp is between two given dates using boolean indexing and .loc. Explain the difference between .loc and .iloc, and show example code that slices rows and selects only columns user_id and timestamp.
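A compact sketch of one possible answer, using the column names given in the question on a small made-up frame:

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "event": ["purchase", "view", "purchase", "purchase"],
    "timestamp": pd.to_datetime(
        ["2024-01-05", "2024-01-10", "2024-02-01", "2024-03-15"]),
    "properties": [{}, {}, {}, {}],
})

start, end = pd.Timestamp("2024-01-01"), pd.Timestamp("2024-02-15")

# Boolean mask: & combines element-wise, so each condition needs parentheses.
mask = (df["event"] == "purchase") & df["timestamp"].between(start, end)

# .loc is label/boolean-based: rows by mask, columns by name.
result = df.loc[mask, ["user_id", "timestamp"]]

# .iloc is purely positional: here, the first two rows and first two columns,
# regardless of index labels or column names.
positional = df.iloc[:2, :2]
```

The key distinction: .loc selects by index labels, column names, or boolean masks (and its slices are end-inclusive), while .iloc selects by integer position with Python's usual end-exclusive slicing.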
Hard · System Design
You maintain a pandas ETL that reads many CSVs, performs merges and groupbys, and writes Parquet. The team wants to migrate to a distributed system. For each ETL step (ingest, join, groupby, write), describe how you would map pandas code to Dask and Spark equivalents, note API differences, and highlight cluster considerations (shuffle, partitioning, memory). Give brief code examples illustrating key differences.
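One way to frame an answer: walk the four steps in runnable pandas, with the Dask counterpart for each step shown as a comment so no cluster is needed (frames and column names are illustrative; Spark's DataFrame API differs more, e.g. spark.read.csv, .join, .groupBy().agg()):

```python
import pandas as pd
# import dask.dataframe as dd   # Dask counterparts shown in comments below

# Ingest: pandas reads one file eagerly into memory;
# Dask reads a glob lazily into one partition per file:
#   ddf = dd.read_csv("events_*.csv")
events = pd.DataFrame({"user_id": [1, 2, 1], "value": [10.0, 5.0, 7.0]})
users = pd.DataFrame({"user_id": [1, 2], "country": ["US", "DE"]})

# Join: same .merge API, but in Dask a merge on a non-index key is a
# shuffle across partitions; pre-setting the index on the join key on
# both sides turns it into a cheap partition-aligned join:
#   joined = ddf.merge(dusers, on="user_id", how="left")
joined = events.merge(users, on="user_id", how="left")

# Groupby: same API; Dask runs a tree reduction across partitions and
# nothing executes until .compute():
#   totals = joined.groupby("country")["value"].sum().compute()
totals = joined.groupby("country")["value"].sum()

# Write: pandas writes a single Parquet file; Dask writes a directory
# of part files, one per partition:
#   joined.to_parquet("out.parquet")   # pandas
#   djoined.to_parquet("out/")         # dask: out/part.0.parquet, ...
```

The cluster-side points to raise alongside: shuffles dominate join/groupby cost, partition sizing (roughly 100MB–1GB per partition is a common rule of thumb) drives both parallelism and per-worker memory, and wide aggregations that fit in no single worker's RAM need spilling or a different key design.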
Hard · Technical
How would you unit-test pandas transformation functions used in your data pipeline? Provide pytest examples that validate schema (expected columns and dtypes), value ranges, null handling, and invariants. Show how to create small synthetic DataFrames as fixtures and how to assert DataFrame equality robustly.
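A sketch of the test shapes the question asks for, written around a hypothetical transformation add_revenue (make_orders stands in for a synthetic-data fixture; in a real pytest suite it would carry the @pytest.fixture decorator and be injected as an argument):

```python
import pandas as pd
import pandas.testing as pdt

# Transformation under test (hypothetical pipeline function).
def add_revenue(df):
    out = df.copy()
    out["revenue"] = out["price"] * out["quantity"]
    return out

# Small synthetic frame; decorate with @pytest.fixture in a real suite.
def make_orders():
    return pd.DataFrame({
        "order_id": [1, 2, 3],
        "price": [10.0, 0.0, 2.5],
        "quantity": [1, 4, 2],
    })

def test_schema():
    # Schema contract: expected columns in order, expected dtype.
    out = add_revenue(make_orders())
    assert list(out.columns) == ["order_id", "price", "quantity", "revenue"]
    assert out["revenue"].dtype == "float64"

def test_values_and_nulls():
    # Null handling and value-range invariants.
    out = add_revenue(make_orders())
    assert out["revenue"].notna().all()
    assert (out["revenue"] >= 0).all()

def test_frame_equality():
    # Robust whole-frame comparison: checks values, dtypes, and index,
    # with tolerant float comparison, and prints a precise diff on failure.
    expected = make_orders().assign(revenue=[10.0, 0.0, 5.0])
    pdt.assert_frame_equal(add_revenue(make_orders()), expected)
```

pandas.testing.assert_frame_equal is preferable to (df1 == df2).all().all() because it handles NaN != NaN, reports exactly which column and dtype diverged, and exposes check_dtype/check_like flags to relax the comparison deliberately.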
