InterviewStack.io

Python Data Manipulation with Pandas Questions

Skills and concepts for extracting, transforming, and preparing tabular and array data in Python using libraries such as pandas and NumPy. Candidates should be comfortable reading data from common formats, working with pandas DataFrame and Series objects, selecting and filtering rows and columns, boolean indexing and query methods, groupby aggregations, sorting, merging and joining dataframes, reshaping data with pivot and melt, handling missing values, and converting and validating data types. Understand NumPy arrays and vectorized operations for efficient numeric computation, when to prefer vectorized approaches over Python loops, and how to write readable, reusable data processing functions. At higher levels, expect questions on memory efficiency, profiling and optimizing slow pandas operations, processing data that does not fit in memory, and designing robust pipelines that handle edge cases and mixed data types.

Easy | Technical
Describe common techniques to detect and handle missing values in pandas. Give examples using dropna with thresholds, fillna with column-specific strategies, forward-fill/backward-fill within groups, and interpolation for time series. Include sample code for group-wise forward-fill limited to 1 consecutive NaN.
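One way to sketch an answer is shown below. The DataFrame, its column names (group, value, label), and the fill choices (median, "unknown") are invented for illustration and are not part of the question; treat this as a starting point, not a reference solution.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; column names are assumptions for this sketch.
df = pd.DataFrame({
    "group": ["a", "a", "a", "a", "b", "b", "b"],
    "value": [1.0, np.nan, np.nan, 4.0, np.nan, 5.0, np.nan],
    "label": ["x", None, "y", None, None, "z", None],
})

# Detect: count missing values per column.
missing_counts = df.isna().sum()

# dropna with a threshold: keep rows with at least 2 non-missing values.
kept = df.dropna(thresh=2)

# fillna with column-specific strategies, passed as a dict.
filled = df.fillna({"value": df["value"].median(), "label": "unknown"})

# Group-wise forward-fill, filling at most 1 consecutive NaN per gap.
# Values never leak across groups, so a leading NaN in a group stays NaN.
df["value_ffill"] = df.groupby("group")["value"].ffill(limit=1)

# Time-aware interpolation needs a DatetimeIndex; gaps are weighted by elapsed time.
ts = pd.Series(
    [1.0, np.nan, 4.0],
    index=pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-04"]),
)
ts_interp = ts.interpolate(method="time")
```

In group "a" the first NaN after 1.0 is filled and the second is left missing, which is the behavior limit=1 is meant to guarantee. Backward-fill works the same way with bfill(limit=1).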
Medium | Technical
Describe practical steps to reduce the memory footprint of a pandas DataFrame with many columns and millions of rows. Cover: downcasting numeric types, converting strings to categorical, using nullable dtypes, dropping unused columns, using sparse dtypes, and when to persist to on-disk formats. Include code examples for downcasting numeric columns.
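A minimal sketch of the downcasting step follows. The helper name downcast_numeric and the sample columns are hypothetical; the approach relies on pd.to_numeric with its downcast argument, which picks the smallest dtype that can hold the observed values.

```python
import numpy as np
import pandas as pd

def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with integer and float columns downcast where possible."""
    out = df.copy()
    # Integer columns shrink to the smallest signed integer type that fits.
    for col in out.select_dtypes(include=[np.integer]).columns:
        out[col] = pd.to_numeric(out[col], downcast="integer")
    # Float columns shrink to float32 at most; this can lose precision.
    for col in out.select_dtypes(include=[np.floating]).columns:
        out[col] = pd.to_numeric(out[col], downcast="float")
    return out

# Hypothetical example frame.
raw = pd.DataFrame({
    "count": np.array([1, 2, 3], dtype="int64"),
    "ratio": np.array([0.5, 1.5, 2.5], dtype="float64"),
})
small = downcast_numeric(raw)

# Compare footprints before and after (bytes, including the index).
before = raw.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
```

Downcasting floats to float32 trades precision for memory, so it is worth checking that downstream computations tolerate it. Low-cardinality string columns are usually a larger win and can be converted with astype("category").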
Hard | System Design
Design a memory-efficient pipeline to process a 500GB CSV dataset on a single machine with 32GB RAM. You may use pandas plus other Python libraries. Describe ingestion, schema inference, chunking strategy, intermediate storage format (e.g., parquet), deduplication, and final aggregation. Include failure recovery and how you'd validate correctness.
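The core of such a design is chunked reading with explicit dtypes and an aggregation that can be combined across chunks. The sketch below shows only that piece, assuming a per-key sum; the function name, column names, and tiny in-memory CSV are hypothetical stand-ins for the real file.

```python
import io
import pandas as pd

def sum_by_key_in_chunks(source, key, value, chunksize):
    """Aggregate value per key without loading the whole CSV at once."""
    reader = pd.read_csv(
        source,
        usecols=[key, value],              # read only the columns needed
        dtype={key: str, value: "float64"},  # explicit schema, no per-chunk inference
        chunksize=chunksize,               # yields DataFrames of at most this many rows
    )
    partials = []
    for chunk in reader:
        # Each partial result is small: one row per key seen in the chunk.
        partials.append(chunk.groupby(key)[value].sum())
    # Sums are decomposable, so partial sums can simply be summed again.
    return pd.concat(partials).groupby(level=0).sum()

# Hypothetical stand-in for the large file.
csv_text = "user,amount\na,1\nb,2\na,3\nb,4\na,5\n"
totals = sum_by_key_in_chunks(io.StringIO(csv_text), "user", "amount", chunksize=2)
```

This only works directly for decomposable aggregates such as sum, count, min, and max; a mean would need sum and count carried separately. A fuller answer would also write each cleaned chunk to a partitioned on-disk format, record which chunks have completed so a failed run can resume, and validate by comparing row counts and totals against the source.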
Medium | Technical
You have time-series events per user. For each user compute a 7-day rolling sum of 'amount' aligned to the right (i.e., window ending at each timestamp). Demonstrate a memory-efficient pandas solution that works per user and preserves original index order. Discuss using groupby + rolling vs groupby + apply and which is faster.
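One possible approach is sketched below. The column names user_id and ts and the sample rows are assumptions, and the sketch further assumes the original index is unique and timestamps are not missing. It sorts by user and time, rolls over a time-based window per group, then realigns the result to the original index.

```python
import pandas as pd

# Hypothetical events, deliberately not sorted by user or time.
events = pd.DataFrame({
    "user_id": ["u1", "u2", "u1", "u1", "u2"],
    "ts": pd.to_datetime(
        ["2024-01-05", "2024-01-02", "2024-01-01", "2024-01-09", "2024-01-20"]
    ),
    "amount": [20.0, 5.0, 10.0, 30.0, 7.0],
})

# Time-based windows need timestamps sorted within each user.
ordered = events.sort_values(["user_id", "ts"])

# "7D" windows are right-aligned: each covers (t - 7 days, t] for the row at t.
rolled = (
    ordered.set_index("ts")
    .groupby("user_id")["amount"]
    .rolling("7D")
    .sum()
)

# rolled follows the same row order as `ordered` (users sorted, then time),
# so attach the original index labels and let assignment align on them.
events["amount_7d"] = pd.Series(rolled.to_numpy(), index=ordered.index)
```

The row order of events is unchanged because the assignment aligns on index labels. groupby + rolling generally runs faster than groupby + apply with a Python function, since the windowing is done in compiled code instead of once per group in Python; it is worth confirming with a timing on realistic data.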
Easy | Technical
When should you use vectorized pandas/NumPy operations instead of df.apply or Python loops? Give a concrete example where a loop is replaced by a vectorized expression using NumPy broadcasting or pandas builtins, and explain the performance differences and readability trade-offs.
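As an illustration, the sketch below computes the same column three ways and adds a small broadcasting case. The data and column names are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical order lines.
df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "qty": [1, 2, 3]})

# Python loop: one interpreter-level iteration per row.
totals_loop = [row.price * row.qty for row in df.itertuples(index=False)]

# Row-wise apply: shorter, but still calls a Python function per row.
totals_apply = df.apply(lambda row: row["price"] * row["qty"], axis=1)

# Vectorized: a single column-wise operation executed in compiled code.
df["total"] = df["price"] * df["qty"]

# Conditional logic without a loop: 10% off totals above 50.
df["discounted"] = np.where(df["total"] > 50, df["total"] * 0.9, df["total"])

# NumPy broadcasting: subtract each column's mean from every row.
arr = df[["price", "qty"]].to_numpy(dtype="float64")  # shape (3, 2)
centered = arr - arr.mean(axis=0)                      # (3, 2) minus (2,)
```

All three versions of the total give the same values; the difference is that the loop and apply pay Python overhead per row, which tends to dominate on large frames. Loops or apply remain reasonable when the logic is inherently row-dependent or calls code that cannot operate on whole arrays.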
