InterviewStack.io

Python Data Manipulation with Pandas Questions

Skills and concepts for extracting, transforming, and preparing tabular and array data in Python using libraries such as pandas and NumPy. Candidates should be comfortable reading data from common formats, working with pandas DataFrame and Series objects, selecting and filtering rows and columns, boolean indexing and query methods, groupby aggregations, sorting, merging and joining dataframes, reshaping data with pivot and melt, handling missing values, and converting and validating data types. Understand NumPy arrays and vectorized operations for efficient numeric computation, when to prefer vectorized approaches over Python loops, and how to write readable, reusable data processing functions. At higher levels, expect questions on memory efficiency, profiling and optimizing slow pandas operations, processing data that does not fit in memory, and designing robust pipelines that handle edge cases and mixed data types.

Easy · Technical
Explain the difference between pandas merge types: inner, left, right, and outer. Given DataFrames df_users and df_orders with key 'user_id', write code to produce a left join that keeps all users and adds an order_count from df_orders aggregated by user. Mention handling duplicate keys and overlapping column names.
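
One possible answer sketch follows. An inner join keeps only keys present in both frames, a left join keeps every row of the left frame, a right join keeps every row of the right frame, and an outer join keeps the union of keys. The sample rows and the `name` column are hypothetical and exist only to make the example runnable; the question itself specifies only `df_users`, `df_orders`, and `user_id`.

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
df_users = pd.DataFrame({"user_id": [1, 2, 3], "name": ["Ann", "Bob", "Cy"]})
df_orders = pd.DataFrame(
    {"user_id": [1, 1, 3, 4], "name": ["o1", "o2", "o3", "o4"]}
)

# Aggregate first so the right side has exactly one row per user_id.
# Merging the raw orders would repeat each user once per order.
order_counts = (
    df_orders.groupby("user_id").size().rename("order_count").reset_index()
)

# validate= raises MergeError if user_id is duplicated on either side,
# which surfaces unexpected duplicate keys instead of silently
# multiplying rows.
result = df_users.merge(
    order_counts, on="user_id", how="left", validate="one_to_one"
)

# Users without orders get NaN from the left join; turn that into 0.
result["order_count"] = result["order_count"].fillna(0).astype(int)

# Overlapping non-key columns ("name" here) get suffixes; explicit
# suffixes are easier to read than the default "_x" / "_y".
joined_raw = df_users.merge(
    df_orders, on="user_id", how="left", suffixes=("_user", "_order")
)
```

In this sketch, user 4 has orders but no user record, so a left join from `df_users` drops those orders; an outer join would keep them.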
Hard · System Design
Design a pandas-based approach to compute per-user aggregates from a 200GB CSV file that does not fit in memory. Provide a step-by-step plan and code sketches using pd.read_csv with chunksize, intermediate Parquet storage or serialized partial aggregates, and strategies to minimize memory and IO. Mention trade-offs and when to migrate to Dask or Spark.
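
A minimal sketch of the chunked part of such a plan is shown below. It assumes the aggregates wanted are sum, count, and mean of a hypothetical `amount` column per `user_id`; those column names, the helper name `aggregate_in_chunks`, and the tiny in-memory CSV are illustrative assumptions, not part of the question. The idea is to read only the needed columns with explicit dtypes, reduce each chunk to a small partial aggregate, and fold the partials into a running total so that memory grows with the number of distinct users rather than the number of rows.

```python
import io
import pandas as pd

def aggregate_in_chunks(source, chunksize=1_000_000):
    """Compute per-user sum, count, and mean of `amount` chunk by chunk."""
    reader = pd.read_csv(
        source,
        usecols=["user_id", "amount"],          # skip columns we do not need
        dtype={"user_id": "int64", "amount": "float64"},
        chunksize=chunksize,                    # yields DataFrames lazily
    )
    running = None
    for chunk in reader:
        partial = chunk.groupby("user_id")["amount"].agg(["sum", "count"])
        # Sums and counts are additive, so partials can be merged safely.
        # A mean cannot be merged directly; derive it at the end instead.
        if running is None:
            running = partial
        else:
            running = running.add(partial, fill_value=0)
    running["count"] = running["count"].astype("int64")
    running["mean"] = running["sum"] / running["count"]
    return running

# Tiny stand-in for the large file, for illustration only.
csv_text = "user_id,amount,note\n1,10.0,a\n2,5.0,b\n1,20.0,c\n3,1.0,d\n2,7.0,e\n"
totals = aggregate_in_chunks(io.StringIO(csv_text), chunksize=2)
```

If the running aggregate itself is too large for memory (very high user cardinality), the partials could instead be written to Parquet files partitioned by a hash of `user_id` and reduced one partition at a time. At that point the bookkeeping starts to resemble what Dask or Spark already provide, which is a reasonable signal to migrate.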
Medium · Technical
Explain why repeatedly using DataFrame.append (deprecated in pandas 1.4 and removed in pandas 2.0) or pd.concat inside a loop is slow. Provide an efficient pattern (code sketch) to accumulate rows or DataFrames inside a loop and produce a final DataFrame, including strategies like accumulating dictionaries/tuples or DataFrames in a list and doing a single pd.concat at the end.
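
A sketch of the usual pattern is shown below. Each `pd.concat` call allocates a new frame and copies all existing rows, so calling it once per iteration copies on the order of n² rows in total, while appending to a Python list is cheap and a single final concatenation copies each row once. The function names and the `id` / `square` columns are hypothetical.

```python
import pandas as pd

def build_from_records(n):
    """Accumulate plain dicts, then construct the DataFrame once."""
    rows = []
    for i in range(n):
        rows.append({"id": i, "square": i * i})  # cheap list append
    return pd.DataFrame(rows)

def build_from_frames(frames_iter):
    """Accumulate DataFrames in a list, then concatenate once."""
    frames = list(frames_iter)
    # ignore_index=True gives a fresh 0..n-1 index instead of
    # repeating each piece's own index.
    return pd.concat(frames, ignore_index=True)

records_df = build_from_records(3)
combined = build_from_frames(
    pd.DataFrame({"batch": [b] * 2, "value": [b * 10, b * 10 + 1]})
    for b in range(3)
)
```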
Medium · Technical
Given DataFrames 'left' and 'right', both containing columns 'user_id', 'score', and 'updated_at', write pandas code to merge them on 'user_id' and resolve overlapping columns such that you prefer non-null 'score' from the 'right' DataFrame and keep the most recent 'updated_at'. Provide an approach that can generalize to many overlapping columns without hardcoding per-column logic.
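
One way to generalize, sketched below, is to merge with explicit suffixes, detect the overlapping columns automatically, and apply a default resolver ("prefer the right value when it is non-null") unless a per-column override is supplied. The helper names `merge_resolve`, `prefer_right`, and `most_recent`, the choice of an outer join, and the sample rows are all assumptions made for illustration.

```python
import pandas as pd

def prefer_right(l, r):
    """Take the right value where it is non-null, else fall back to left."""
    return r.combine_first(l)

def most_recent(l, r):
    """Take the later of two values; a missing side yields the other side."""
    # Comparisons against NaT are False, so the r.isna() term is what
    # keeps the left value when the right one is missing.
    return l.where((l >= r) | r.isna(), r)

def merge_resolve(left, right, key, resolvers=None, how="outer"):
    resolvers = resolvers or {}
    overlap = [c for c in left.columns if c in right.columns and c != key]
    merged = left.merge(right, on=key, how=how, suffixes=("_left", "_right"))
    for col in overlap:
        l = merged.pop(f"{col}_left")
        r = merged.pop(f"{col}_right")
        merged[col] = resolvers.get(col, prefer_right)(l, r)
    return merged

# Hypothetical sample data, for illustration only.
left = pd.DataFrame({
    "user_id": [1, 2, 3],
    "score": [10.0, 20.0, None],
    "updated_at": pd.to_datetime(["2024-01-01", "2024-03-01", "2024-01-05"]),
})
right = pd.DataFrame({
    "user_id": [2, 3, 4],
    "score": [None, 35.0, 40.0],
    "updated_at": pd.to_datetime(["2024-02-01", "2024-02-10", "2024-04-01"]),
})

resolved = merge_resolve(
    left, right, key="user_id", resolvers={"updated_at": most_recent}
)
```

Only columns that need non-default behavior appear in `resolvers`, so adding more overlapping columns requires no extra code. This sketch assumes `user_id` is unique in each frame and that no existing column already ends in `_left` or `_right`.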
Medium · Technical
Describe why vectorized operations with NumPy or pandas are usually faster than Python-level loops. Provide a short Python benchmarking example comparing a vectorized operation on a NumPy array or pandas Series versus an equivalent Python loop and explain the differences in memory access and CPU utilization.
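
A short benchmark sketch is given below, using a sum of squares as the example operation (an arbitrary choice). The Python loop pays interpreter overhead on every element and converts each value into a separate Python object, whereas the vectorized call runs a compiled loop over one contiguous, typed buffer, which is friendlier to CPU caches and allows SIMD instructions. No timings are quoted here because they depend on the machine and library versions.

```python
import timeit
import numpy as np

values = np.arange(200_000, dtype=np.float64)

def loop_sum_squares(arr):
    """Python-level loop: one interpreter round trip per element."""
    total = 0.0
    for x in arr:          # each x is boxed into a NumPy scalar object
        total += x * x
    return float(total)

def vectorized_sum_squares(arr):
    """Single call into compiled code over a contiguous buffer."""
    return float(np.dot(arr, arr))

loop_time = timeit.timeit(lambda: loop_sum_squares(values), number=3)
vec_time = timeit.timeit(lambda: vectorized_sum_squares(values), number=3)
print(f"loop:       {loop_time:.4f} s for 3 runs")
print(f"vectorized: {vec_time:.4f} s for 3 runs")
```

Writing the vectorized version as `(arr * arr).sum()` would also work, but it allocates a temporary array the size of the input; `np.dot` avoids that extra memory traffic.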
