InterviewStack.io

Python Data Manipulation with Pandas Questions

Skills and concepts for extracting, transforming, and preparing tabular and array data in Python using libraries such as pandas and NumPy. Candidates should be comfortable reading data from common formats, working with pandas DataFrame and Series objects, selecting and filtering rows and columns, using boolean indexing and query methods, performing groupby aggregations, sorting, merging and joining DataFrames, reshaping data with pivot and melt, handling missing values, and converting and validating data types. They should also understand NumPy arrays and vectorized operations for efficient numeric computation, know when to prefer vectorized approaches over Python loops, and be able to write readable, reusable data processing functions. At higher levels, expect questions on memory efficiency, profiling and optimizing slow pandas operations, processing data that does not fit in memory, and designing robust pipelines that handle edge cases and mixed data types.

Medium · Technical
Describe how to reshape a long DataFrame into a wide format using pivot_table and then revert it back to long using melt. Include an example where you aggregate duplicate index/value pairs and specify fill values for missing combinations.
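One way to approach this question is sketched below. The `store`, `month`, and `sales` columns and their values are made up for illustration. The example uses `pivot_table` with `aggfunc="sum"` to collapse the duplicate (A, jan) rows and `fill_value=0` for the missing (B, feb) combination, then `melt` to return to long form.

```python
import pandas as pd

# Hypothetical long-format data: store A has two rows for "jan",
# and store B has no row for "feb".
long_df = pd.DataFrame(
    {
        "store": ["A", "A", "A", "B"],
        "month": ["jan", "jan", "feb", "jan"],
        "sales": [10, 5, 7, 3],
    }
)

# Long -> wide. aggfunc="sum" combines the duplicate (A, jan) rows;
# fill_value=0 fills the (B, feb) cell that has no source row.
wide = long_df.pivot_table(
    index="store",
    columns="month",
    values="sales",
    aggfunc="sum",
    fill_value=0,
)

# Wide -> long. reset_index() turns the "store" index back into a column
# so melt can use it as the identifier variable.
back = wide.reset_index().melt(
    id_vars="store", var_name="month", value_name="sales"
)
```

Note that the round trip is lossy: `back` holds one aggregated row per (store, month) pair, including the filled zero for (B, feb), rather than the original four rows.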
Hard · Technical
You must merge a very large transactional DataFrame with a dimension table that is small but repeated many times per transaction. Describe concrete pandas techniques to optimize this merge in memory and time, for example converting keys to categorical, setting index and joining on index, or performing the join outside pandas in a database. Provide rationale for each technique.
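Two of the in-pandas techniques named in the question can be sketched as follows. The tables, the `product_id` key, and the `category` attribute are hypothetical, and the data is small so the example runs quickly. Pushing the join into a database is not shown.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical small dimension table and a larger transaction table.
dim = pd.DataFrame(
    {"product_id": ["p1", "p2", "p3"], "category": ["toys", "books", "toys"]}
)
tx = pd.DataFrame(
    {
        "product_id": rng.choice(dim["product_id"].to_numpy(), size=10_000),
        "amount": rng.random(10_000),
    }
)

# Baseline: plain merge on string keys.
baseline = tx.merge(dim, on="product_id", how="left")

# Technique 1: give both key columns the same categorical dtype, so each
# repeated string is stored once and rows hold small integer codes.
key_dtype = pd.CategoricalDtype(dim["product_id"].unique())
tx_cat = tx.assign(product_id=tx["product_id"].astype(key_dtype))
dim_cat = dim.assign(
    product_id=dim["product_id"].astype(key_dtype),
    category=dim["category"].astype("category"),
)
merged_cat = tx_cat.merge(dim_cat, on="product_id", how="left")

# Technique 2: index the small table once and join against that index.
joined = tx.join(dim.set_index("product_id"), on="product_id")

# Compare the memory footprint of the transaction table before and after.
bytes_before = tx.memory_usage(deep=True).sum()
bytes_after = tx_cat.memory_usage(deep=True).sum()
```

The categorical conversion mainly pays off when keys are long strings with few distinct values. Both sides must share the same categories, otherwise pandas may fall back to a less efficient dtype for the key in the result.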
Hard · Technical
Describe how you would adapt a pandas-based preprocessing function to work out-of-core using Dask or PySpark and explain how to maintain compatibility with existing pandas unit tests. Provide code sketches showing dask.dataframe.map_partitions or Spark pandas UDF usage.
Medium · Technical
A pandas data processing step is very slow. Describe tools and techniques you would use to profile and identify bottlenecks in CPU and memory. Mention use of line_profiler, memory_profiler, pandas eval and query with numexpr, vectorization opportunities, and when to consider parallelization.
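The workflow can be illustrated with a small sketch. `line_profiler` and `memory_profiler` are third-party packages, so this example substitutes the standard library's `cProfile` and `tracemalloc` as stand-ins to stay self-contained. The `price` and `qty` columns and the helper names are hypothetical.

```python
import cProfile
import io
import pstats
import tracemalloc

import numpy as np
import pandas as pd


def slow_total(df):
    # Row-wise apply: calls a Python function once per row.
    return df.apply(lambda row: row["price"] * row["qty"], axis=1)


def fast_total(df):
    # Vectorized: a single multiplication over whole columns.
    return df["price"] * df["qty"]


def profile_cpu(func, *args):
    """Run func under cProfile; return its result and a short text report."""
    profiler = cProfile.Profile()
    profiler.enable()
    result = func(*args)
    profiler.disable()
    buf = io.StringIO()
    pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(5)
    return result, buf.getvalue()


def peak_memory(func, *args):
    """Run func under tracemalloc; return its result and peak traced bytes."""
    tracemalloc.start()
    result = func(*args)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, peak


rng = np.random.default_rng(0)
df = pd.DataFrame({"price": rng.random(2_000), "qty": rng.integers(1, 10, 2_000)})

slow, report = profile_cpu(slow_total, df)    # report lists the hottest calls
fast, peak_bytes = peak_memory(fast_total, df)
via_eval = df.eval("price * qty")             # uses numexpr if it is installed
```

In a case like this the report tends to show time spread across many per-row calls, which is the usual sign that a vectorized rewrite is worth trying before reaching for parallelization.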
Medium · Technical
Write Python code using pandas to read a very large CSV in chunks, compute per-user total spend across the whole file, and return a final DataFrame of user_id and total_spend. Explain how you accumulate partial aggregates efficiently and how to persist intermediate results if the process crashes.
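A minimal sketch of one possible answer follows. It assumes the CSV has a header row with an integer `user_id` column and a numeric spend column named `amount`; that column name, the function name, and the JSON checkpoint format are assumptions, not part of the question. Each chunk is reduced with `groupby` before being added to the running totals, so only one chunk plus one row per user is held in memory.

```python
import json
import os

import pandas as pd


def total_spend_by_user(csv_path, chunksize=100_000, checkpoint_path=None):
    """Sum `amount` per `user_id` without loading the whole CSV at once."""
    totals = pd.Series(dtype="float64", index=pd.Index([], dtype="int64"))
    rows_done = 0

    # Resume from an earlier run if a checkpoint file exists.
    if checkpoint_path and os.path.exists(checkpoint_path):
        with open(checkpoint_path) as fh:
            state = json.load(fh)
        rows_done = state["rows_done"]
        # JSON object keys are strings, so convert them back to integers.
        restored = {int(k): v for k, v in state["totals"].items()}
        totals = pd.Series(restored, dtype="float64")

    reader = pd.read_csv(
        csv_path,
        usecols=["user_id", "amount"],  # "amount" is an assumed column name
        dtype={"user_id": "int64", "amount": "float64"},
        skiprows=range(1, rows_done + 1),  # keep the header, skip finished rows
        chunksize=chunksize,
    )
    for chunk in reader:
        # Reduce the chunk first, then merge the small partial result.
        partial = chunk.groupby("user_id")["amount"].sum()
        totals = totals.add(partial, fill_value=0)
        rows_done += len(chunk)

        if checkpoint_path:
            # Write to a temp file and rename so a crash cannot leave a
            # half-written checkpoint behind.
            tmp_path = checkpoint_path + ".tmp"
            state = {
                "rows_done": rows_done,
                "totals": {str(k): float(v) for k, v in totals.items()},
            }
            with open(tmp_path, "w") as fh:
                json.dump(state, fh)
            os.replace(tmp_path, checkpoint_path)

    out = totals.rename("total_spend").rename_axis("user_id").reset_index()
    return out.astype({"user_id": "int64"}).sort_values("user_id", ignore_index=True)
```

Checkpointing after every chunk keeps the example short. With many distinct users, writing the checkpoint every N chunks, or storing it as Parquet, would likely be cheaper.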
