InterviewStack.io

Python Data Manipulation with Pandas Questions

Skills and concepts for extracting, transforming, and preparing tabular and array data in Python using libraries such as pandas and NumPy. Candidates should be comfortable reading data from common formats, working with pandas DataFrame and Series objects, selecting and filtering rows and columns, using boolean indexing and query methods, computing groupby aggregations, sorting, merging and joining DataFrames, reshaping data with pivot and melt, handling missing values, and converting and validating data types. They should also understand NumPy arrays and vectorized operations for efficient numeric computation, know when to prefer vectorized approaches over Python loops, and be able to write readable, reusable data processing functions. At higher levels, expect questions on memory efficiency, profiling and optimizing slow pandas operations, processing data that does not fit in memory, and designing robust pipelines that handle edge cases and mixed data types.

Easy, Technical
83 practiced
Explain with a short example how using NumPy arrays directly can speed up numeric computations in pandas. Given df with 10M rows and columns 'a' and 'b', show how to compute c = sqrt(a*a + b*b) using vectorized NumPy operations and compare conceptually to using a Python loop.
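One possible sketch of an answer, under a few assumptions: the data is synthetic, and the row count `n` is kept small so the snippet runs quickly (set it to 10_000_000 to match the question). The vectorized expression runs in compiled NumPy code over whole arrays, while the loop pays Python interpreter overhead for every element.

```python
import math

import numpy as np
import pandas as pd

# Synthetic stand-in for the question's df (assumption: float columns).
rng = np.random.default_rng(0)
n = 100_000  # use 10_000_000 to match the question
df = pd.DataFrame({"a": rng.random(n), "b": rng.random(n)})

# Pull out the underlying NumPy arrays and compute on whole arrays at once.
a = df["a"].to_numpy()
b = df["b"].to_numpy()
df["c"] = np.sqrt(a * a + b * b)  # np.hypot(a, b) gives the same values

# Python-loop version, run on a small slice only, for comparison.
slow = [
    math.sqrt(x * x + y * y)
    for x, y in zip(df["a"].head(1000), df["b"].head(1000))
]
```

Timing both versions on the same machine (for example with `time.perf_counter`) is the simplest way to see the difference in practice.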
Medium, Technical
57 practiced
Show how to efficiently join a large events table with a small reference table (lookup) in pandas. Discuss strategies when the lookup is tiny versus when it is moderately sized (e.g., 100K rows): using map/replace for tiny ones, merging with category conversion, and indexing strategies. Provide sample code.
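A minimal sketch of two of the strategies named in the question, using made-up tables: the column names (`event_id`, `country_code`, `country_name`) and the data are hypothetical. For a tiny lookup, a `Series.map` against an indexed Series avoids building a merged frame; for a larger lookup, a left merge with `validate="many_to_one"` also checks that lookup keys are unique.

```python
import pandas as pd

# Hypothetical events and lookup tables.
events = pd.DataFrame(
    {"event_id": [1, 2, 3, 4], "country_code": ["US", "DE", "US", "FR"]}
)
lookup = pd.DataFrame(
    {"country_code": ["US", "DE"], "country_name": ["United States", "Germany"]}
)

# Tiny lookup: turn it into a Series keyed by code and map it onto the column.
name_by_code = lookup.set_index("country_code")["country_name"]
events["country_name"] = events["country_code"].map(name_by_code)

# Moderately sized lookup: left merge keeps every event row;
# validate raises MergeError if the lookup has duplicate keys.
merged = events[["event_id", "country_code"]].merge(
    lookup, on="country_code", how="left", validate="many_to_one"
)
```

In both cases, codes missing from the lookup (here `"FR"`) come through as missing values rather than dropping the event row.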
Medium, System Design
73 practiced
Design a data ingestion step using pandas that reads input files, applies schema validation, logs row-level errors to a CSV for later inspection, and writes accepted records to parquet. Sketch the Python function signatures, error handling, and how you'd ensure idempotency of the ingestion job.
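One way the function signatures could look, as a sketch rather than a full design. Everything specific here is an assumption: the input is taken to be CSV, the schema (`user_id` as a whole number, `amount` as a non-negative float) is invented for illustration, and the names `validate` and `ingest_file` are hypothetical. Idempotency is approximated by deriving the output path from the input file name, skipping files that already have output, and writing through a temporary file followed by a rename.

```python
from pathlib import Path

import pandas as pd


def validate(df: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Split raw rows into (accepted, rejected) under a hypothetical schema."""
    user_id = pd.to_numeric(df["user_id"], errors="coerce")
    amount = pd.to_numeric(df["amount"], errors="coerce")

    # Row-level checks: unparseable or fractional ids, unparseable or negative amounts.
    bad_id = user_id.isna() | (user_id % 1 != 0)
    bad_amount = amount.isna() | (amount < 0)
    reason = bad_id.map({True: "bad user_id;", False: ""}) + bad_amount.map(
        {True: "bad amount;", False: ""}
    )

    ok = ~(bad_id | bad_amount)
    accepted = df.loc[ok].assign(
        user_id=user_id[ok].astype("int64"), amount=amount[ok].astype("float64")
    )
    rejected = df.loc[~ok].assign(error=reason[~ok])
    return accepted, rejected


def ingest_file(src: Path, out_dir: Path, err_dir: Path) -> None:
    """Ingest one CSV file; a rerun on the same file is a no-op."""
    out_path = out_dir / f"{src.stem}.parquet"
    if out_path.exists():  # already ingested, skip
        return
    raw = pd.read_csv(src, dtype=str)  # read as text, validate explicitly
    accepted, rejected = validate(raw)
    rejected.to_csv(err_dir / f"{src.stem}.errors.csv", index=False)
    tmp = out_path.with_name(out_path.name + ".tmp")
    accepted.to_parquet(tmp, index=False)  # needs pyarrow or fastparquet installed
    tmp.replace(out_path)  # rename last, so a crash leaves no partial output
```

Keeping `validate` free of file I/O makes the row-level rules easy to unit test on small in-memory frames.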
Medium, Technical
78 practiced
Explain how you would benchmark and compare pandas, Dask, and Modin for a numeric aggregation task on a dataset that is slightly larger than memory. What metrics would you measure, what environment factors matter, and how would you design fair experiments?
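A small timing harness of the kind such a comparison might start from, shown here with pandas only. The `benchmark` helper and the synthetic data are hypothetical. It covers wall-clock time with a warm-up run and repeated measurements; peak memory, CPU utilization, and disk I/O would need separate tooling. For a fair comparison with a lazy library such as Dask, the timed callable should include the step that triggers execution (for Dask, `.compute()`).

```python
import statistics
import time
from typing import Callable

import numpy as np
import pandas as pd


def benchmark(fn: Callable[[], object], repeats: int = 5, warmup: int = 1) -> dict:
    """Time fn() several times after a warm-up and summarize wall-clock seconds."""
    for _ in range(warmup):
        fn()  # warm caches and lazy imports before measuring
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return {
        "median_s": statistics.median(times),
        "min_s": min(times),
        "max_s": max(times),
    }


# Synthetic data, far smaller than the larger-than-memory case in the question.
rng = np.random.default_rng(0)
df = pd.DataFrame({"key": np.arange(100_000) % 100, "x": rng.random(100_000)})

result = benchmark(lambda: df.groupby("key")["x"].mean())
```

Reporting the median together with the min and max gives a rough sense of run-to-run variance, which matters when other processes share the machine.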
Medium, Technical
74 practiced
You have to compute month-over-month growth for many metrics in a wide DataFrame where columns are metrics per month (e.g., revenue_2024_01, revenue_2024_02...). Propose a pandas approach to compute percentage growth between consecutive months for each metric and pivot the result to a tidy long format for reporting.
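One possible approach, sketched on a tiny made-up frame: melt to long format first, parse the metric name and month out of the column names, then compute `pct_change` within each group. The `region` identifier column, the sample values, and the assumption that every column follows the `<metric>_<YYYY>_<MM>` pattern are illustrative, not given in the question.

```python
import pandas as pd

# Hypothetical wide frame: one row per region, one column per metric and month.
wide = pd.DataFrame(
    {
        "region": ["north", "south"],
        "revenue_2024_01": [100.0, 200.0],
        "revenue_2024_02": [110.0, 180.0],
        "revenue_2024_03": [121.0, 198.0],
        "orders_2024_01": [10.0, 20.0],
        "orders_2024_02": [12.0, 20.0],
        "orders_2024_03": [12.0, 25.0],
    }
)

# Wide -> long: one row per (region, original column).
long_df = wide.melt(id_vars="region", var_name="column", value_name="value")

# Split "revenue_2024_01" into metric, year, and month.
parts = long_df["column"].str.extract(
    r"^(?P<metric>.+)_(?P<year>\d{4})_(?P<month>\d{2})$"
)
long_df["metric"] = parts["metric"]
long_df["month"] = pd.to_datetime(parts["year"] + "-" + parts["month"] + "-01")

# Sort so consecutive rows within a group are consecutive months,
# then compute percentage growth within each (region, metric) series.
long_df = long_df.sort_values(["region", "metric", "month"])
long_df["growth_pct"] = (
    long_df.groupby(["region", "metric"])["value"].pct_change() * 100
)

tidy = long_df[["region", "metric", "month", "value", "growth_pct"]].reset_index(
    drop=True
)
```

The first month of each series has no predecessor, so its `growth_pct` is missing. If months can be absent from the data, the series would need reindexing to a complete monthly range before `pct_change`, otherwise growth is computed across the gap.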
