InterviewStack.io

Python Data Manipulation with Pandas Questions

Skills and concepts for extracting, transforming, and preparing tabular and array data in Python using libraries such as pandas and NumPy. Candidates should be comfortable reading data from common formats, working with pandas DataFrame and Series objects, selecting and filtering rows and columns, boolean indexing and query methods, groupby aggregations, sorting, merging and joining dataframes, reshaping data with pivot and melt, handling missing values, and converting and validating data types. Understand NumPy arrays and vectorized operations for efficient numeric computation, when to prefer vectorized approaches over Python loops, and how to write readable, reusable data processing functions. At higher levels, expect questions on memory efficiency, profiling and optimizing slow pandas operations, processing data that does not fit in memory, and designing robust pipelines that handle edge cases and mixed data types.

Medium, Technical
80 practiced
Explain how Apache Arrow and the pyarrow integration with pandas can improve performance for serialization and for zero-copy transfers between systems. Give examples of when to use the Feather format vs. Parquet with the pyarrow engine, and show how to convert efficiently between a pandas DataFrame and a pyarrow.Table. Mention compatibility considerations and limitations.
Hard, Technical
72 practiced
Case study: design an end-to-end preprocessing pipeline for tabular data used by multiple models. Requirements: accept CSV and Parquet inputs, detect schema drift, enforce data contracts, perform cleaning (missing values, dtype normalization), feature engineering, and output partitioned Parquet for training and inference. Using pandas as a core component, design the architecture and explain where to use supporting tools (Dask, DuckDB, Airflow/Prefect, pandera), testing practices, monitoring, scalability and reproducibility considerations.
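One small piece of such a design, the schema-drift check, can be sketched in plain pandas as follows. The expected schema, the column names, and the `detect_schema_drift` helper are all hypothetical and exist only for this illustration. A validation library such as pandera, which the question mentions, would normally cover far more, including value ranges and nullability.

```python
import pandas as pd

# Hypothetical data contract: column name -> expected dtype name.
EXPECTED_SCHEMA = {"user_id": "int64", "amount": "float64", "is_test": "bool"}


def detect_schema_drift(df: pd.DataFrame, expected: dict) -> dict:
    """Compare a frame's columns and dtypes against an expected schema."""
    actual = {col: str(dtype) for col, dtype in df.dtypes.items()}
    return {
        # Columns the contract requires but the input lacks.
        "missing": sorted(set(expected) - set(actual)),
        # Columns present in the input but absent from the contract.
        "unexpected": sorted(set(actual) - set(expected)),
        # Shared columns whose dtype differs: (expected, actual).
        "dtype_mismatch": {
            col: (expected[col], actual[col])
            for col in expected
            if col in actual and actual[col] != expected[col]
        },
    }


# Made-up batch: 'amount' arrives as integers, 'is_test' is gone,
# and an 'extra' column has appeared.
batch = pd.DataFrame({"user_id": [1, 2], "amount": [10, 20], "extra": [0.5, 0.1]})
drift = detect_schema_drift(batch, EXPECTED_SCHEMA)
```

In a pipeline, a non-empty report would typically fail the run or route the batch to quarantine before any cleaning or feature engineering takes place.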
Medium, Technical
81 practiced
Explain why repeatedly using DataFrame.append or pd.concat inside a loop is slow. Provide an efficient pattern (code sketch) to accumulate rows or DataFrames inside a loop and produce a final DataFrame, including strategies like accumulating dictionaries/tuples or DataFrames in a list and doing a single pd.concat at the end.
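The core of the answer is that every `pd.concat` call allocates a new frame and copies all existing rows into it, so growing a frame inside a loop does quadratic copying overall. `DataFrame.append` had the same cost and was removed in pandas 2.0. A minimal sketch of the accumulate-then-build pattern follows; the function names and the sample columns are made up for the illustration.

```python
import pandas as pd


def build_frame_from_rows(n_rows: int) -> pd.DataFrame:
    """Collect plain dicts in a list, then construct the DataFrame once."""
    rows = []
    for i in range(n_rows):
        # Appending to a Python list is cheap; no DataFrame is copied here.
        rows.append({"id": i, "value": i * 2})
    return pd.DataFrame(rows, columns=["id", "value"])


def concat_chunks(chunks) -> pd.DataFrame:
    """Collect per-iteration DataFrames in a list and concatenate once."""
    frames = []
    for chunk in chunks:
        # Any per-chunk processing would go here.
        frames.append(chunk)
    if not frames:
        # pd.concat raises ValueError when given an empty list.
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)


rows_df = build_frame_from_rows(3)
chunks_df = concat_chunks([pd.DataFrame({"a": [1, 2]}), pd.DataFrame({"a": [3]})])
```

Both helpers copy the data once at the end, so the total work grows linearly with the number of rows.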
Easy, Technical
69 practiced
Write a pandas snippet that groups a DataFrame 'transactions' by 'user_id' and computes total amount, transaction count, and average amount per user. Return a DataFrame with columns ['user_id','total','count','avg'] and ensure results are sorted by 'total' descending. Use method chaining where appropriate.
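One possible answer uses named aggregation inside a single method chain. The sample `transactions` data is made up for the illustration, and the sketch assumes the amount column is named 'amount', which the question implies but does not state.

```python
import pandas as pd

# Hypothetical sample data; the 'amount' column name is an assumption.
transactions = pd.DataFrame({
    "user_id": [1, 2, 1, 3, 2, 1],
    "amount": [10.0, 5.0, 20.0, 7.5, 15.0, 30.0],
})

summary = (
    transactions
    .groupby("user_id", as_index=False)   # keep user_id as a regular column
    .agg(
        total=("amount", "sum"),
        count=("amount", "count"),        # counts non-null amounts only
        avg=("amount", "mean"),
    )
    .sort_values("total", ascending=False)
    .reset_index(drop=True)
)
```

If rows with a missing amount should still count as transactions, `"size"` can replace `"count"` in the aggregation.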
Medium, Technical
68 practiced
You are given a DataFrame 'df' where 'id' values appear numeric but contain trailing spaces and 'amount' is stored as object with commas and currency symbols. Write a robust pandas routine to clean and convert 'id' to integer (nullable if needed) and 'amount' to float, handling missing values and producing a small report of rows that failed conversion.
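A minimal sketch of one such routine is shown below. It rests on several assumptions that are not in the question: amounts use commas as thousands separators and a dot as the decimal point, the only currency symbols are $, € and £, and an id that is not a whole number counts as a failure. The `clean_id_amount` name and the sample data are made up for the illustration.

```python
import pandas as pd


def _clean_text(series: pd.Series) -> pd.Series:
    """Strip surrounding whitespace while keeping missing values missing."""
    text = series.astype(str).str.strip()
    return text.where(series.notna())


def clean_id_amount(df: pd.DataFrame):
    """Return (cleaned copy, report of rows that failed conversion)."""
    out = df.copy()

    # id: strip whitespace, parse as a number, accept whole numbers only.
    id_num = pd.to_numeric(_clean_text(df["id"]), errors="coerce")
    is_whole = id_num.notna() & (id_num % 1 == 0)
    out["id"] = id_num.where(is_whole).astype("Int64")  # nullable integer

    # amount: drop assumed currency symbols, commas, and spaces, then parse.
    amount_text = _clean_text(df["amount"]).str.replace(r"[$€£,\s]", "", regex=True)
    out["amount"] = pd.to_numeric(amount_text, errors="coerce").astype("float64")

    # A failure is a value that was present in the input but did not convert.
    id_failed = df["id"].notna() & out["id"].isna()
    amount_failed = df["amount"].notna() & out["amount"].isna()
    report = df.loc[id_failed | amount_failed, ["id", "amount"]].assign(
        id_failed=id_failed, amount_failed=amount_failed
    )
    return out, report


# Made-up input with one bad id, one bad amount, and genuine missing values.
raw = pd.DataFrame({
    "id": ["1 ", "2  ", "x3", None, "5 "],
    "amount": ["$1,234.50", "n/a", "€20", "7", None],
})
clean, report = clean_id_amount(raw)
```

The report keeps the original, uncleaned values next to the two failure flags, so bad rows can be inspected or sent back to the data owner. Values that were missing in the input stay missing and are not reported as failures.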
