InterviewStack.io

Python Data Manipulation with Pandas Questions

Skills and concepts for extracting, transforming, and preparing tabular and array data in Python using libraries such as pandas and NumPy. Candidates should be comfortable reading data from common formats, working with pandas DataFrame and Series objects, selecting and filtering rows and columns, boolean indexing and query methods, groupby aggregations, sorting, merging and joining dataframes, reshaping data with pivot and melt, handling missing values, and converting and validating data types. Understand NumPy arrays and vectorized operations for efficient numeric computation, when to prefer vectorized approaches over Python loops, and how to write readable, reusable data processing functions. At higher levels, expect questions on memory efficiency, profiling and optimizing slow pandas operations, processing data that does not fit in memory, and designing robust pipelines that handle edge cases and mixed data types.
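As a small illustration of the point about preferring vectorized operations over Python loops, the sketch below computes the same per-item totals both ways. The arrays `prices` and `quantities` and the discount rule are made-up values for illustration only.

```python
import numpy as np

# Hypothetical sample data, invented for this sketch.
prices = np.array([10.0, 12.5, 8.0, 20.0])
quantities = np.array([3, 1, 4, 2])

# Loop version: one Python-level multiplication per element.
loop_totals = []
for price, quantity in zip(prices, quantities):
    loop_totals.append(price * quantity)

# Vectorized version: a single elementwise operation executed inside NumPy.
vector_totals = prices * quantities

# Conditional logic can also stay vectorized with np.where
# (here: an assumed 10% discount on totals above 30).
discounted = np.where(vector_totals > 30, vector_totals * 0.9, vector_totals)
```

On arrays this small the difference is negligible; the vectorized form usually pays off as the arrays grow, because the per-element work no longer runs in the Python interpreter.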

Easy | Technical
58 practiced
Describe common strategies in pandas to detect and handle missing values. Give code examples for: counting missing values per column and per row, replacing sentinel values like -1 or 'NA' with NaN, dropping or imputing missing values, and using fillna with forward/backward fill. Explain when to prefer dropna vs imputation and pitfalls with inplace operations.
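One possible outline of an answer, as a minimal sketch. The DataFrame `df` and its columns (`age`, `city`, `score`) are hypothetical, and the choice of median imputation is only an example, not a recommendation for every dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data: -1 and "NA" are sentinel values standing in for "missing".
df = pd.DataFrame({
    "age": [34, -1, 29, -1],
    "city": ["Oslo", "NA", "Lima", "Pune"],
    "score": [0.5, np.nan, np.nan, 0.9],
})

# Replace sentinels with NaN, scoped per column so a legitimate -1
# in some other column would not be touched.
cleaned = df.assign(
    age=df["age"].replace(-1, np.nan),
    city=df["city"].replace("NA", np.nan),
)

# Count missing values per column and per row.
per_column = cleaned.isna().sum()
per_row = cleaned.isna().sum(axis=1)

# Drop rows where a required column is missing...
dropped = cleaned.dropna(subset=["age"])

# ...or impute instead (median chosen here purely as an example).
imputed_age = cleaned["age"].fillna(cleaned["age"].median())

# Forward and backward fill. Recent pandas releases deprecate
# fillna(method="ffill") in favor of the dedicated ffill()/bfill() methods.
ffilled = cleaned["score"].ffill()
bfilled = cleaned["score"].bfill()

# Pitfall: inplace=True on a selection such as df[mask].fillna(0, inplace=True)
# can modify a temporary copy. Assigning the result back is easier to reason about.
cleaned["score"] = cleaned["score"].fillna(0.0)
```

As a rule of thumb, dropping is reasonable when few rows are affected and the values are missing at random, while imputation preserves rows at the cost of introducing assumed values. Forward and backward fill mainly make sense for ordered data such as time series.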
Hard | Technical
83 practiced
You observe a pandas ETL step that groups and merges taking 10x longer than expected on a 10M-row DataFrame. Describe how you would profile the code to find bottlenecks (tools and methods), common causes of slowness in pandas operations, and specific optimizations you would try (e.g., changing dtypes, using categorical, pre-sorting, index usage). Provide a prioritized action list.
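A first profiling pass might look like the sketch below. It uses the standard-library `cProfile` on a stand-in function and then checks how much memory a dtype change saves. The function `etl_step`, the frames `events` and `stores`, and the 200,000-row size are all invented for illustration; a real investigation would profile the actual step on a representative sample.

```python
import cProfile
import io
import pstats

import numpy as np
import pandas as pd

# Hypothetical stand-in data (much smaller than the 10M rows in the question).
rng = np.random.default_rng(0)
n = 200_000
events = pd.DataFrame({
    "store": rng.choice(["north", "south", "east", "west"], size=n),
    "amount": rng.random(n),
})
stores = pd.DataFrame({
    "store": ["north", "south", "east", "west"],
    "manager": ["a", "b", "c", "d"],
})


def etl_step(events, stores):
    """Hypothetical group-then-merge step to be profiled."""
    totals = events.groupby("store")["amount"].sum().reset_index()
    return totals.merge(stores, on="store", how="left")


# 1. Profile to see where the time goes before changing anything.
profiler = cProfile.Profile()
profiler.enable()
result = etl_step(events, stores)
profiler.disable()

report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(10)
print(report.getvalue())

# 2. Check memory per column; string columns are a common culprit.
before = events.memory_usage(deep=True).sum()

# 3. Try cheaper dtypes: category for low-cardinality strings,
#    float32 where the reduced precision is acceptable.
compact = events.assign(
    store=events["store"].astype("category"),
    amount=events["amount"].astype("float32"),
)
after = compact.memory_usage(deep=True).sum()
print(f"memory: {before:,} bytes -> {after:,} bytes")
```

Measuring first keeps the action list honest: dtype changes are usually cheap to try, while restructuring merges or pre-sorting is worth attempting only once the profile points at those calls.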
Medium | Technical
80 practiced
Explain how Apache Arrow and the pyarrow integration with pandas can improve serialization performance and enable zero-copy transfers between systems. Give examples of when to use the Feather format vs Parquet with the pyarrow engine, and show how to convert between a pandas DataFrame and a pyarrow.Table efficiently. Mention compatibility considerations and limitations.
Medium | Technical
70 practiced
Create a pivot table from a 'sales' DataFrame grouped by 'region' and 'year' with aggregated metrics: sum of revenue, mean price, and count of transactions. The resulting DataFrame should have flat column names like 'revenue_sum', 'price_mean', 'transactions_count'. Provide pandas code that performs the aggregation and flattens MultiIndex column names.
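A minimal sketch of one way to answer this. The sample rows are invented, and the column holding one entry per transaction is assumed to be called `transactions` so that the flattened name matches the one requested in the question.

```python
import pandas as pd

# Hypothetical sample data; "transactions" is an assumed column name
# holding one identifier per transaction.
sales = pd.DataFrame({
    "region": ["East", "East", "West", "West", "West"],
    "year": [2023, 2023, 2023, 2024, 2024],
    "revenue": [100.0, 150.0, 200.0, 50.0, 70.0],
    "price": [10.0, 20.0, 40.0, 5.0, 7.0],
    "transactions": [1, 2, 3, 4, 5],
})

# Passing lists of functions produces MultiIndex columns such as ("revenue", "sum").
summary = sales.groupby(["region", "year"]).agg(
    {"revenue": ["sum"], "price": ["mean"], "transactions": ["count"]}
)

# Flatten each (column, function) pair into "column_function".
summary.columns = ["_".join(col) for col in summary.columns]
summary = summary.reset_index()

# Alternative: named aggregation yields flat names directly.
named = sales.groupby(["region", "year"], as_index=False).agg(
    revenue_sum=("revenue", "sum"),
    price_mean=("price", "mean"),
    transactions_count=("transactions", "count"),
)
```

Named aggregation avoids the flattening step entirely, so it is often the simpler choice when the output names are known up front; the join-based flattening is useful when the MultiIndex comes from `pivot_table` or from a list of functions built dynamically.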
Hard | Technical
72 practiced
Describe best practices to integrate pandas preprocessing steps with scikit-learn pipelines. Cover feature engineering, handling categorical variables, scaling, imputation, and ensuring identical transforms at training and inference. Provide a code outline using sklearn.compose.ColumnTransformer and FunctionTransformer for custom pandas transforms.
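The code outline requested here could look like the sketch below, assuming scikit-learn is installed. The column names (`price`, `units`, `channel`), the helper `add_price_per_unit`, the tiny training set, and the choice of `LogisticRegression` are all hypothetical and exist only to make the outline runnable.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler


def add_price_per_unit(frame):
    """Hypothetical stateless pandas feature-engineering step."""
    out = frame.copy()
    out["price_per_unit"] = out["price"] / out["units"]
    return out


numeric_cols = ["price", "units", "price_per_unit"]
categorical_cols = ["channel"]

# Imputation, scaling, and encoding are fitted on training data only
# and then reused unchanged at inference time.
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        # handle_unknown="ignore" encodes unseen categories as all zeros.
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

model = Pipeline([
    ("features", FunctionTransformer(add_price_per_unit)),
    ("preprocess", preprocess),
    ("clf", LogisticRegression()),
])

# Hypothetical training data with missing values in both column types.
train = pd.DataFrame({
    "price": [10.0, 12.0, np.nan, 30.0, 28.0, 35.0],
    "units": [1, 2, 1, 3, 2, 5],
    "channel": ["web", "store", "web", np.nan, "store", "web"],
})
labels = [0, 0, 0, 1, 1, 1]
model.fit(train, labels)

# Inference on a row with a category never seen during training.
new_rows = pd.DataFrame({"price": [11.0], "units": [1], "channel": ["phone"]})
predictions = model.predict(new_rows)
```

Keeping every step inside one `Pipeline` means the fitted object can be serialized and applied as a unit, which is what guarantees identical transforms at training and inference. Custom pandas steps wrapped in `FunctionTransformer` should be stateless; anything that learns from the data belongs in a transformer with its own `fit`.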
