Pandas Data Manipulation and Analysis Questions
Covers data manipulation and analysis with the Pandas library: reading data from CSV or SQL sources, selecting and filtering rows and columns, boolean indexing, loc and iloc, groupby aggregations, merging and concatenating DataFrames, handling missing values with dropna and fillna, applying transformations via apply and vectorized operations, reshaping with pivot and melt, and performance considerations for large DataFrames. Also includes converting SQL-style logic into Pandas workflows for exploratory data analysis and feature engineering.
Medium | Technical
72 practiced
You detect duplicate orders in a dataset. Explain different deduplication strategies using pandas: drop_duplicates with keep='first' or keep='last', sorting before dropping, deduplicating on a subset of columns, and marking duplicates for manual review. Provide code examples and discuss when duplicates should be removed automatically and when they should be flagged for business review.
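As a starting point, the strategies named in this question could be sketched as follows. The `orders` frame and its column names (`order_id`, `amount`, `updated_at`) are hypothetical, and the rule that conflicting amounts go to review is only one possible business policy.

```python
import pandas as pd

# Hypothetical order data; all column names are assumptions for illustration.
orders = pd.DataFrame({
    "order_id": [1, 1, 2, 3, 3],
    "customer": ["a", "a", "b", "c", "c"],
    "amount": [10.0, 10.0, 25.0, 40.0, 45.0],
    "updated_at": pd.to_datetime(
        ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04"]
    ),
})

# 1. Exact duplicates across all columns: keep the first occurrence.
exact = orders.drop_duplicates(keep="first")

# 2. Dedup on a key subset, keeping the most recent row per order_id.
#    Sorting first makes keep="last" mean "latest update".
latest = (
    orders.sort_values("updated_at")
    .drop_duplicates(subset=["order_id"], keep="last")
    .sort_index()
)

# 3. Flag instead of drop: keep=False marks every row in a duplicate group.
flagged = orders.assign(
    is_dup_key=orders.duplicated(subset=["order_id"], keep=False)
)

# Same key but conflicting amounts: a candidate for manual review
# rather than automatic removal.
conflicts = (
    flagged[flagged["is_dup_key"]].groupby("order_id")["amount"].nunique() > 1
)
review_ids = conflicts[conflicts].index.tolist()
```

Here exact copies (order 1) are safe to drop automatically, while order 3 has two different amounts, so it is listed in `review_ids` instead of being silently collapsed.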
Hard | Technical
56 practiced
You must optimize a pandas-heavy notebook that takes 2 hours to run: describe a systematic approach to reduce runtime using profiling, vectorization, reducing data volume, converting to categorical dtypes, chunking, and possibly moving heavy ops to a database or Spark. Provide prioritized steps and sample code snippets for the highest-impact changes.
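A minimal sketch of three changes that often pay off once profiling (for example with `%timeit` or `cProfile`) has identified the slow cells: replacing row-wise `apply` with vectorized arithmetic, converting low-cardinality strings to `category`, and aggregating in chunks. The data, column names, and chunk size below are made up for illustration, and an in-memory buffer stands in for a large CSV file.

```python
import io

import numpy as np
import pandas as pd

# Synthetic data standing in for the real notebook input (an assumption).
rng = np.random.default_rng(0)
n = 20_000
df = pd.DataFrame({
    "country": rng.choice(["US", "DE", "IN", "BR"], size=n),
    "price": rng.uniform(1, 100, size=n),
    "qty": rng.integers(1, 10, size=n),
})

# Slow, row-wise version for comparison:
#   df.apply(lambda r: r["price"] * r["qty"], axis=1)
# Vectorized equivalent operating on whole columns at once:
df["revenue"] = df["price"] * df["qty"]

# Low-cardinality strings usually shrink a lot as categoricals.
bytes_before = df["country"].memory_usage(deep=True)
df["country"] = df["country"].astype("category")
bytes_after = df["country"].memory_usage(deep=True)

# Chunked aggregation: compute partial sums per chunk, then combine,
# so the full file never has to sit in memory at once.
csv_buffer = io.StringIO(df.to_csv(index=False))
partials = [
    chunk.groupby("country")["revenue"].sum()
    for chunk in pd.read_csv(csv_buffer, chunksize=5_000)
]
total_by_country = pd.concat(partials).groupby(level=0).sum()
```

Sums combine cleanly across chunks; aggregations such as medians or distinct counts do not, which is one signal that the work may belong in a database or Spark instead.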
Hard | Technical
72 practiced
Design schema and serialization choices for a DataFrame that must be shared across processes and persisted to disk daily: discuss Parquet vs. Feather vs. CSV, how to preserve categorical dtypes and datetimes, and how to version schema changes. Provide code to write and read Parquet preserving categories and an approach to handle evolving schemas.
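One possible shape for an answer is sketched below. It assumes the `pyarrow` engine is installed, and the `SCHEMAS` registry, the column names, and the versioned file-naming convention are all hypothetical. The idea is to normalize every frame to a declared schema on both write and read, so files written under an older version gain the newer columns when loaded.

```python
from pathlib import Path

import pandas as pd

# Hypothetical schema registry: version -> {column: dtype}.
SCHEMAS = {
    1: {"user_id": "int64", "plan": "category", "signup": "datetime64[ns]"},
    2: {"user_id": "int64", "plan": "category", "signup": "datetime64[ns]",
        "region": "category"},  # nullable column added in v2
}
CURRENT_VERSION = 2


def conform(df, version=CURRENT_VERSION):
    """Add columns missing from older data and cast to the target dtypes.

    Missing columns are filled with NA, so this sketch only suits new
    columns whose dtype can hold missing values (category here).
    """
    schema = SCHEMAS[version]
    out = df.copy()
    for col, dtype in schema.items():
        if col not in out.columns:
            out[col] = pd.NA
        out[col] = out[col].astype(dtype)
    return out[list(schema)]


def write_daily(df, directory, day):
    """Write one day's snapshot; the version goes into the file name."""
    path = Path(directory) / f"users_v{CURRENT_VERSION}_{day}.parquet"
    conform(df).to_parquet(path, engine="pyarrow", index=False)
    return path


def read_daily(path):
    """Read a snapshot of any version and upgrade it to the current schema."""
    return conform(pd.read_parquet(path, engine="pyarrow"))


# A frame shaped like a v1 file, upgraded in memory to the v2 schema.
legacy = pd.DataFrame({
    "user_id": [1, 2],
    "plan": ["free", "pro"],
    "signup": pd.to_datetime(["2024-01-01", "2024-02-01"]),
})
upgraded = conform(legacy)
```

Parquet stores datetimes natively, and pandas dtype metadata written by the engine generally lets categorical columns come back as categoricals; re-applying `conform` on read is a cheap guard in case a file was produced by another tool. CSV offers neither guarantee, since dtypes must be re-inferred or re-declared on every load.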
Easy | Technical
52 practiced
Given a sample DataFrame df with columns ['id', 'name', 'age', 'signup_date', 'score'], demonstrate with code and explanations the differences between df.loc, df.iloc, and chained indexing. Show examples of selecting rows 10-20, selecting by a boolean condition (age > 30), and selecting columns by label and by integer position. Explain why chained indexing can be dangerous and how to avoid it.
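A short sketch of the selections this question asks for, using a synthetic `df` with the listed columns (the values are invented). It relies on the default integer index, where labels and positions happen to coincide, which is exactly where the difference between `loc` and `iloc` slicing is easy to miss.

```python
import numpy as np
import pandas as pd

n = 50
df = pd.DataFrame({
    "id": np.arange(100, 100 + n),
    "name": [f"user{i}" for i in range(n)],
    "age": np.arange(20, 20 + n),
    "signup_date": pd.date_range("2024-01-01", periods=n, freq="D"),
    "score": np.linspace(0, 1, n),
})

# iloc slices by position and excludes the end, like a Python list.
pos_10_to_19 = df.iloc[10:20]   # 10 rows
pos_10_to_20 = df.iloc[10:21]   # 11 rows

# loc slices by label and includes the end label.
label_10_to_20 = df.loc[10:20]  # 11 rows

# Boolean condition plus column labels in a single loc call.
over_30 = df.loc[df["age"] > 30, ["name", "age"]]

# Columns by integer position: first and third columns.
cols_by_pos = df.iloc[:, [0, 2]]

# Chained indexing such as
#     df[df["age"] > 30]["score"] = 0.0
# first builds an intermediate object and then assigns into it, so the
# write may land on a temporary copy and leave df unchanged.
# A single loc call selects and assigns in one step:
df.loc[df["age"] > 30, "score"] = 0.0
```

The single `loc` assignment is the usual fix for chained indexing because pandas sees the row selector and the column together and can write to the original frame.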
Hard | Technical
68 practiced
As a data scientist delivering features to production, how do you design unit and integration tests for pandas transformations? Describe a testing strategy that covers deterministic inputs, edge cases (NaNs, empty frames), schema validation, and performance smoke tests. Give examples of pytest test cases for a transform function that imputes and encodes features.
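The test cases below are one way such a suite might look. The `transform` function, the `PLANS` category list, and the five-second budget are hypothetical stand-ins for a real feature pipeline. The tests are plain `test_*` functions with bare asserts, which pytest collects by name without any pytest-specific imports.

```python
import time

import numpy as np
import pandas as pd
import pandas.testing as pdt

PLANS = ["free", "pro"]  # hypothetical fixed category set
OUTPUT_COLUMNS = ["age", "plan_free", "plan_pro"]


def transform(df):
    """Impute missing age with the median and one-hot encode plan."""
    out = df.copy()
    out["age"] = out["age"].astype("float64")
    out["age"] = out["age"].fillna(out["age"].median())
    # Fixing the categories keeps the output columns stable across inputs.
    plan = pd.Series(
        pd.Categorical(out["plan"], categories=PLANS), index=out.index
    )
    dummies = pd.get_dummies(plan, prefix="plan", dtype="int8")
    return pd.concat([out[["age"]], dummies], axis=1)


def test_deterministic_input():
    df = pd.DataFrame(
        {"age": [20, np.nan, 40], "plan": ["free", "pro", "free"]}
    )
    expected = pd.DataFrame({
        "age": [20.0, 30.0, 40.0],
        "plan_free": np.array([1, 0, 1], dtype="int8"),
        "plan_pro": np.array([0, 1, 0], dtype="int8"),
    })
    pdt.assert_frame_equal(transform(df), expected)


def test_schema_is_stable_for_missing_and_unseen_plans():
    df = pd.DataFrame({"age": [25.0, 35.0], "plan": [None, "enterprise"]})
    out = transform(df)
    assert list(out.columns) == OUTPUT_COLUMNS
    assert out[["plan_free", "plan_pro"]].to_numpy().sum() == 0


def test_empty_frame_keeps_columns():
    df = pd.DataFrame({
        "age": pd.Series(dtype="float64"),
        "plan": pd.Series(dtype="object"),
    })
    out = transform(df)
    assert out.empty
    assert list(out.columns) == OUTPUT_COLUMNS


def test_input_is_not_mutated():
    df = pd.DataFrame({"age": [np.nan, 30.0], "plan": ["free", "pro"]})
    before = df.copy()
    transform(df)
    pdt.assert_frame_equal(df, before)


def test_performance_smoke():
    n = 50_000
    df = pd.DataFrame({"age": np.arange(n, dtype="float64"),
                       "plan": ["free", "pro"] * (n // 2)})
    start = time.perf_counter()
    transform(df)
    # Generous, hypothetical budget: catches accidental row-wise loops only.
    assert time.perf_counter() - start < 5.0
```

Integration tests would typically run the same `transform` against a small fixture file read through the real I/O path, asserting on the output schema rather than on every value.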