InterviewStack.io

Python Data Manipulation with Pandas Questions

Skills and concepts for extracting, transforming, and preparing tabular and array data in Python using libraries such as pandas and NumPy. Candidates should be comfortable reading data from common formats, working with pandas DataFrame and Series objects, selecting and filtering rows and columns, boolean indexing and query methods, groupby aggregations, sorting, merging and joining DataFrames, reshaping data with pivot and melt, handling missing values, and converting and validating data types. They should also understand NumPy arrays and vectorized operations for efficient numeric computation, know when to prefer vectorized approaches over Python loops, and be able to write readable, reusable data processing functions. At higher levels, expect questions on memory efficiency, profiling and optimizing slow pandas operations, processing data that does not fit in memory, and designing robust pipelines that handle edge cases and mixed data types.

Medium, Technical
69 practiced
In Python using pandas, how would you read a very large CSV stored on S3 into a DataFrame while minimizing memory usage and parsing timestamps? Explain and provide sample code showing use of read_csv parameters such as dtype, parse_dates, infer_datetime_format, usecols, memory_map, and chunksize. Describe the tradeoffs of chunked reading versus reading with the pyarrow engine and converting to Parquet.
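One possible starting point for this question is sketched below. It is a minimal illustration, not a reference answer: the column names and the in-memory sample are hypothetical stand-ins for the real file, and an "s3://bucket/key.csv" URL could be passed to read_csv instead if the optional s3fs package is installed. Two of the parameters named in the question are left out on purpose: infer_datetime_format is deprecated as of pandas 2.0, and memory_map only applies when a local file path is given.

```python
import io

import pandas as pd

# Hypothetical sample standing in for a large CSV object on S3.
SAMPLE_CSV = io.StringIO(
    "ts,sensor_id,value,notes\n"
    "2024-01-01 00:00:00,1,0.5,ok\n"
    "2024-01-01 00:01:00,2,1.5,ok\n"
    "2024-01-01 00:02:00,1,2.5,late\n"
    "2024-01-01 00:03:00,2,3.5,ok\n"
)


def sum_by_sensor(source, chunksize=2):
    """Stream the CSV in chunks and combine per-chunk partial sums."""
    reader = pd.read_csv(
        source,
        usecols=["ts", "sensor_id", "value"],  # skip columns that are not needed
        dtype={"sensor_id": "int32", "value": "float32"},  # narrow numeric types
        parse_dates=["ts"],  # parse timestamps while reading
        chunksize=chunksize,  # returns an iterator of DataFrames
    )
    # Only one chunk plus the small partial results is held in memory at a time.
    partials = [chunk.groupby("sensor_id")["value"].sum() for chunk in reader]
    return pd.concat(partials).groupby(level=0).sum()


totals = sum_by_sensor(SAMPLE_CSV)
```

The chunk size of 2 is only for demonstration; a real workload would use a much larger value. Chunking keeps peak memory bounded but forces every computation to be expressed as combinable partial results, whereas a one-time conversion to Parquet pays the parsing cost once and allows later reads of selected columns.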
Hard, Technical
81 practiced
Schema changes occur frequently and a downstream model expects stable data types. Propose a robust pandas-based casting strategy to handle schema evolution safely across releases, including use of pandas nullable dtypes, explicit casting maps, and detection of breaking changes during pipeline runs.
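A casting strategy of the kind this question asks about might look like the following sketch. The schema, column names, and the cast_to_schema helper are all hypothetical; the idea is to keep one explicit casting map with pandas nullable dtypes, fail loudly on missing columns or impossible casts, and report unexpected new columns instead of silently passing them downstream.

```python
import pandas as pd

# Hypothetical schema the downstream model expects, using nullable dtypes
# so that missing values do not force integer columns to float or object.
EXPECTED_SCHEMA = {"user_id": "Int64", "score": "Float64", "country": "string"}


def cast_to_schema(df, schema=EXPECTED_SCHEMA):
    """Return (casted frame, list of unexpected columns); raise on breaking changes."""
    missing = sorted(set(schema) - set(df.columns))
    if missing:
        # A dropped column is treated as a breaking change.
        raise ValueError(f"missing columns: {missing}")

    # New columns are not fatal here, but they are reported to the caller.
    extra = sorted(set(df.columns) - set(schema))

    out = df[list(schema)].copy()
    for col, dtype in schema.items():
        try:
            out[col] = out[col].astype(dtype)
        except (ValueError, TypeError) as exc:
            raise ValueError(f"column {col!r} cannot be cast to {dtype}: {exc}") from exc
    return out, extra


raw = pd.DataFrame(
    {
        "user_id": [1, None, 3],
        "score": [0.5, 1.0, None],
        "country": ["US", None, "DE"],
        "new_col": [1, 2, 3],
    }
)
clean, unexpected = cast_to_schema(raw)
```

In a pipeline, the list of unexpected columns could be logged or compared against the previous release, so that additive changes are visible while removals and type changes stop the run.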
Easy, Technical
76 practiced
You have a DataFrame of sensor readings with many missing values across columns. Describe strategies in Python pandas to handle missing data for exploratory analysis and for preparing features for a model. Include examples for dropna, fillna, interpolation, forward fill, and groupwise imputation in code.
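The methods named in this question can be compared side by side on a tiny frame. The data and column names below are hypothetical, and which strategy is appropriate depends on why the values are missing; this is a sketch of the calls, not a recommendation.

```python
import numpy as np
import pandas as pd

# Hypothetical sensor readings with gaps.
df = pd.DataFrame(
    {
        "sensor": ["a", "a", "a", "b", "b", "b"],
        "temp": [20.0, np.nan, 22.0, np.nan, 30.0, 32.0],
    }
)

# Drop rows where the reading is missing (often fine for quick exploration).
dropped = df.dropna(subset=["temp"])

# Replace missing readings with a constant.
filled_const = df["temp"].fillna(0.0)

# Linear interpolation over the whole column; note that it ignores sensor
# boundaries, so row 3 is interpolated between sensor "a" and sensor "b".
interpolated = df["temp"].interpolate(method="linear")

# Forward fill within each sensor, so values never leak across sensors.
forward = df.groupby("sensor")["temp"].ffill()

# Groupwise imputation: fill each gap with the mean of its own sensor.
group_mean = df["temp"].fillna(df.groupby("sensor")["temp"].transform("mean"))
```

The forward fill leaves the first reading of sensor "b" missing because there is no earlier value in that group, which is a typical edge case to mention when preparing model features.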
Hard, Technical
78 practiced
You have a DataFrame column containing nested lists of tags for each document. For a very large dataset, implement an efficient way to flatten the tags into rows and keep a mapping to the original document id without creating excessive intermediate objects. Discuss memory-efficient NumPy or itertools patterns versus DataFrame.explode.
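One way to approach the flattening is sketched below, assuming each cell holds a single flat list of tags (one level of nesting) and never a missing value. The frame, column names, and flatten_tags helper are hypothetical. The NumPy and itertools version builds the id column with a single np.repeat call and streams the tags through chain.from_iterable; the explode version is shown for comparison.

```python
from itertools import chain

import numpy as np
import pandas as pd

# Hypothetical documents; doc 11 has no tags.
docs = pd.DataFrame(
    {
        "doc_id": [10, 11, 12],
        "tags": [["a", "b"], [], ["c"]],
    }
)


def flatten_tags(df):
    """Return one row per (doc_id, tag) pair."""
    # Number of tags per document, used as repeat counts for the ids.
    lengths = df["tags"].map(len).to_numpy()
    doc_ids = np.repeat(df["doc_id"].to_numpy(), lengths)
    # chain.from_iterable walks the lists lazily; only the final list is built.
    tags = list(chain.from_iterable(df["tags"]))
    return pd.DataFrame({"doc_id": doc_ids, "tag": tags})


flat = flatten_tags(docs)

# Equivalent result with explode; empty lists become NaN rows, dropped here.
exploded = docs.explode("tags").dropna(subset=["tags"])
```

The two approaches differ on empty lists: the manual version drops those documents, while explode keeps them as rows with a missing tag unless they are filtered out. For data that does not fit in memory, the same function could be applied per chunk.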
Medium, Technical
74 practiced
Compare performing aggregations and joins in pandas versus delegating them to a relational database. For a typical data science preprocessing workload, when would you rely on SQL and when on pandas? Discuss tradeoffs including I/O, concurrency, transformation complexity, and developer productivity.
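To make the comparison concrete, the same join and aggregation can be written both ways. The sketch below uses hypothetical tables and an in-memory SQLite database from the standard library as a stand-in for a real relational database; it illustrates the two styles only and says nothing about their relative performance.

```python
import sqlite3

import pandas as pd

# Hypothetical tables.
orders = pd.DataFrame(
    {"order_id": [1, 2, 3], "customer_id": [1, 1, 2], "amount": [10.0, 5.0, 7.5]}
)
customers = pd.DataFrame({"customer_id": [1, 2], "region": ["east", "west"]})

# pandas: both tables are loaded into memory, then joined and aggregated.
pandas_result = (
    orders.merge(customers, on="customer_id", how="inner")
    .groupby("region", as_index=False)["amount"]
    .sum()
    .sort_values("region")
    .reset_index(drop=True)
)

# SQL: the database performs the join and aggregation; only the small
# result set is transferred into a DataFrame.
conn = sqlite3.connect(":memory:")
orders.to_sql("orders", conn, index=False)
customers.to_sql("customers", conn, index=False)
sql_result = pd.read_sql_query(
    """
    SELECT c.region AS region, SUM(o.amount) AS amount
    FROM orders AS o
    JOIN customers AS c ON o.customer_id = c.customer_id
    GROUP BY c.region
    ORDER BY c.region
    """,
    conn,
)
conn.close()
```

When the source tables already live in a database, pushing the join and aggregation into SQL reduces the data moved over the network, while pandas tends to be more convenient for transformations that are awkward to express in SQL.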
