Python Data Manipulation with Pandas Questions

Skills and concepts for extracting, transforming, and preparing tabular and array data in Python using libraries such as pandas and NumPy. Candidates should be comfortable reading data from common formats, working with pandas DataFrame and Series objects, selecting and filtering rows and columns with boolean indexing and query methods, aggregating with groupby, sorting, merging and joining DataFrames, reshaping data with pivot and melt, handling missing values, and converting and validating data types. Candidates should also understand NumPy arrays and vectorized operations for efficient numeric computation, know when to prefer vectorized approaches over Python loops, and be able to write readable, reusable data processing functions. At higher levels, expect questions on memory efficiency, profiling and optimizing slow pandas operations, processing data that does not fit in memory, and designing robust pipelines that handle edge cases and mixed data types.

Hard | System Design

Design a robust ETL pipeline using pandas for ingesting raw CSVs from S3, validating schema, transforming and normalizing columns, and writing partitioned parquet files for downstream model training. Include considerations for retries, atomic writes, incremental processing, schema evolution, monitoring, and memory constraints in your design.
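
One possible starting point for an answer is sketched below. It covers only three of the listed concerns: bounded memory (chunked reads), schema validation, and atomic writes (temp file plus rename). Everything named here is an assumption for illustration: the `EXPECTED_DTYPES` schema, the function names, and the local file paths standing in for S3. The sketch writes CSV parts so that it runs without a Parquet engine installed; in a real pipeline `DataFrame.to_parquet` would presumably take the place of `to_csv`.

```python
import os
import tempfile

import pandas as pd

# Hypothetical expected schema; a real pipeline would load this from config.
EXPECTED_DTYPES = {"user_id": "int64", "amount": "float64", "country": "string"}


def validate_and_normalize(chunk: pd.DataFrame) -> pd.DataFrame:
    """Normalize column names, fail fast on missing columns, then cast dtypes."""
    chunk = chunk.rename(columns=lambda c: c.strip().lower())
    missing = set(EXPECTED_DTYPES) - set(chunk.columns)
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return chunk[list(EXPECTED_DTYPES)].astype(EXPECTED_DTYPES)


def atomic_write_csv(df: pd.DataFrame, dest_path: str) -> None:
    """Write to a temp file in the destination directory, then rename it."""
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".tmp")
    os.close(fd)
    try:
        df.to_csv(tmp_path, index=False)
        # os.replace is atomic when source and target share a filesystem,
        # so readers never observe a half-written part file.
        os.replace(tmp_path, dest_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def run_pipeline(src_csv: str, out_dir: str, chunksize: int = 100_000) -> int:
    """Stream the CSV in chunks so memory stays bounded; return rows written."""
    os.makedirs(out_dir, exist_ok=True)
    rows = 0
    for i, chunk in enumerate(pd.read_csv(src_csv, chunksize=chunksize)):
        clean = validate_and_normalize(chunk)
        atomic_write_csv(clean, os.path.join(out_dir, f"part-{i:05d}.csv"))
        rows += len(clean)
    return rows
```

Retries, incremental processing, schema evolution, and monitoring are not shown; they would wrap `run_pipeline` rather than change its core loop.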

Hard | Technical

Design unit tests for a pandas data transformation function that normalizes column names, casts dtypes, and computes derived features. Show how to use pytest fixtures and pandas.testing.assert_frame_equal and describe property-based testing with hypothesis.extra.pandas to catch edge cases such as nulls and duplicate columns.
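
A minimal sketch of the fixture and `assert_frame_equal` part might look like the following. The `transform` function, the column names, and the `make_raw_frame` helper are all hypothetical stand-ins for whatever function is actually under test. Property-based testing with `hypothesis.extra.pandas` is left out so the example depends only on pandas, NumPy, and pytest.

```python
import numpy as np
import pandas as pd
import pytest
from pandas.testing import assert_frame_equal


def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical function under test: normalize names, cast, derive a feature."""
    out = df.rename(columns=lambda c: c.strip().lower().replace(" ", "_"))
    if out.columns.duplicated().any():
        raise ValueError("duplicate column names after normalization")
    out = out.astype({"price": "float64", "quantity": "int64"})
    out["total"] = out["price"] * out["quantity"]
    return out


def make_raw_frame() -> pd.DataFrame:
    """Messy input: padded column name, a null price, a narrow integer dtype."""
    return pd.DataFrame({
        " Price ": [1.5, None, 2.0],
        "Quantity": np.array([2, 3, 4], dtype="int32"),
    })


@pytest.fixture
def raw_frame() -> pd.DataFrame:
    # The fixture hands each test a fresh copy of the input.
    return make_raw_frame()


def test_transform_matches_expected(raw_frame):
    expected = pd.DataFrame({
        "price": [1.5, np.nan, 2.0],
        "quantity": np.array([2, 3, 4], dtype="int64"),
        "total": [3.0, np.nan, 8.0],
    })
    # assert_frame_equal checks values, dtypes, and labels; NaNs in the
    # same positions compare as equal.
    assert_frame_equal(transform(raw_frame), expected)


def test_duplicate_columns_rejected():
    # "Price" and " price " collide once names are normalized.
    clash = pd.DataFrame([[1.0, 2.0, 3]], columns=["Price", " price ", "Quantity"])
    with pytest.raises(ValueError):
        transform(clash)
```

Keeping the input builder as a plain function alongside the fixture makes it reusable outside pytest, for example when generating example data for documentation.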

Hard | Technical

You must merge a very large transactional DataFrame with a dimension table that is small but repeated many times per transaction. Describe concrete pandas techniques to optimize this merge in memory and time, for example converting keys to categorical, setting index and joining on index, or performing the join outside pandas in a database. Provide rationale for each technique.
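
The categorical-key technique can be illustrated with a small synthetic example. The table names, column names, and sizes below are invented for the sketch; the point is only that low-cardinality string keys shrink considerably as a categorical, and that giving both sides the same `CategoricalDtype` lets the merge work on that compact representation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical small dimension table and a larger transaction table.
dim = pd.DataFrame({
    "store_id": ["s1", "s2", "s3"],
    "region": ["north", "south", "east"],
})
tx = pd.DataFrame({
    "store_id": rng.choice(["s1", "s2", "s3"], size=10_000),
    "amount": rng.random(10_000),
})

bytes_before = tx["store_id"].memory_usage(deep=True)

# Use one shared categorical dtype on both sides of the join.
key_dtype = pd.CategoricalDtype(list(dim["store_id"].unique()))
dim["store_id"] = dim["store_id"].astype(key_dtype)
tx["store_id"] = tx["store_id"].astype(key_dtype)

bytes_after = tx["store_id"].memory_usage(deep=True)

# validate="many_to_one" raises if the dimension key is not unique,
# which guards against accidental row multiplication.
enriched = tx.merge(dim, on="store_id", how="left", validate="many_to_one")

print(f"key column: {bytes_before} bytes as strings, {bytes_after} as category")
```

Setting the dimension key as the index and using `DataFrame.join`, or pushing the join into a database, are alternatives when the key has higher cardinality or the transaction table does not fit in memory.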

Medium | Technical

You have minute-level events for IoT devices. Using pandas, show how you would resample to hourly metrics per device computing sum, mean, and number of observations, while handling missing periods by forward filling last known status and being timezone-aware. Provide code illustrating set_index, tz_localize, resample, and agg.
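
One way to approach this is sketched below on a tiny hand-made dataset. The column names (`device_id`, `ts`, `value`, `status`) and the assumption that the naive timestamps are in UTC are illustrative, not part of the question.

```python
import pandas as pd

# Hypothetical events; device "a" has no observations between 01:00 and 02:00.
events = pd.DataFrame({
    "device_id": ["a", "a", "a", "b", "b"],
    "ts": pd.to_datetime([
        "2024-01-01 00:05", "2024-01-01 00:45", "2024-01-01 02:10",
        "2024-01-01 00:30", "2024-01-01 01:15",
    ]),
    "value": [1.0, 3.0, 5.0, 10.0, 20.0],
    "status": ["ok", "warn", "ok", "ok", "fail"],
})

# Index by time and attach a timezone (assumed UTC here).
events = events.set_index("ts").tz_localize("UTC").sort_index()

# Resample each device independently into hourly bins.
grouped = events.groupby("device_id").resample("1h")

# Numeric metrics: empty hours yield sum 0, mean NaN, count 0.
hourly = grouped["value"].agg(["sum", "mean", "count"])

# Last status seen in each hour, forward filled within each device
# so that gaps carry the last known status.
last_status = grouped["status"].last()
hourly["status"] = last_status.groupby(level="device_id").ffill()

print(hourly)
```

Forward filling inside `groupby(level="device_id")` matters: a plain `ffill()` on the combined index could leak one device's status into the first rows of the next device.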

Hard | Technical

An overnight pandas ETL job failed because a vendor changed delimiter and quoting in a CSV file. Describe step-by-step how you would triage the incident, repair the pipeline for that run, and implement preventative measures so similar changes are detected and handled automatically in future runs.
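
For the preventative part, one option is to detect the file's dialect before parsing and flag any drift from the agreed format. The sketch below uses the standard library's `csv.Sniffer`; the expected delimiter, quote character, column list, and function names are hypothetical. Sniffing is heuristic, so in practice it would complement a column-level schema check rather than replace it.

```python
import csv
import io

import pandas as pd

# Hypothetical vendor contract.
EXPECTED_DELIMITER = ","
EXPECTED_QUOTECHAR = '"'
EXPECTED_COLUMNS = ["id", "name", "amount"]


def detect_dialect(sample: str):
    """Guess delimiter and quote character from a text sample."""
    return csv.Sniffer().sniff(sample, delimiters=",;|\t")


def load_with_detected_dialect(text: str):
    """Parse with the detected dialect; report whether it drifted from the contract."""
    dialect = detect_dialect(text[:4096])
    drifted = (dialect.delimiter, dialect.quotechar) != (
        EXPECTED_DELIMITER,
        EXPECTED_QUOTECHAR,
    )
    df = pd.read_csv(
        io.StringIO(text),
        sep=dialect.delimiter,
        quotechar=dialect.quotechar,
    )
    # Even with a detected dialect, refuse to continue on an unexpected header.
    if list(df.columns) != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected columns: {list(df.columns)}")
    return df, drifted


# A file where the vendor switched to semicolons and single quotes.
changed = "id;name;amount\n1;'Smith; J';10.5\n2;'Lee';7.0\n"
frame, drifted = load_with_detected_dialect(changed)
print(drifted)
print(frame)
```

The `drifted` flag is the hook for monitoring: the run can still succeed with the detected dialect while an alert prompts someone to confirm the change with the vendor.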