Python Data Manipulation with Pandas Questions

Skills and concepts for extracting, transforming, and preparing tabular and array data in Python using libraries such as pandas and NumPy. Candidates should be comfortable reading data from common formats; working with pandas DataFrame and Series objects; selecting and filtering rows and columns with boolean indexing and query methods; performing groupby aggregations, sorting, merging, and joining DataFrames; reshaping data with pivot and melt; handling missing values; and converting and validating data types. They should also understand NumPy arrays and vectorized operations for efficient numeric computation, know when to prefer vectorized approaches over Python loops, and be able to write readable, reusable data processing functions. At higher levels, expect questions on memory efficiency, profiling and optimizing slow pandas operations, processing data that does not fit in memory, and designing robust pipelines that handle edge cases and mixed data types.

Medium · Technical

Given a DataFrame 'df' with columns ['timestamp','device_id','value'] sampled irregularly, write pandas code to resample each device to 1-minute frequency, fill missing values using forward-fill but only for gaps up to 5 minutes, and compute a rolling 5-minute mean per device. Discuss trade-offs between upsampling and downsampling and memory implications of this approach.
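
One possible starting point for an answer, offered as a sketch rather than a reference solution. The function name `resample_device` and the output column `rolling_mean_5min` are illustrative assumptions. Readings that fall in the same minute are averaged, which is also an assumption about the desired behavior. Note that `ffill(limit=5)` fills the first five minutes of any gap, including gaps longer than five minutes; leaving long gaps entirely empty would need an extra gap-length mask.

```python
import pandas as pd


def resample_device(df: pd.DataFrame) -> pd.DataFrame:
    """Resample each device to a 1-minute grid, forward-fill short gaps,
    and add a time-based 5-minute rolling mean (illustrative sketch)."""
    df = df.copy()
    df["timestamp"] = pd.to_datetime(df["timestamp"])

    parts = []
    for device_id, group in df.groupby("device_id"):
        series = (
            group.set_index("timestamp")["value"]
            .sort_index()
            .resample("1min")
            .mean()          # average readings that share a minute; empty minutes become NaN
            .ffill(limit=5)  # fill at most 5 consecutive empty minutes
        )
        out = series.to_frame("value")
        # Time-based window: uses the rows in the 5 minutes ending at each timestamp.
        out["rolling_mean_5min"] = series.rolling("5min", min_periods=1).mean()
        out["device_id"] = device_id
        parts.append(out.reset_index())

    return pd.concat(parts, ignore_index=True)
```

On the memory side, upsampling to a 1-minute grid creates one row per device per minute between that device's first and last reading, so the output size depends on the time span covered rather than on the number of raw readings. Sparse devices with long histories are the ones most likely to inflate memory.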

Hard · Technical

Case study: design an end-to-end preprocessing pipeline for tabular data used by multiple models. Requirements: accept CSV and Parquet inputs, detect schema drift, enforce data contracts, perform cleaning (missing values, dtype normalization), feature engineering, and output partitioned Parquet for training and inference. Using pandas as a core component, design the architecture and explain where to use supporting tools (Dask, DuckDB, Airflow/Prefect, pandera), testing practices, monitoring, scalability and reproducibility considerations.
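
The schema-drift and data-contract step of such a pipeline can be illustrated with plain pandas. This is a minimal sketch under stated assumptions: `EXPECTED_SCHEMA`, `load_table`, and `detect_schema_drift` are hypothetical names, the contract covers only column names and dtypes, and a real design might delegate this check to a validation library such as pandera. Reading Parquet with pandas additionally requires a Parquet engine such as pyarrow.

```python
from pathlib import Path

import pandas as pd

# Hypothetical data contract: column name -> expected pandas dtype string.
EXPECTED_SCHEMA = {"user_id": "int64", "amount": "float64"}


def load_table(path) -> pd.DataFrame:
    """Load a CSV or Parquet file based on its extension."""
    suffix = Path(path).suffix.lower()
    if suffix == ".csv":
        return pd.read_csv(path)
    if suffix == ".parquet":
        return pd.read_parquet(path)
    raise ValueError(f"unsupported input format: {suffix}")


def detect_schema_drift(df: pd.DataFrame, expected: dict) -> list:
    """Return a list of human-readable differences between df and the contract."""
    issues = []
    for col, dtype in expected.items():
        if col not in df.columns:
            issues.append(f"missing column: {col}")
        elif str(df[col].dtype) != dtype:
            issues.append(
                f"dtype drift in {col}: expected {dtype}, got {df[col].dtype}"
            )
    for col in df.columns:
        if col not in expected:
            issues.append(f"unexpected column: {col}")
    return issues
```

Returning a list of issues instead of raising immediately lets the caller decide whether a given drift should fail the run, trigger an alert, or only be logged.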

Hard · System Design

Design a pandas-based approach to compute per-user aggregates from a 200GB CSV file that does not fit in memory. Provide a step-by-step plan and code sketches using pd.read_csv with chunksize, intermediate Parquet storage or serialized partial aggregates, and strategies to minimize memory and IO. Mention trade-offs and when to migrate to Dask or Spark.
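
A minimal sketch of the chunked partial-aggregate idea, assuming hypothetical column names `user_id` and `amount` and assuming the aggregates of interest are sum, count, and mean. Those are all decomposable, so per-chunk results can be combined exactly. The function name `aggregate_in_chunks` is illustrative.

```python
import pandas as pd


def aggregate_in_chunks(csv_path, chunksize=1_000_000) -> pd.DataFrame:
    """Compute per-user sum, count, and mean without loading the whole file."""
    reader = pd.read_csv(
        csv_path,
        usecols=["user_id", "amount"],  # read only the columns that are needed
        dtype={"user_id": "int64", "amount": "float64"},  # skip dtype inference
        chunksize=chunksize,
    )

    partial = None
    for chunk in reader:
        agg = chunk.groupby("user_id")["amount"].agg(total="sum", n="count")
        # Align on user_id; users absent from one side contribute 0.
        partial = agg if partial is None else partial.add(agg, fill_value=0)

    if partial is None:
        raise ValueError("input contained no rows")

    # The mean is derived at the end from the combined sum and count.
    partial["mean"] = partial["total"] / partial["n"]
    return partial
```

Only the running per-user table and one chunk are held in memory at a time, so this works as long as the number of distinct users fits in memory. If it does not, writing per-chunk partial aggregates to Parquet and merging them in a second pass, or moving to Dask or Spark, becomes the more realistic option.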

Easy · Technical

Given a DataFrame in long format with columns ['date','store_id','product','sales'], write pandas code to pivot it to wide format with index ['date','store_id'] and columns for each product containing sales, filling missing entries with 0. Then show how to reverse the operation back to long format using melt.
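
A short sketch of one way to answer, using made-up sample data. It uses `pivot_table` with `aggfunc="sum"` on the assumption that duplicate (date, store_id, product) rows should be summed; `DataFrame.pivot` would raise on such duplicates instead.

```python
import pandas as pd

# Illustrative long-format data (values are invented for the example).
df = pd.DataFrame(
    {
        "date": ["2024-01-01", "2024-01-01", "2024-01-01", "2024-01-02"],
        "store_id": [1, 1, 2, 1],
        "product": ["apple", "pear", "apple", "pear"],
        "sales": [10, 4, 7, 3],
    }
)

# Long -> wide: one row per (date, store_id), one column per product.
wide = df.pivot_table(
    index=["date", "store_id"],
    columns="product",
    values="sales",
    aggfunc="sum",
    fill_value=0,  # combinations with no sales row become 0
)

# Wide -> long: bring the index back as columns, then unpivot the product columns.
long_again = wide.reset_index().melt(
    id_vars=["date", "store_id"],
    var_name="product",
    value_name="sales",
)
```

The round trip is not an exact inverse: `long_again` contains a row for every (date, store_id, product) combination, including the zero-filled ones that had no row in the original data. Filtering on `sales != 0` would remove them, but it would also remove any genuine zero-sales rows.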

Medium · Technical

You need to load a filtered subset of a large Postgres table into pandas for analysis. Explain how to use pandas.read_sql_query with SQLAlchemy to push down filters, use chunked reads or server-side cursors to avoid memory issues, and how to create a temporary table in the database for faster joins. Provide code sketches and safety considerations.
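
A hedged sketch of the filter push-down and chunked-read part of an answer. The table `measurements`, its columns `id` and `value`, and the function name `load_filtered` are hypothetical. The filter value is passed as a bound parameter rather than formatted into the SQL string, which is the main safety consideration. `stream_results=True` is a SQLAlchemy execution option that requests a server-side cursor on drivers that support one (for example psycopg2 with Postgres).

```python
import pandas as pd
from sqlalchemy import create_engine, text


def load_filtered(engine, min_value, chunksize=50_000) -> pd.DataFrame:
    """Load rows matching a filter, reading the result set in chunks."""
    # The WHERE clause runs in the database; :min_value is a bound parameter.
    query = text(
        "SELECT id, value FROM measurements "
        "WHERE value >= :min_value ORDER BY id"
    )

    chunks = []
    # stream_results asks the driver to stream rows instead of buffering them all.
    with engine.connect().execution_options(stream_results=True) as conn:
        for chunk in pd.read_sql_query(
            query, conn, params={"min_value": min_value}, chunksize=chunksize
        ):
            # Per-chunk filtering or aggregation could go here to keep memory low.
            chunks.append(chunk)

    if not chunks:
        return pd.DataFrame(columns=["id", "value"])
    return pd.concat(chunks, ignore_index=True)
```

With Postgres, the engine would typically come from something like `create_engine("postgresql+psycopg2://...")`, with credentials supplied from the environment rather than hard-coded. Concatenating all chunks only makes sense when the filtered subset fits in memory; otherwise each chunk should be reduced or written out as it arrives.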
