InterviewStack.io

Python Data Manipulation with Pandas Questions

Skills and concepts for extracting, transforming, and preparing tabular and array data in Python using libraries such as pandas and NumPy. Candidates should be comfortable reading data from common formats, working with pandas DataFrame and Series objects, selecting and filtering rows and columns, boolean indexing and query methods, groupby aggregations, sorting, merging and joining dataframes, reshaping data with pivot and melt, handling missing values, and converting and validating data types. Understand NumPy arrays and vectorized operations for efficient numeric computation, when to prefer vectorized approaches over Python loops, and how to write readable, reusable data processing functions. At higher levels, expect questions on memory efficiency, profiling and optimizing slow pandas operations, processing data that does not fit in memory, and designing robust pipelines that handle edge cases and mixed data types.

Medium | Technical
Describe practical steps to reduce the memory footprint of a pandas DataFrame with many columns and millions of rows. Cover: downcasting numeric types, converting strings to categorical, using nullable dtypes, dropping unused columns, using sparse dtypes, and when to persist to on-disk formats. Include code examples for downcasting numeric columns.
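One way to start an answer to the downcasting part is sketched below. The helper name `downcast_numeric` is hypothetical, and the sketch assumes plain NumPy-backed integer and float columns; it relies only on `pd.to_numeric(..., downcast=...)`, which picks the smallest dtype that can hold the values.

```python
import numpy as np
import pandas as pd


def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with integer and float columns downcast.

    Hypothetical helper for illustration; non-numeric columns are left as-is.
    """
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            # Smallest signed integer dtype that holds every value
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            # float32 is the smallest float dtype pandas will downcast to
            out[col] = pd.to_numeric(out[col], downcast="float")
    return out


df = pd.DataFrame({
    "count": [1, 2, 3],
    "ratio": [0.5, 1.5, 2.5],
    "label": ["x", "y", "z"],
})
small = downcast_numeric(df)

# Compare per-column memory before and after
print(df.memory_usage(deep=True))
print(small.memory_usage(deep=True))
```

Downcasting floats to float32 trades precision for space, so it is worth checking that downstream calculations tolerate the reduced precision before applying it everywhere.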
Hard | Technical
You're ingesting CSVs that sometimes contain malformed rows: missing delimiters, stray quotes, and inconsistent headers. Design a robust pandas-based reader that detects bad rows, logs them to a quarantine file with line numbers, attempts best-effort parsing, and continues processing. Describe heuristics for detecting schema drift and for deciding when to fail fast.
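A minimal sketch of the quarantine idea follows. It assumes the only check is "field count matches the header" and uses the standard-library `csv` module to keep line numbers; the function name `read_csv_with_quarantine` is hypothetical. A fuller answer would also write the quarantined rows to a file and compare headers across files to detect schema drift.

```python
import csv
import io

import pandas as pd


def read_csv_with_quarantine(text: str):
    """Split CSV text into (DataFrame of good rows, list of bad rows).

    Hypothetical helper: a row is "bad" when its field count differs
    from the header's. Bad rows are returned as (line_number, fields).
    """
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    good, bad = [], []
    for row in reader:
        if not row:
            continue  # skip blank lines
        if len(row) == len(header):
            good.append(row)
        else:
            # reader.line_num is the number of source lines read so far
            bad.append((reader.line_num, row))
    return pd.DataFrame(good, columns=header), bad


raw = "id,name,amount\n1,alice,10\n2,bob\n3,carol,30,extra\n4,dave,40\n"
clean, quarantined = read_csv_with_quarantine(raw)
print(clean)
print(quarantined)
```

`pandas.read_csv` also has an `on_bad_lines` parameter (`'error'`, `'warn'`, `'skip'`, or a callable with `engine="python"`), but the callable receives only the split fields, so a pre-pass like this one is a simple way to retain line numbers.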
Medium | Technical
You have a pandas pipeline that is slow. Explain how you would profile it to find hotspots using tools like timeit, cProfile, line_profiler, and memory_profiler. Show an example of measuring a particular DataFrame operation and interpreting results to decide whether to vectorize, use numexpr, or offload to another engine.
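As an illustration of the measuring step, the sketch below times a row-wise `apply` against the equivalent vectorized expression with `timeit`, then profiles the slow version with `cProfile`. The column names and the 5,000-row synthetic frame are assumptions made for the example; `line_profiler` and `memory_profiler` are third-party tools and are not shown.

```python
import cProfile
import io
import pstats
import timeit

import numpy as np
import pandas as pd

# Synthetic data, assumed for illustration only
rng = np.random.default_rng(0)
sample = pd.DataFrame({
    "price": rng.random(5_000),
    "qty": rng.integers(1, 10, size=5_000),
})


def total_apply(frame):
    # Row-wise Python callback: one function call per row
    return frame.apply(lambda row: row["price"] * row["qty"], axis=1)


def total_vectorized(frame):
    # Single column-wise multiplication executed in NumPy
    return frame["price"] * frame["qty"]


t_apply = timeit.timeit(lambda: total_apply(sample), number=3)
t_vec = timeit.timeit(lambda: total_vectorized(sample), number=3)
print(f"apply: {t_apply:.4f}s  vectorized: {t_vec:.4f}s")

# Profile the slow path and show the five most expensive calls
profiler = cProfile.Profile()
profiler.enable()
total_apply(sample)
profiler.disable()
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

If the profile is dominated by per-row Python calls, as it typically is for `apply(axis=1)`, vectorizing is the first thing to try. If the vectorized form is still the hotspot, that is the point at which numexpr or another engine becomes worth evaluating.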
Easy | Technical
A DataFrame has columns that are strings representing numbers and dates. Show pandas code to convert 'amount_str' to a numeric column with coercion, reporting the rows that failed conversion, and 'date_str' to a timezone-aware datetime column. Explain the use of errors='coerce' and how to downcast numeric types for memory savings.
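A possible answer is sketched below with made-up sample values. It assumes the date strings are ISO 8601 with UTC offsets and that normalizing everything to UTC is acceptable; `errors="coerce"` turns unparseable values into `NaN` or `NaT` instead of raising, so failures can be reported afterwards.

```python
import pandas as pd

# Sample values are assumptions for illustration
df = pd.DataFrame({
    "amount_str": ["10.5", "7", "n/a", "3.25"],
    "date_str": [
        "2024-01-15T08:30:00+00:00",
        "2024-02-01T12:00:00+02:00",
        "not a date",
        "2024-03-10T23:59:59-05:00",
    ],
})

# Unparseable strings become NaN rather than raising
df["amount"] = pd.to_numeric(df["amount_str"], errors="coerce")

# Rows where the input was present but the conversion failed
failed_amount = df[df["amount"].isna() & df["amount_str"].notna()]
print(failed_amount[["amount_str"]])

# utc=True converts every offset to UTC and yields a tz-aware column
df["date"] = pd.to_datetime(df["date_str"], errors="coerce", utc=True)
failed_date = df[df["date"].isna() & df["date_str"].notna()]
print(failed_date[["date_str"]])

# Optional: shrink the float column once the values are validated
df["amount_small"] = pd.to_numeric(df["amount"], downcast="float")
print(df.dtypes)
```

Comparing the coerced column against the original with `isna()` and `notna()` distinguishes values that were missing to begin with from values that failed to parse.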
Hard | Technical
You have a CPU-bound groupby that processes many unique keys and is slower than expected. Describe techniques to speed it up in pandas, for example downcasting numeric dtypes, converting keys to categorical, sorting before grouping, and using index-based joins, as well as alternatives such as Numba, Cython, Dask, or PySpark. Provide a realistic step-by-step plan.
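Two of the cheaper steps in such a plan can be sketched as follows: replace a Python-level `apply` with a built-in aggregation, and convert string keys to categorical before grouping. The data is synthetic and the column names are assumptions; actual speedups depend on key cardinality and pandas version, so both variants should be timed on real data.

```python
import numpy as np
import pandas as pd

# Synthetic data, assumed for illustration only
rng = np.random.default_rng(0)
n = 20_000
df = pd.DataFrame({
    "key": rng.integers(0, 1_000, size=n).astype(str),
    "value": rng.random(n),
})

# Baseline: a Python lambda is called once per group
slow = df.groupby("key")["value"].apply(lambda s: s.sum())

# Step 1: categorical keys let pandas group on integer codes
fast_df = df.assign(key=df["key"].astype("category"))

# Step 2: built-in aggregation, no output sorting, observed categories only
fast = fast_df.groupby("key", observed=True, sort=False)["value"].sum()

# Align the two results on string keys to confirm they agree
fast_aligned = pd.Series(
    fast.to_numpy(), index=fast.index.astype(str)
).sort_index()
print(np.allclose(slow.sort_index().to_numpy(), fast_aligned.to_numpy()))
```

Checking that the optimized result matches the baseline before measuring it keeps the plan honest; only once these in-pandas steps are exhausted does it make sense to reach for Numba, Cython, Dask, or PySpark.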
