InterviewStack.io

Python for Data Analysis Questions

Covers the practical use of Python and its data libraries to perform data ingestion, cleaning, transformation, analysis, and aggregation. Candidates should be able to manipulate data frames, perform complex grouping and aggregation operations, merge and join multiple data sources, and implement efficient vectorized operations using libraries such as pandas and NumPy. Expect to write clear, idiomatic Python with appropriate error handling, input validation, and small tests or assertions. At more senior levels, discuss performance trade-offs and scalability strategies, such as choosing between NumPy vectorization and pandas operations and deciding when to adopt alternative tools like Polars or Dask for very large datasets, as well as techniques for memory management, profiling, and incremental or streaming processing. Also cover reproducibility, serialization formats, and integrating analysis into pipelines.

Easy · Technical
53 practiced
Given a pandas DataFrame with both numeric and categorical columns, write Python code that fills missing numeric values with the column median and fills categorical columns with the mode. Include handling for columns that are all NaNs and avoid modifying the original DataFrame in-place.
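One possible answer can be sketched as follows. The function name `fill_missing` and the sample columns are hypothetical, and leaving all-NaN columns untouched is an assumption; a caller might instead drop them or fill a sentinel value.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def fill_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with numeric gaps filled by the median and other gaps by the mode."""
    if not isinstance(df, pd.DataFrame):
        raise TypeError("df must be a pandas DataFrame")

    out = df.copy()  # work on a copy so the caller's frame is never modified
    for col in out.columns:
        series = out[col]
        n_missing = int(series.isna().sum())
        if n_missing == 0:
            continue  # nothing to fill
        if n_missing == len(series):
            # All-NaN column: no median or mode exists, so leave it as-is
            # (assumption; dropping the column is another reasonable policy).
            continue
        if is_numeric_dtype(series):
            out[col] = series.fillna(series.median())
        else:
            # mode() can return several values on ties; take the first one.
            out[col] = series.fillna(series.mode(dropna=True).iloc[0])
    return out


# Hypothetical sample data for illustration only.
example = pd.DataFrame(
    {
        "price": [1.0, None, 3.0, 100.0],
        "color": ["red", None, "red", "blue"],
        "empty": [None, None, None, None],
    }
)
filled = fill_missing(example)
```

The median is used for numeric columns because it is less sensitive to outliers (such as the 100.0 above) than the mean.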
Easy · Technical
63 practiced
Explain what vectorized operations mean in the context of pandas/NumPy and why they tend to be faster than row-wise apply. Provide a short example that shows computing a new column using a vectorized expression vs using apply, and explain which is preferable and why.
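A minimal sketch of the comparison this question asks for, using hypothetical `price` and `qty` columns. Both approaches produce the same values; the vectorized form hands whole columns to compiled NumPy routines, while `apply` with `axis=1` calls a Python function once per row.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data for illustration only.
df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "qty": [1, 2, 3]})

# Vectorized: one expression over entire columns, executed in compiled code.
df["total_vec"] = df["price"] * df["qty"]

# Row-wise apply: builds a Series per row and runs the lambda in the interpreter.
df["total_apply"] = df.apply(lambda row: row["price"] * row["qty"], axis=1)

# Conditional logic can usually stay vectorized too, for example with np.where.
df["total_discounted"] = np.where(
    df["qty"] >= 2, df["total_vec"] * 0.9, df["total_vec"]
)
```

The vectorized version is generally preferable: it is shorter, avoids per-row Python overhead, and the gap widens as the frame grows. Row-wise `apply` is best kept for logic that cannot be expressed with column operations.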
Medium · Technical
60 practiced
A stakeholder asks you to join near-real-time clickstream events with a static user profile table to enrich events for downstream dashboards. Sketch a prototype approach in Python: which tools you would use (e.g., Kafka, Spark Structured Streaming, Faust, Dask), how you would perform the joins, which latency trade-offs are acceptable, and how you would validate the freshness of joined profiles.
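Before committing to a streaming framework, the join and freshness logic can be prototyped on micro-batches in plain pandas. The sketch below assumes hypothetical column names (`user_id`, `event_time`, `profile_updated_at`) and a hypothetical helper `enrich_batch`; in a real prototype each batch would come from a consumer loop rather than an in-memory frame.

```python
import pandas as pd


def enrich_batch(
    events: pd.DataFrame, profiles: pd.DataFrame, max_profile_age: pd.Timedelta
) -> pd.DataFrame:
    """Left-join one micro-batch of events to the static profile snapshot."""
    joined = events.merge(
        profiles,
        on="user_id",
        how="left",               # keep every event, even without a profile
        validate="many_to_one",   # fail fast if profiles has duplicate user_ids
        indicator=True,
    )
    joined["profile_found"] = joined["_merge"] == "both"
    # Freshness check: how old was the profile when the event happened?
    age = joined["event_time"] - joined["profile_updated_at"]
    joined["profile_stale"] = joined["profile_found"] & (age > max_profile_age)
    return joined.drop(columns="_merge")


# Hypothetical sample data for illustration only.
profiles = pd.DataFrame(
    {
        "user_id": [1, 2],
        "plan": ["free", "pro"],
        "profile_updated_at": pd.to_datetime(["2024-01-01", "2024-01-05"]),
    }
)
events = pd.DataFrame(
    {
        "user_id": [1, 2, 3],
        "page": ["/home", "/pricing", "/home"],
        "event_time": pd.to_datetime(["2024-01-05 12:00"] * 3),
    }
)
enriched = enrich_batch(events, profiles, max_profile_age=pd.Timedelta(days=1))
```

Tracking the share of rows with `profile_found == False` or `profile_stale == True` per batch gives a simple freshness metric to alert on before the dashboards consume the data.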
Medium · Technical
44 practiced
You must compute a 7-day rolling average per product when timestamps are irregular and products have different event frequencies. Explain how you'd process this in pandas: whether to resample to regular frequency before rolling, how to group by product, and how to handle gaps and edge effects. Provide code snippets.
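One way to approach this without resampling is a time-based rolling window per product, sketched below with hypothetical column names (`product`, `ts`, `value`). This averages over events, so busy days weigh more; if each calendar day should weigh equally, resampling each product to daily frequency first is the alternative.

```python
import pandas as pd

# Hypothetical sample data with irregular timestamps.
events = pd.DataFrame(
    {
        "product": ["A", "A", "A", "B", "B"],
        "ts": pd.to_datetime(
            ["2024-01-01", "2024-01-03", "2024-01-10", "2024-01-02", "2024-01-04"]
        ),
        "value": [10.0, 20.0, 30.0, 5.0, 15.0],
    }
)

# Time-based windows need timestamps sorted within each group.
events = events.sort_values(["product", "ts"])

rolled = (
    events.set_index("ts")
    .groupby("product")["value"]
    # "7D" covers the 7 days ending at each event (right edge included,
    # left edge excluded); min_periods=1 keeps the early rows of each
    # product instead of returning NaN at the start of the series.
    .rolling("7D", min_periods=1)
    .mean()
    .rename("value_7d_avg")
    .reset_index()
)
```

Gaps need no special handling here: a window with only one event simply returns that event's value. Raising `min_periods` is one way to suppress averages that rest on too few observations.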
Hard · Technical
58 practiced
Propose a strategy to integrate unit-tested pandas transformations into a CI/CD pipeline that enforces data contracts. Specify Python tooling (examples: pytest, pandera/great_expectations, pre-commit, black), how to structure code and tests, and how to gate deployment on data schema checks.
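The core of such a setup is a pure transformation function plus tests that assert its output still satisfies the contract. The sketch below uses a hand-rolled check to stay dependency-free; the names `ORDERS_CONTRACT`, `contract_violations`, and `add_amount_with_tax` are hypothetical, and a library such as pandera or great_expectations would normally replace the hand-rolled check.

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype

# Hypothetical contract: required column name -> dtype predicate.
ORDERS_CONTRACT = {"order_id": is_integer_dtype, "amount": is_float_dtype}


def contract_violations(df: pd.DataFrame, contract: dict) -> list:
    """Return human-readable contract violations; an empty list means the frame passes."""
    problems = []
    for col, dtype_ok in contract.items():
        if col not in df.columns:
            problems.append(f"missing column: {col}")
        elif not dtype_ok(df[col]):
            problems.append(f"bad dtype for {col}: {df[col].dtype}")
    return problems


def add_amount_with_tax(orders: pd.DataFrame, rate: float = 0.2) -> pd.DataFrame:
    """Pure transformation: returns a new frame and leaves the input untouched."""
    out = orders.copy()
    out["amount_with_tax"] = out["amount"] * (1 + rate)
    return out


# pytest collects functions named test_*; a failing assert fails the CI job.
def test_transformation_keeps_contract():
    orders = pd.DataFrame({"order_id": [1, 2], "amount": [10.0, 20.0]})
    result = add_amount_with_tax(orders)
    assert contract_violations(result, ORDERS_CONTRACT) == []
    assert "amount_with_tax" not in orders.columns  # input was not mutated


def test_contract_flags_missing_column():
    broken = pd.DataFrame({"order_id": [1]})
    assert contract_violations(broken, ORDERS_CONTRACT) == ["missing column: amount"]
```

In a pipeline, running pytest as a required step and running the same contract check against a sample of real input data before deployment are two places where a non-empty violation list can block a release.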
