InterviewStack.io

Python for Data Analysis Questions

Covers the practical use of Python and its data libraries for data ingestion, cleaning, transformation, analysis, and aggregation. Candidates should be able to manipulate DataFrames, perform complex grouping and aggregation operations, merge and join multiple data sources, and implement efficient vectorized operations using libraries such as pandas and NumPy. Expect to write clear, idiomatic Python with appropriate error handling, input validation, and small tests or assertions. At more senior levels, discuss performance trade-offs and scalability strategies, such as choosing between raw NumPy vectorization and pandas operations and deciding when to adopt alternative tools like Polars or Dask for very large datasets, as well as techniques for memory management, profiling, and incremental or streaming processing. Also cover reproducibility, serialization formats, and integrating analysis into pipelines.

Hard · Technical
59 practiced
Discuss trade-offs between NumPy vectorized algorithms and pandas groupby when aggregating millions of rows by category labels. Include examples where using integer-encoded group labels and numpy.bincount/numpy.add.at outperform pandas groupby, and when pandas' optimized C-groupby is preferable because of dtype handling and flexibility.
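
One way to ground this discussion is to compute the same per-category sum three ways and check that they agree. The sketch below is illustrative only: the data, sizes, seed, and variable names are assumptions, and it makes no timing claims, since relative speed depends on data shape, dtypes, and library versions and should be measured with a profiler.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)        # fixed seed, arbitrary choice
n_rows, n_groups = 200_000, 50        # illustrative sizes, not a benchmark
labels = rng.choice([f"cat_{i}" for i in range(n_groups)], size=n_rows)
values = rng.random(n_rows)

# Integer-encode the labels once: codes run from 0 to k-1 and
# uniques[code] recovers the original label.
codes, uniques = pd.factorize(labels, sort=True)

# np.bincount with weights sums the values for each integer code.
sums_bincount = np.bincount(codes, weights=values, minlength=len(uniques))

# np.add.at is an unbuffered scatter-add. It generalizes to other ufuncs
# and dtypes, but it is not guaranteed to be faster than bincount.
sums_add_at = np.zeros(len(uniques))
np.add.at(sums_add_at, codes, values)

# Reference result: pandas groupby on the raw string labels.
sums_groupby = pd.Series(values).groupby(labels).sum()
```

The NumPy route assumes a simple reduction (sum or count) over numeric data with no missing values: `np.bincount` casts weights to float and a single NaN poisons its group's total. `groupby` skips NaN by default and handles mixed dtypes, multiple keys, and many aggregations in one call, which is the flexibility the question refers to.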
Easy · Technical
45 practiced
Explain the difference between pandas.DataFrame.pivot_table and using groupby + unstack for the same transformation. Provide a small example that shows how pivot_table handles aggregation and missing values differently and mention when you'd prefer one over the other in a BI context.
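
A small example of the kind the question asks for might look like the following sketch. The `sales` frame and its column names are hypothetical; the point is only to show where aggregation and missing-cell handling happen in each approach.

```python
import pandas as pd

# Hypothetical data: there is no (West, B) combination.
sales = pd.DataFrame({
    "region": ["East", "East", "East", "West"],
    "product": ["A", "A", "B", "A"],
    "revenue": [10.0, 5.0, 7.0, 3.0],
})

# pivot_table aggregates and reshapes in one call. Its default aggfunc
# is the mean, so the sum is requested explicitly, and fill_value
# replaces cells for combinations that never occur.
pivoted = sales.pivot_table(
    index="region",
    columns="product",
    values="revenue",
    aggfunc="sum",
    fill_value=0,
)

# groupby + unstack aggregates first and reshapes second. Without a
# fill_value argument to unstack, absent combinations become NaN.
unstacked = sales.groupby(["region", "product"])["revenue"].sum().unstack()
```

Here `pivoted` shows 0 for West/B while `unstacked` shows NaN, which matters in a BI report where "no sales" and "unknown" mean different things. `pivot_table` is convenient for report-style output (it also supports `margins` for totals), whereas `groupby` plus `unstack` keeps the intermediate aggregated Series available for further processing.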
Hard · Technical
52 practiced
Implement a threaded reader that reads multiple compressed CSV files concurrently and aggregates a metric (total revenue) in a thread-safe way using Python's threading or concurrent.futures. Explain GIL implications and why IO-bound workloads benefit from threads while CPU-bound tasks don't.
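
A minimal sketch of one possible design, using only the standard library, is shown below. It assumes gzip-compressed files with a `revenue` column; the helper names `file_revenue` and `total_revenue` and the demo files are made up for illustration. Thread safety is achieved here by avoiding shared mutable state rather than by locking.

```python
import csv
import gzip
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def file_revenue(path):
    """Sum the 'revenue' column of one gzip-compressed CSV file."""
    total = 0.0
    with gzip.open(path, mode="rt", newline="") as handle:
        for row in csv.DictReader(handle):
            total += float(row["revenue"])
    return total


def total_revenue(paths, max_workers=4):
    """Read the files concurrently and combine the per-file totals."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Each worker returns its own partial sum, so no shared variable
        # is mutated and no lock is needed. The final reduction runs in
        # the calling thread.
        return sum(pool.map(file_revenue, paths))


# Demo with three small temporary files (hypothetical data).
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for i, amounts in enumerate([[10.0, 2.5], [4.0], [1.5, 1.5, 1.0]]):
        path = Path(tmp) / f"orders_{i}.csv.gz"
        with gzip.open(path, mode="wt", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(["order_id", "revenue"])
            writer.writerows((j, amount) for j, amount in enumerate(amounts))
        paths.append(path)
    grand_total = total_revenue(paths)
```

On the GIL: in CPython a thread releases the GIL while it blocks on file or network I/O, so other threads can make progress during the wait. The row parsing and `float` conversion above are pure-Python work that holds the GIL, so once reading stops being the bottleneck, extra threads add little and a process pool is the usual alternative for CPU-bound work.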
Easy · Technical
60 practiced
Describe why vectorized operations are preferred over row-wise Python loops when manipulating pandas Series or NumPy arrays. Give an example: compute a new column 'tax' as 8% of 'amount' but set tax to 0 when 'amount' < 1 using a vectorized expression.
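
The expression the question asks for can be written in several ways; the sketch below uses `np.where` on hypothetical sample data. The column names come from the question, everything else is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with values on both sides of the threshold.
df = pd.DataFrame({"amount": [0.50, 1.00, 20.00, 0.99, 100.00]})

# np.where evaluates the condition for the whole column in compiled code:
# tax is 0 where amount < 1 and 8% of amount everywhere else.
df["tax"] = np.where(df["amount"] < 1, 0.0, df["amount"] * 0.08)
```

An equivalent pandas-only form is `(df["amount"] * 0.08).where(df["amount"] >= 1, 0.0)`. Both avoid a Python-level loop over rows, which is where the per-element interpreter overhead of `apply` or `iterrows` comes from.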
Hard · Technical
44 practiced
You are asked to generate reproducible synthetic datasets for testing reporting logic: design a Python utility that can generate deterministic synthetic orders with configurable cardinalities, date ranges, seasonality and anomalies. Explain how you'd seed randomness, structure the generator, and validate that generated datasets exercise edge cases.
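
As a starting point for such a utility, the sketch below shows one way to seed and structure a generator. The function name `generate_orders`, its parameters, the weekend-weighted seasonality, the log-normal amounts, and the 100x anomaly rule are all illustrative assumptions, not a prescribed design.

```python
import numpy as np
import pandas as pd


def generate_orders(n_orders=1_000, n_customers=50, start="2024-01-01",
                    days=90, anomaly_rate=0.01, seed=42):
    """Return a deterministic synthetic orders table for a given seed."""
    # A local Generator keeps the output independent of global random state.
    rng = np.random.default_rng(seed)
    dates = pd.date_range(start, periods=days, freq="D")

    # Simple weekly seasonality: weekend days are twice as likely.
    weights = np.where(dates.dayofweek >= 5, 2.0, 1.0)
    weights = weights / weights.sum()
    date_idx = rng.choice(days, size=n_orders, p=weights)

    amounts = rng.lognormal(mean=3.0, sigma=0.5, size=n_orders).round(2)

    # Anomalies: inflate a random subset of amounts by 100x and flag them
    # so tests can locate them without re-deriving the rule.
    is_anomaly = rng.random(n_orders) < anomaly_rate
    amounts = np.where(is_anomaly, amounts * 100, amounts)

    return pd.DataFrame({
        "order_id": np.arange(1, n_orders + 1),
        "customer_id": rng.integers(1, n_customers + 1, size=n_orders),
        "order_date": dates[date_idx],
        "amount": amounts,
        "is_anomaly": is_anomaly,
    })


orders = generate_orders(seed=42)
# Determinism check: the same seed must reproduce the same frame.
assert orders.equals(generate_orders(seed=42))
```

Validation can then be expressed as assertions over the output, for example that customer IDs stay within the configured cardinality, that dates fall inside the requested range, and that specific edge cases (a zero-order day, a flagged anomaly) are present for the seeds used in the test suite.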
