InterviewStack.io LogoInterviewStack.io

Python for Data Analysis Questions

Covers the practical use of Python and its data libraries to perform data ingestion, cleaning, transformation, analysis, and aggregation. Candidates should be able to manipulate data frames, perform complex grouping and aggregation operations, merge and join multiple data sources, and implement efficient vectorized operations using libraries such as Pandas and NumPy. Expect to write clear, idiomatic Python with appropriate error handling, input validation, and small tests or assertions. At more senior levels, discuss performance trade offs and scalability strategies such as choosing NumPy vectorization versus Pandas, and when to adopt alternative tools like Polars or Dask for very large datasets, as well as techniques for memory management, profiling, and incremental or streaming processing. Also cover reproducibility, serialization formats, and integrating analysis into pipelines.

EasyTechnical
48 practiced
You have two tables as pandas DataFrames: 'customers' (customer_id, name, country) and 'orders' (order_id, customer_id, amount). Implement a robust left join in Python to attach customer info to orders, ensuring no accidental many-to-many explosion when keys are duplicated. Explain how you'd detect and handle unexpected duplicate keys in customers.
EasyTechnical
53 practiced
Implement Python code that parses a 'timestamp' column in various common formats ('2024-03-10 12:34:56', '03/10/2024 12:34', ISO strings) into pandas datetime, sets it to UTC, and resamples the data to daily total 'amount' for a KPI. Mention timezone-aware considerations for BI dashboards.
HardTechnical
92 practiced
You have a 20GB pandas DataFrame in production that you need to reduce to under 4GB for processing on a single machine. Provide a concrete, ordered plan with Python snippets: identify high-memory columns, convert dtypes, use categorical encodings, drop unneeded columns, and consider external storage or sampling. Include how you'd validate correctness after reductions.
MediumTechnical
63 practiced
Write unit tests for a transformation that groups transactions by month and product and then filters out products with fewer than 5 transactions per month. Describe test cases you would include to cover edge cases and how you'd mock time-dependent behavior.
HardSystem Design
64 practiced
Design how pre-aggregated metrics should be stored and served to an interactive dashboard that expects sub-second query latencies across many dimensions (date, region, product). Consider storage format, partitioning, materialized views, cache layers, and Python jobs that refresh pre-aggregates. Provide a recommended architecture and justify choices.

Unlock Full Question Bank

Get access to hundreds of Python for Data Analysis interview questions and detailed answers.

Sign in to Continue

Join thousands of developers preparing for their dream job.