Covers the practical use of Python and its data libraries for data ingestion, cleaning, transformation, analysis, and aggregation. Candidates should be able to manipulate DataFrames, perform complex grouping and aggregation operations, merge and join multiple data sources, and implement efficient vectorized operations using libraries such as Pandas and NumPy. Expect to write clear, idiomatic Python with appropriate error handling, input validation, and small tests or assertions. At more senior levels, discuss performance trade-offs and scalability strategies, such as when to drop from Pandas to raw NumPy vectorization and when to adopt alternative tools like Polars or Dask for very large datasets, as well as techniques for memory management, profiling, and incremental or streaming processing. Also cover reproducibility, serialization formats, and integrating analysis into pipelines.
Easy | Technical
You have two tables as pandas DataFrames: 'customers' (customer_id, name, country) and 'orders' (order_id, customer_id, amount). Implement a robust left join in Python to attach customer info to orders, ensuring no accidental many-to-many explosion when keys are duplicated. Explain how you'd detect and handle unexpected duplicate keys in customers.
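One possible approach, sketched below, is to check the customers table for repeated keys before merging and to let pandas enforce the expected cardinality with `validate="many_to_one"`. The function name `attach_customers` and the sample rows are hypothetical; raising on duplicates is an assumption, and deduplicating by a business rule (for example, keeping the most recently updated record) may be more appropriate in practice.

```python
import pandas as pd


def attach_customers(orders: pd.DataFrame, customers: pd.DataFrame) -> pd.DataFrame:
    """Left-join customer attributes onto orders without multiplying order rows."""
    # Input validation: fail early if an expected column is absent.
    for label, frame, cols in (
        ("orders", orders, {"order_id", "customer_id", "amount"}),
        ("customers", customers, {"customer_id", "name", "country"}),
    ):
        missing = cols - set(frame.columns)
        if missing:
            raise KeyError(f"{label} is missing columns: {sorted(missing)}")

    # Detect duplicated customer keys up front so the error names the offenders.
    dupes = customers[customers.duplicated("customer_id", keep=False)]
    if not dupes.empty:
        keys = sorted(dupes["customer_id"].unique().tolist())
        raise ValueError(f"duplicate customer_id values in customers: {keys}")

    merged = orders.merge(
        customers,
        on="customer_id",
        how="left",
        validate="many_to_one",  # pandas raises MergeError if right-side keys repeat
        indicator=True,          # adds a _merge column: 'both' or 'left_only'
    )
    # A left join against unique keys must keep exactly one row per order.
    assert len(merged) == len(orders), "left join changed the row count"
    return merged


# Hypothetical sample data; customer 9 has no matching customer record.
customers = pd.DataFrame(
    {"customer_id": [1, 2], "name": ["Ana", "Bo"], "country": ["PT", "SE"]}
)
orders = pd.DataFrame(
    {"order_id": [10, 11, 12], "customer_id": [1, 2, 9], "amount": [5.0, 7.5, 3.0]}
)
merged = attach_customers(orders, customers)
unmatched = merged[merged["_merge"] == "left_only"]
```

The `_merge` indicator makes orders without a matching customer easy to count and report, which is often worth logging alongside the duplicate check.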
Medium | Technical
Explain how you'd profile a slow pandas pipeline that merges, groups, and applies a custom function. Describe tools and steps (line_profiler, memory_profiler, %timeit, and pandas_profiling, now published as ydata-profiling, for inspecting the data itself rather than runtime), what hotspots to look for, and how to interpret the results to decide whether vectorization or alternative tools are needed.
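As a minimal sketch of the workflow, the example below uses only the standard-library `cProfile`, `pstats`, and `timeit` modules rather than the third-party profilers named in the question. The pipeline, the custom function `slow_score`, and the synthetic data are all hypothetical. The idea is to profile the row-wise version, look for the custom function and `apply` machinery near the top of the cumulative-time report, and then time a vectorized rewrite against it.

```python
import cProfile
import io
import pstats
import timeit

import numpy as np
import pandas as pd


def slow_score(row):
    """Hypothetical row-wise custom function: a 10% uplift for US orders."""
    return row["amount"] * 1.1 if row["country"] == "US" else row["amount"]


def pipeline(orders, customers):
    merged = orders.merge(customers, on="customer_id", how="left", validate="many_to_one")
    merged["score"] = merged.apply(slow_score, axis=1)  # Python-level loop over rows
    return merged.groupby("country", as_index=False)["score"].sum()


def pipeline_vectorized(orders, customers):
    merged = orders.merge(customers, on="customer_id", how="left", validate="many_to_one")
    # Same rule expressed as one array operation instead of a per-row call.
    merged["score"] = np.where(
        merged["country"].eq("US"), merged["amount"] * 1.1, merged["amount"]
    )
    return merged.groupby("country", as_index=False)["score"].sum()


# Synthetic data, seeded so runs are repeatable.
rng = np.random.default_rng(0)
customers = pd.DataFrame(
    {"customer_id": range(100), "country": rng.choice(["US", "DE", "JP"], 100)}
)
orders = pd.DataFrame(
    {"customer_id": rng.integers(0, 100, 5_000), "amount": rng.uniform(1, 100, 5_000)}
)

# Step 1: function-level profile of the slow version, sorted by cumulative time.
profiler = cProfile.Profile()
profiler.enable()
slow_result = pipeline(orders, customers)
profiler.disable()
buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(10)
report = buffer.getvalue()
print(report)

# Step 2: confirm the rewrite is equivalent, then compare wall-clock time.
fast_result = pipeline_vectorized(orders, customers)
slow_seconds = timeit.timeit(lambda: pipeline(orders, customers), number=3)
fast_seconds = timeit.timeit(lambda: pipeline_vectorized(orders, customers), number=3)
print(f"apply: {slow_seconds:.3f}s  vectorized: {fast_seconds:.3f}s")
```

If the profile instead shows the merge or groupby dominating, vectorizing the custom function will not help much, and the next things to check are key dtypes, memory use, and whether the data size justifies a tool such as Polars or Dask.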
Medium | Technical
Implement code to compute the month-over-month percent change for a metric, with error handling for missing months and zero denominators. Return a DataFrame with columns ['month','metric','mom_pct_change'] and ensure percentage values are finite and human-readable.
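A minimal sketch follows. It assumes one row per month, a `month` column parseable by `pd.to_datetime`, and that "finite and human-readable" can be met by rounding to two decimals and reporting NaN wherever the change is undefined (first month, missing neighbor, or zero denominator). The function name and sample values are hypothetical.

```python
import pandas as pd


def mom_pct_change(df: pd.DataFrame) -> pd.DataFrame:
    """Month-over-month percent change; undefined changes are NaN, never inf."""
    columns = ["month", "metric", "mom_pct_change"]
    missing = {"month", "metric"} - set(df.columns)
    if missing:
        raise KeyError(f"missing columns: {sorted(missing)}")
    if df.empty:
        return pd.DataFrame(columns=columns)

    months = pd.PeriodIndex(pd.to_datetime(df["month"]).dt.to_period("M"))
    if not months.is_unique:
        raise ValueError("expected one row per month; aggregate duplicates first")

    series = pd.Series(pd.to_numeric(df["metric"]).to_numpy(), index=months).sort_index()

    # Reindex onto a gap-free monthly range so a missing month shows up as NaN
    # instead of silently comparing, say, May against March.
    full = pd.period_range(series.index.min(), series.index.max(), freq="M")
    series = series.reindex(full)

    previous = series.shift(1)
    # Mask zero denominators so the division yields NaN rather than +/-inf.
    pct = (series - previous) / previous.where(previous != 0) * 100

    return pd.DataFrame(
        {
            "month": full.astype(str),
            "metric": series.to_numpy(),
            "mom_pct_change": pct.round(2).to_numpy(),
        }
    )


# Hypothetical input: March is zero and May is absent.
raw = pd.DataFrame(
    {
        "month": ["2024-01", "2024-02", "2024-03", "2024-04", "2024-06"],
        "metric": [100, 120, 0, 50, 75],
    }
)
result = mom_pct_change(raw)
```

In this sample only February and March get a numeric change: April follows a zero, May is missing, and June follows a missing month, so those rows stay NaN.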
Hard | Technical
A nightly BI job started producing different KPI numbers after a schema change upstream. Outline a debugging plan in Python/pandas to identify the cause: steps to compare schemas, sample rows, null distributions, and row counts; write code to detect silently truncated columns and unexpected type changes.
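One way to start, assuming a snapshot of the data from before the schema change is still available, is a column-by-column comparison of dtypes, null rates, and maximum text lengths. The sketch below is illustrative only: the helper names and sample frames are hypothetical, and flagging a column as possibly truncated when its longest value got shorter is a heuristic that needs confirming against the upstream column definitions.

```python
import pandas as pd


def max_text_length(col: pd.Series) -> float:
    """Longest string representation among non-null values (NaN if none)."""
    values = col.dropna()
    if values.empty:
        return float("nan")
    return float(values.astype(str).str.len().max())


def profile_column(col) -> dict:
    """Summary statistics for one column; placeholders if the column is absent."""
    if col is None:
        return {"dtype": "missing", "null_rate": float("nan"), "max_len": float("nan")}
    return {
        "dtype": str(col.dtype),
        "null_rate": float(col.isna().mean()),
        "max_len": max_text_length(col),
    }


def compare_snapshots(before: pd.DataFrame, after: pd.DataFrame) -> pd.DataFrame:
    """One row per column, with before/after statistics and two warning flags."""
    rows = []
    for name in sorted(set(before.columns) | set(after.columns)):
        old = before[name] if name in before.columns else None
        new = after[name] if name in after.columns else None
        status = "removed" if new is None else "added" if old is None else "common"
        row = {"column": name, "status": status}
        row.update({f"{k}_before": v for k, v in profile_column(old).items()})
        row.update({f"{k}_after": v for k, v in profile_column(new).items()})
        rows.append(row)

    report = pd.DataFrame(rows).set_index("column")
    common = report["status"] == "common"
    report["dtype_changed"] = common & (report["dtype_before"] != report["dtype_after"])
    # Heuristic: a shorter maximum length after the change hints at truncation.
    report["maybe_truncated"] = report["max_len_after"] < report["max_len_before"]
    return report


# Hypothetical snapshots taken before and after the upstream change.
before = pd.DataFrame(
    {
        "order_id": [1, 2, 3],
        "note": ["delivered to front desk", "ok", None],
        "amount": [10.5, 20.0, 7.25],
    }
)
after = pd.DataFrame(
    {
        "order_id": ["1", "2", "3"],           # integer keys now arrive as text
        "note": ["delivered ", "ok", None],    # cut off at 10 characters
        "amount": [10.5, 20.0, 7.25],
        "region": ["EU", "EU", "US"],          # new column
    }
)
row_counts = {"before": len(before), "after": len(after)}
report = compare_snapshots(before, after)
print(row_counts)
print(report[["status", "dtype_before", "dtype_after", "dtype_changed", "maybe_truncated"]])
```

From here the plan would narrow to the flagged columns: sample affected rows, check whether changed key types break downstream joins, and recompute the KPI on both snapshots to confirm which change moves the number.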
Easy | Technical
Write a pandas snippet that computes, for each product_id, the total revenue, average order amount, and order count for completed orders from a DataFrame 'orders'. Return a DataFrame with columns ['product_id','total_revenue','avg_order_amount','order_count']. Use vectorized groupby/agg and show sample output for the input product_id: [A,A,B], amount: [10,20,15], status: ['complete','complete','cancelled'].
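A minimal sketch using named aggregation is shown below. It assumes "completed" corresponds to the status value 'complete' used in the sample input; the function name `summarize_completed` is hypothetical.

```python
import pandas as pd


def summarize_completed(orders: pd.DataFrame, completed_status: str = "complete") -> pd.DataFrame:
    """Per-product revenue, average amount, and order count for completed orders."""
    missing = {"product_id", "amount", "status"} - set(orders.columns)
    if missing:
        raise KeyError(f"orders is missing columns: {sorted(missing)}")

    completed = orders.loc[orders["status"].eq(completed_status)]
    return completed.groupby("product_id", as_index=False).agg(
        total_revenue=("amount", "sum"),
        avg_order_amount=("amount", "mean"),
        order_count=("amount", "count"),  # counts non-null amounts only
    )


orders = pd.DataFrame(
    {
        "product_id": ["A", "A", "B"],
        "amount": [10, 20, 15],
        "status": ["complete", "complete", "cancelled"],
    }
)
result = summarize_completed(orders)
print(result.to_dict("records"))
# [{'product_id': 'A', 'total_revenue': 30, 'avg_order_amount': 15.0, 'order_count': 2}]
```

Product B does not appear in the result because its only order was cancelled. If products with no completed orders should be listed with zeros, the result could be reindexed against the full set of product IDs.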