InterviewStack.io

Exploratory Data Analysis Questions

Exploratory Data Analysis is the systematic process of investigating and validating a dataset to understand its structure, content, and quality before modelling or reporting. Core activities include examining schema and data types, computing descriptive statistics such as counts, means, medians, standard deviations and quartiles, and measuring class balance and unique value counts. It covers distribution analysis, outlier detection, correlation and relationship exploration, and trend or seasonality checks for time series. Data validation and quality checks include identifying missing values, anomalies, inconsistent encodings, duplicates, and other data integrity issues. Practical techniques span SQL profiling and aggregation queries using GROUP BY, COUNT and DISTINCT; interactive data exploration with pandas and similar libraries; and visualization with histograms, box plots, scatter plots, heatmaps and time series charts to reveal patterns and issues. The process also includes feature summary and basic metric computation, sampling strategies, forming and documenting hypotheses, and recommending cleaning or transformation steps. Good Exploratory Data Analysis produces a clear record of findings, assumptions to validate, and next steps for cleaning, feature engineering, or modelling.
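As a minimal sketch of the descriptive-statistics and data-quality checks described above, the following profiles a single column using only the Python standard library. The function name `profile_column` and the convention that `None` marks a missing value are assumptions made for illustration; in practice pandas methods such as `DataFrame.describe()` and `DataFrame.isna()` cover similar ground.

```python
import statistics
from collections import Counter

def profile_column(values):
    """Summarize one column given as a list; None marks a missing value (assumption)."""
    present = [v for v in values if v is not None]
    summary = {
        "count": len(values),
        "missing": len(values) - len(present),
        "unique": len(set(present)),
    }
    numeric = [
        v for v in present
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    # Numeric summary only when every present value is numeric and there are
    # at least two of them (statistics.quantiles needs two data points).
    if len(numeric) >= 2 and len(numeric) == len(present):
        q1, median, q3 = statistics.quantiles(numeric, n=4, method="inclusive")
        summary.update(
            mean=statistics.fmean(numeric),
            std=statistics.stdev(numeric),  # sample standard deviation
            q1=q1,
            median=median,
            q3=q3,
        )
    else:
        # Categorical or mixed column: report the most frequent values instead.
        summary["top"] = Counter(present).most_common(3)
    return summary
```

Running this over every column gives a quick table of counts, missing values, and cardinality that can be compared against the expected schema before any modelling.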

Hard · Technical
Design an EDA workflow to detect and quantify potential bias in a dataset with respect to protected attributes (e.g., gender, race). Include statistical tests, disaggregated performance checks, subgroup sample sizes and confidence intervals, fairness metrics (demographic parity difference, equal opportunity), visualization choices, and how to prepare a diagnostic report and communicate risks to non-technical stakeholders.
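One possible starting point for the disaggregated checks is sketched below. It assumes binary labels and predictions encoded as 0/1, and the helper names (`wilson_interval`, `group_rates`, `fairness_gaps`) are hypothetical. It reports subgroup sample sizes, a Wilson score interval for each selection rate, and the two gap metrics named in the question, taken here as the largest difference between any two groups.

```python
import math
from collections import defaultdict

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, center - half), min(1.0, center + half))

def group_rates(groups, y_true, y_pred):
    """Per-group n, selection rate P(pred=1) with CI, and TPR P(pred=1 | true=1)."""
    stats = defaultdict(lambda: {"n": 0, "pred_pos": 0, "pos": 0, "tp": 0})
    for g, t, p in zip(groups, y_true, y_pred):
        s = stats[g]
        s["n"] += 1
        s["pred_pos"] += p
        s["pos"] += t
        s["tp"] += t * p
    out = {}
    for g, s in stats.items():
        out[g] = {
            "n": s["n"],
            "selection_rate": s["pred_pos"] / s["n"],
            "selection_ci": wilson_interval(s["pred_pos"], s["n"]),
            # TPR is undefined for a group with no positive labels.
            "tpr": s["tp"] / s["pos"] if s["pos"] else float("nan"),
        }
    return out

def fairness_gaps(rates):
    """Largest between-group gap in selection rate and in true positive rate."""
    sel = [r["selection_rate"] for r in rates.values()]
    tpr = [r["tpr"] for r in rates.values() if not math.isnan(r["tpr"])]
    return {
        "demographic_parity_difference": max(sel) - min(sel),
        "equal_opportunity_difference": max(tpr) - min(tpr) if tpr else float("nan"),
    }
```

Reporting `n` and the interval next to each rate helps keep a large gap in a tiny subgroup from being over-read; a diagnostic report would usually show these side by side.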
Hard · System Design
You manage IoT time-series data for millions of devices. Design efficient SQL and Spark-based aggregations to compute seasonality metrics per entity (autocorrelation, daily/weekly aggregates), detect anomalous temporal behavior, and produce a compact per-entity summary. Discuss storage layout (Parquet partitioning), incremental computation, and strategies to limit compute costs while scaling.
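The per-entity seasonality metric can be prototyped on a single device before it is pushed into SQL or Spark. The sketch below is plain Python and assumes an evenly spaced hourly series with no gaps; the function names and the lags of 24 and 168 samples (daily and weekly) are illustrative choices, not part of the question.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation at a given lag for an evenly spaced series."""
    n = len(series)
    if lag <= 0 or lag >= n:
        raise ValueError("lag must be between 1 and len(series) - 1")
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    if var == 0:
        return 0.0  # constant series: no seasonality signal (convention chosen here)
    cov = sum((series[i] - mean) * (series[i + lag] - mean) for i in range(n - lag))
    return cov / var

def summarize_entity(hourly_values, daily_lag=24, weekly_lag=168):
    """Compact per-entity summary; lags assume one sample per hour (assumption)."""
    n = len(hourly_values)
    summary = {"n": n, "mean": sum(hourly_values) / n if n else None}
    for name, lag in (("acf_daily", daily_lag), ("acf_weekly", weekly_lag)):
        # Skip lags that the series is too short to support.
        summary[name] = autocorrelation(hourly_values, lag) if n > lag else None
    return summary
```

At scale the same quantities reduce to sums of `x`, `x*x`, and `x * lagged_x` per entity, which is what makes them candidates for incremental aggregation.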
Hard · Technical
Implement a memory-efficient Python function to compute the pairwise Pearson correlation matrix for a very large dense numeric matrix (n_samples x n_features) by using block-wise operations and BLAS-backed numpy operations to limit peak memory. Discuss parallelization options, numerical stability concerns, and how you'd validate correctness against a full in-memory computation on small data.
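A minimal sketch of the block-wise approach, assuming the input fits in memory (or is a memory-mapped array) and only the temporaries need to be bounded. The function name `blockwise_corr` and the default `block_size` are hypothetical. Each pair of column blocks is standardized on the fly and multiplied with a single matrix product, so the extra memory beyond the output is roughly two `n_samples x block_size` arrays.

```python
import numpy as np

def blockwise_corr(X, block_size=256):
    """Pearson correlation between columns of X, computed one block pair at a time."""
    X = np.asarray(X, dtype=np.float64)
    n, p = X.shape
    mean = X.mean(axis=0)
    std = X.std(axis=0)  # ddof=0; the choice of ddof cancels out in the correlation
    corr = np.empty((p, p), dtype=np.float64)
    # Zero-variance columns produce NaN rows/columns, matching np.corrcoef.
    with np.errstate(divide="ignore", invalid="ignore"):
        for i in range(0, p, block_size):
            si = slice(i, i + block_size)
            Zi = (X[:, si] - mean[si]) / std[si]
            for j in range(i, p, block_size):
                sj = slice(j, j + block_size)
                Zj = (X[:, sj] - mean[sj]) / std[sj]
                block = (Zi.T @ Zj) / n  # BLAS-backed matrix product
                corr[si, sj] = block
                corr[sj, si] = block.T  # fill the symmetric counterpart
    return corr
```

Validation on small data can be as simple as comparing against `np.corrcoef(X, rowvar=False)` with `np.allclose`. Centering each block before the product, rather than using the raw-moment formula, is the choice made here to reduce cancellation error; the independent block pairs are also the natural unit for parallel workers.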
Medium · Technical
Implement a Python function to compute Population Stability Index (PSI) between two numeric distributions (e.g., training and live) with configurable number of bins and binning approach (quantiles vs fixed). Show how you choose bins and interpret PSI thresholds (e.g., <0.1 stable, 0.1-0.2 moderate, >0.2 large). Include edge-case handling for zero counts.
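One way to approach this is sketched below, using the common definition PSI = sum over bins of (a - e) * ln(a / e), where e and a are the bin proportions in the reference and live samples. The function name `psi`, the `eps` floor for empty bins, and the choice to leave the outermost bins open-ended are assumptions of this sketch rather than a fixed standard.

```python
import numpy as np

def psi(expected, actual, n_bins=10, binning="quantile", eps=1e-6):
    """Population Stability Index of `actual` relative to `expected`."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    if expected.size == 0 or actual.size == 0:
        raise ValueError("both samples must be non-empty")
    # Bin edges always come from the reference (expected) sample.
    if binning == "quantile":
        edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    elif binning == "fixed":
        edges = np.linspace(expected.min(), expected.max(), n_bins + 1)
    else:
        raise ValueError("binning must be 'quantile' or 'fixed'")
    # Heavy ties can produce duplicate edges; collapse them.
    edges = np.unique(edges)
    # Use only interior edges so the first and last bins are open-ended and
    # live values outside the reference range are still counted.
    inner = edges[1:-1]
    n_cells = len(inner) + 1
    e_counts = np.bincount(np.searchsorted(inner, expected, side="right"), minlength=n_cells)
    a_counts = np.bincount(np.searchsorted(inner, actual, side="right"), minlength=n_cells)
    # Floor proportions at eps so empty bins do not yield log(0) or division by zero.
    e_pct = np.clip(e_counts / e_counts.sum(), eps, None)
    a_pct = np.clip(a_counts / a_counts.sum(), eps, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))
```

Quantile bins give each reference bin roughly equal mass, which tends to make the index less sensitive to sparse tails than fixed-width bins. The result can then be read against the thresholds in the question (below 0.1 stable, 0.1 to 0.2 moderate, above 0.2 large), keeping in mind that the value depends on the number of bins and on `eps`.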
Hard · Technical
Design EDA methods to evaluate instance segmentation mask quality: compute IoU distribution between annotators and against a reference, per-class IoU statistics, boundary-smoothness measures (e.g., Hausdorff distance), and create visualization panels that overlay masks and highlight areas of disagreement. Describe automation techniques to flag problematic masks for re-annotation.
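The IoU and disagreement pieces might be sketched as follows, assuming each mask is a boolean (or 0/1) NumPy array of the same shape. The function names, the 0.5 threshold, and the convention that two empty masks count as full agreement are illustrative assumptions. Boundary measures such as Hausdorff distance are not covered here.

```python
import numpy as np

def mask_iou(a, b):
    """Intersection over union of two binary masks with identical shape."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0  # both masks empty: treated as agreement (assumption)
    return float(np.logical_and(a, b).sum() / union)

def disagreement_map(a, b):
    """Boolean map of pixels labeled by exactly one of the two masks."""
    return np.logical_xor(np.asarray(a, dtype=bool), np.asarray(b, dtype=bool))

def flag_low_agreement(masks_a, masks_b, threshold=0.5):
    """Return indices of mask pairs whose IoU is below `threshold`, plus all IoUs."""
    ious = [mask_iou(a, b) for a, b in zip(masks_a, masks_b)]
    flagged = [i for i, v in enumerate(ious) if v < threshold]
    return flagged, ious
```

The list of IoUs feeds directly into a histogram or per-class summary, and the disagreement map can be overlaid on the image to show where annotators differ.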
