Exploratory Data Analysis is the systematic process of investigating and validating a dataset to understand its structure, content, and quality before modelling or reporting. Core activities include examining schema and data types, computing descriptive statistics such as counts, means, medians, standard deviations and quartiles, and measuring class balance and unique value counts. It covers distribution analysis, outlier detection, correlation and relationship exploration, and trend or seasonality checks for time series. Data validation and quality checks include identifying missing values, anomalies, inconsistent encodings, duplicates, and other data integrity issues. Practical techniques span SQL profiling and aggregation queries using GROUP BY, COUNT and DISTINCT; interactive data exploration with pandas and similar libraries; and visualization with histograms, box plots, scatter plots, heatmaps and time series charts to reveal patterns and issues. The process also includes feature summary and basic metric computation, sampling strategies, forming and documenting hypotheses, and recommending cleaning or transformation steps. Good Exploratory Data Analysis produces a clear record of findings, assumptions to validate, and next steps for cleaning, feature engineering, or modelling.
Medium · Technical
Compare histograms, boxplots, violin plots, and KDE/density plots: describe when each visualization is most appropriate during EDA, what aspects of the distribution they emphasize (spread, multimodality, tails), and give a concrete example of how you'd use each to analyze a skewed numeric feature before modeling.
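One way to see the four views side by side: a minimal matplotlib/scipy sketch on a synthetic lognormal feature (the sample size, seed, and output filename are illustrative, not part of the question).

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

# Synthetic right-skewed feature, standing in for e.g. a transaction amount
rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=2000)

fig, axes = plt.subplots(2, 2, figsize=(10, 8))
axes[0, 0].hist(x, bins=50)
axes[0, 0].set_title("Histogram: overall shape, modes, bin-level detail")
axes[0, 1].boxplot(x, vert=False)
axes[0, 1].set_title("Box plot: median, IQR, flagged tail points")
axes[1, 0].violinplot(x, vert=False)
axes[1, 0].set_title("Violin: density profile plus spread in one panel")
grid = np.linspace(x.min(), x.max(), 200)
axes[1, 1].plot(grid, gaussian_kde(x)(grid))
axes[1, 1].set_title("KDE: smooth density, emphasizes tail weight")
fig.tight_layout()
fig.savefig("skewed_feature_eda.png")
```

For a skewed feature, the histogram shows the long right tail directly, the box plot compresses that tail into individually flagged points, and re-running the KDE on `np.log(x)` is a quick visual check of whether a log transform symmetrizes the feature before modeling.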
Hard · Technical
Design EDA methods to evaluate instance segmentation mask quality: compute IoU distribution between annotators and against a reference, per-class IoU statistics, boundary-smoothness measures (e.g., Hausdorff distance), and create visualization panels that overlay masks and highlight areas of disagreement. Describe automation techniques to flag problematic masks for re-annotation.
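A toy sketch of the pairwise mask-IoU building block the question starts from, assuming masks arrive as boolean NumPy arrays; the XOR disagreement map is the layer you would overlay in a review panel to highlight where annotators differ.

```python
import numpy as np

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection-over-union between two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / float(union) if union else 1.0  # two empty masks agree

# Toy example: two annotators disagree along one edge of the object
ann1 = np.zeros((10, 10), dtype=bool); ann1[2:8, 2:8] = True  # 36 px
ann2 = np.zeros((10, 10), dtype=bool); ann2[3:8, 3:8] = True  # 25 px

iou = mask_iou(ann1, ann2)
disagreement = np.logical_xor(ann1, ann2)  # pixels to highlight in an overlay
```

In a full pipeline you would compute this per instance and per class, aggregate the IoU values into distributions, and flag masks below a per-class threshold for re-annotation; boundary measures such as Hausdorff distance are available in `scipy.spatial.distance`.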
Hard · Technical
Implement a Python function that computes Mahalanobis distance-based outliers for a numeric DataFrame X. Use a robust covariance estimator (e.g., sklearn.covariance.MinCovDet) to compute center and covariance, compute squared Mahalanobis distances, and flag rows with distance > chi2.ppf(0.975, df=n_features). Explain limitations and alternatives for high-dimensional data.
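One possible implementation, following the robust estimator and chi-squared cutoff the question names (the demo data and the planted outlier are synthetic):

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

def mahalanobis_outliers(X: pd.DataFrame, quantile: float = 0.975,
                         random_state: int = 0):
    """Flag rows whose squared robust Mahalanobis distance exceeds the
    chi-squared quantile with df = n_features."""
    mcd = MinCovDet(random_state=random_state).fit(X.values)
    d2 = mcd.mahalanobis(X.values)  # squared distances to the robust center
    cutoff = chi2.ppf(quantile, df=X.shape[1])
    flags = pd.Series(d2 > cutoff, index=X.index, name="is_outlier")
    return flags, d2

# Demo on synthetic data with one planted gross outlier
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
X.iloc[0] = [10.0, 10.0, 10.0]
flags, d2 = mahalanobis_outliers(X)
```

Limitations worth raising: the chi-squared cutoff assumes approximately Gaussian inliers, and MinCovDet degrades when n_features approaches n_samples; for high-dimensional data, alternatives include projecting first (PCA), Isolation Forest, or local outlier factor.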
Medium · System Design
Design a reproducible EDA notebook structure for team consumption: list key sections (data loading with fixed hashes, schema/profile, exploratory charts, hypothesis tests, findings), explain parameterization (dataset path, sample size, seed), artifact storage (plots, CSV summary), and how to integrate notebooks with version control and CI to run lightweight sanity checks automatically.
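A compressed, runnable sketch of that skeleton: a parameter cell up top (isolated so tools such as papermill can inject overrides), a content hash of the input, and summary artifacts written to disk for review and CI diffing. Synthetic data stands in for the fixed input file, and all paths and filenames are illustrative.

```python
import hashlib
import json
import pathlib

import numpy as np
import pandas as pd

# --- Parameters (in a real notebook, an isolated first cell) ---
SAMPLE_SIZE = 1_000
SEED = 42
ARTIFACT_DIR = pathlib.Path("artifacts")

# --- Data loading: synthetic stand-in for a fixed, hash-verified input ---
rng = np.random.default_rng(SEED)
df = pd.DataFrame({
    "amount": rng.lognormal(size=SAMPLE_SIZE),
    "segment": rng.choice(["a", "b"], size=SAMPLE_SIZE),
})
data_hash = hashlib.sha256(df.to_csv(index=False).encode()).hexdigest()

# --- Schema/profile section: persist artifacts reviewers can diff ---
ARTIFACT_DIR.mkdir(exist_ok=True)
df.describe(include="all").to_csv(ARTIFACT_DIR / "profile_summary.csv")
(ARTIFACT_DIR / "run_manifest.json").write_text(
    json.dumps({"seed": SEED, "rows": len(df), "data_sha256": data_hash},
               indent=2))

# --- Lightweight sanity checks a CI job could run on every commit ---
assert df["amount"].notna().all()
assert len(df) == SAMPLE_SIZE
```

Committing the manifest and summary CSV (rather than the notebook's rendered output) keeps version-control diffs small while still letting CI detect when the data or profile drifts.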
Easy · Technical
You discover that about 15% of rows in a tabular training dataset look duplicated. Describe the steps and SQL/pandas queries you would use to detect exact duplicates and near-duplicates, how you would confirm whether duplicates are legitimate repetitions or errors (use timestamps, IDs, business rules), and how you'd decide which duplicates to remove or keep for model training.
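A pandas sketch of the exact vs near-duplicate split on a toy table (column names and the key choice are illustrative). The SQL analogue of the first check is `GROUP BY` over all columns with `HAVING COUNT(*) > 1`.

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3],
    "amount":  [9.99, 9.99, 5.00, 7.50, 7.51],
    "ts": pd.to_datetime(["2024-01-01 10:00", "2024-01-01 10:00",
                          "2024-01-02 11:00", "2024-01-03 12:00",
                          "2024-01-03 12:00"]),
})

# Exact duplicates: every column identical
exact = df[df.duplicated(keep=False)]

# Near-duplicates: same business key (user + timestamp) but differing values,
# e.g. a retried write that landed with a slightly different amount
key_dupes = df[df.duplicated(subset=["user_id", "ts"], keep=False)]

# One removal policy: drop only exact copies, keep near-duplicates for review
deduped = df.drop_duplicates()
```

The exact copies are safe to drop once business rules confirm they are re-ingestion artifacts rather than legitimate repeat events; the near-duplicates (same key, different `amount`) need a rule for which record wins before training.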