Exploratory Data Analysis Questions

Exploratory Data Analysis is the systematic process of investigating and validating a dataset to understand its structure, content, and quality before modelling or reporting. Core activities include examining schema and data types, computing descriptive statistics such as counts, means, medians, standard deviations and quartiles, and measuring class balance and unique value counts. It covers distribution analysis, outlier detection, correlation and relationship exploration, and trend or seasonality checks for time series. Data validation and quality checks include identifying missing values, anomalies, inconsistent encodings, duplicates, and other data integrity issues. Practical techniques span SQL profiling and aggregation queries using GROUP BY, COUNT and DISTINCT; interactive data exploration with pandas and similar libraries; and visualization with histograms, box plots, scatter plots, heatmaps and time series charts to reveal patterns and issues. The process also includes feature summary and basic metric computation, sampling strategies, forming and documenting hypotheses, and recommending cleaning or transformation steps. Good Exploratory Data Analysis produces a clear record of findings, assumptions to validate, and next steps for cleaning, feature engineering, or modelling.
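As a minimal sketch of the first profiling pass described above, assuming pandas is available: the table and its column names (`customer_id`, `plan`, `monthly_spend`) are hypothetical toy data, not part of any real schema.

```python
import pandas as pd

# Hypothetical toy data standing in for a real table.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "plan": ["basic", "Basic", "Basic", "pro", None],
    "monthly_spend": [20.0, 35.0, 35.0, None, 400.0],
})

# Schema, missing values, duplicates and cardinality in one pass.
profile = {
    "dtypes": df.dtypes.astype(str).to_dict(),
    "missing_per_column": df.isna().sum().to_dict(),
    "duplicate_rows": int(df.duplicated().sum()),
    "unique_per_column": df.nunique().to_dict(),
}

# Descriptive statistics (count, mean, std, quartiles) for a numeric column.
summary = df["monthly_spend"].describe()

# Category frequencies, including missing values, expose inconsistent
# encodings such as "basic" vs "Basic".
plan_counts = df["plan"].value_counts(dropna=False)
```

Even on toy data this kind of pass surfaces the issues the description lists: a duplicated row, missing values, and two spellings of the same category.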

Hard, Technical


You have hundreds of features with suspected multicollinearity. Propose a practical, scalable approach to detect and mitigate multicollinearity: include approximate VIF computation for large feature sets, correlation-based feature clustering, PCA or truncated SVD options, use of regularized models, and a plan to preserve interpretability for stakeholders.
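One possible starting point for this question, sketched with NumPy on synthetic data. It relies on the fact that, for an invertible correlation matrix, the diagonal of its inverse gives every feature's VIF at once, which avoids fitting one regression per feature. Using the pseudo-inverse for numerical stability, the 0.9 threshold, the greedy grouping rule and the feature names are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
base = rng.normal(size=n)
X = np.column_stack([
    base,
    base + 0.05 * rng.normal(size=n),  # near-duplicate of column 0
    rng.normal(size=n),
    rng.normal(size=n),
])
names = ["f0", "f1_near_copy_of_f0", "f2", "f3"]  # hypothetical names

corr = np.corrcoef(X, rowvar=False)

# All VIFs from a single matrix inversion; pinv is used here as a
# stability assumption for nearly singular correlation matrices.
vif = np.diag(np.linalg.pinv(corr))


def correlation_clusters(corr, threshold=0.9):
    """Greedy grouping: each unassigned feature seeds a cluster and
    absorbs the remaining features whose |r| to the seed >= threshold."""
    unassigned = list(range(corr.shape[0]))
    clusters = []
    while unassigned:
        seed = unassigned.pop(0)
        members = [seed] + [
            j for j in unassigned if abs(corr[seed, j]) >= threshold
        ]
        unassigned = [j for j in unassigned if j not in members]
        clusters.append(members)
    return clusters


clusters = correlation_clusters(corr)
```

Keeping one named representative per cluster is one way to reduce collinearity while leaving stakeholders with original features rather than principal components.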

Hard, Technical


A classification problem has an extremely imbalanced target and missing values that correlate strongly with the positive class. Propose an EDA-driven strategy for feature creation, sampling or weighting, validation schemes (e.g., stratified time splits), and steps to quantify the risk of optimistic bias or leakage introduced by handling the missingness.
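A rough sketch of two checks that could support an answer here, in plain Python: inverse-frequency class weights (the `n_samples / (n_classes * count)` heuristic) and forward-chaining time splits with a count of positives per test fold. The label sequence is made up, and both helper functions are hypothetical names introduced for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}


def forward_chaining_splits(n_rows, n_folds):
    """Yield (train_idx, test_idx) for rows already sorted by time.
    Every test fold lies strictly after its training rows."""
    fold_size = n_rows // (n_folds + 1)
    for i in range(1, n_folds + 1):
        train_end = i * fold_size
        test_end = n_rows if i == n_folds else train_end + fold_size
        yield list(range(train_end)), list(range(train_end, test_end))


# Hypothetical time-ordered labels: 2 positives in each block of 25 rows.
labels = ([0] * 23 + [1] * 2) * 4  # 100 rows, 8% positive
weights = balanced_class_weights(labels)

# With a rare positive class, confirm each test fold has positives at all
# before trusting any metric computed on it.
positives_per_test_fold = [
    sum(labels[i] for i in test_idx)
    for _, test_idx in forward_chaining_splits(len(labels), n_folds=3)
]
```

Any imputation or missingness indicator would then be fitted on each training fold only, so that comparing in-fold and out-of-fold scores gives a first estimate of optimistic bias.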

Medium, Technical


You're preparing a churn model and notice that 'last_login' is missing significantly more often for customers who churned. During EDA, how would you test whether the missingness is informative (predictive of churn), and what encoding or modeling choices would you make based on your findings?
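One simple way to test this, sketched in plain Python: cross-tabulate a missingness flag against churn, compare churn rates, and compute the Pearson chi-square statistic for the 2x2 table (without continuity correction). The counts below are invented, and 3.841 is the critical value for one degree of freedom at the 5% level.

```python
from collections import Counter

# Hypothetical rows: (last_login, churned). Counts are made up.
rows = (
    [(None, 1)] * 30 + [(None, 0)] * 10
    + [("2024-01-15", 1)] * 20 + [("2024-01-15", 0)] * 140
)

# 2x2 contingency table keyed by (is_missing, churned).
counts = Counter((last_login is None, churned) for last_login, churned in rows)
a, b = counts[(True, 1)], counts[(True, 0)]    # missing: churned / retained
c, d = counts[(False, 1)], counts[(False, 0)]  # present: churned / retained

churn_rate_missing = a / (a + b)
churn_rate_present = c / (c + d)

# Pearson chi-square for a 2x2 table, no continuity correction.
n = a + b + c + d
chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
missingness_is_informative = chi2 > 3.841  # df=1, alpha=0.05

# If informative, keep the signal as an explicit indicator feature
# instead of silently imputing it away.
last_login_missing = [int(last_login is None) for last_login, _ in rows]
```

A significant result should also prompt a check on why the value is missing: if 'last_login' is only blanked after an account closes, the indicator leaks the label rather than predicting it.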

Medium, Technical


Describe concrete feature engineering steps you would perform on a timestamp column for modeling purposes: extracting cyclical features (hour/day/season), creating lag and rolling statistics, handling time zones and daylight saving time, and preventing leakage. Provide examples of transformations and when each is helpful during EDA.
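Two of these transformations can be sketched in plain Python, assuming timestamps have already been normalized to UTC (which sidesteps daylight saving ambiguity for storage, though local hour may still be the better feature). Both function names are hypothetical.

```python
import math
from datetime import datetime, timezone


def cyclical_hour(ts):
    """Map hour-of-day onto the unit circle so 23:00 and 00:00 are close."""
    angle = 2 * math.pi * ts.hour / 24
    return math.sin(angle), math.cos(angle)


def lagged_rolling_mean(values, window):
    """Mean of the previous `window` values, excluding the current one,
    so the feature never sees the value it will be used to predict.
    Returns None until enough history exists."""
    return [
        sum(values[i - window:i]) / window if i >= window else None
        for i in range(len(values))
    ]


late = cyclical_hour(datetime(2024, 3, 1, 23, tzinfo=timezone.utc))
midnight = cyclical_hour(datetime(2024, 3, 2, 0, tzinfo=timezone.utc))
noon = cyclical_hour(datetime(2024, 3, 2, 12, tzinfo=timezone.utc))

rolling = lagged_rolling_mean([1, 2, 3, 4, 5], window=2)
```

The sine/cosine pair helps models that treat inputs as distances, while excluding the current row from the rolling window is the basic guard against leakage.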

Medium, Technical


Given a table events(user_id, event_date DATE, revenue DECIMAL), write ANSI SQL using window functions to compute a 7-day rolling average revenue per user and flag days where the daily revenue is greater than rolling_mean + 2 * rolling_stddev for that user. Include partitioning by user and appropriate window framing.
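A hedged sketch of the window-function shape such an answer might take, run here through Python's built-in `sqlite3` as a stand-in engine (window functions need SQLite 3.25 or later). Several assumptions apply: the data is invented, there is exactly one row per user per day with no gaps (so `ROWS BETWEEN 6 PRECEDING AND CURRENT ROW` covers 7 days), and because SQLite has no built-in standard deviation aggregate, a population standard deviation is derived from the rolling mean of squares. On engines that offer `STDDEV_SAMP` or `STDDEV_POP` as window aggregates, the same window could be used with those directly.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (user_id INTEGER, event_date TEXT, revenue REAL)"
)
# Hypothetical data: seven flat days, then a spike on the eighth.
rows = [(1, f"2024-01-{day:02d}", 10.0) for day in range(1, 8)]
rows.append((1, "2024-01-08", 100.0))
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)

query = """
SELECT
    user_id,
    event_date,
    revenue,
    AVG(revenue) OVER (
        PARTITION BY user_id ORDER BY event_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS rolling_mean,
    AVG(revenue * revenue) OVER (
        PARTITION BY user_id ORDER BY event_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS rolling_mean_sq
FROM events
ORDER BY user_id, event_date
"""

flagged = []
for user_id, event_date, revenue, mean, mean_sq in conn.execute(query):
    # Population std from E[x^2] - E[x]^2, clamped against rounding error.
    rolling_std = math.sqrt(max(mean_sq - mean * mean, 0.0))
    if revenue > mean + 2 * rolling_std:
        flagged.append((user_id, event_date))
```

Note that each day's revenue is included in its own window here; excluding it (`ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING`) would make spikes easier to flag.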
