Exploratory Data Analysis is the systematic process of investigating and validating a dataset to understand its structure, content, and quality before modelling or reporting. Core activities include examining schema and data types, computing descriptive statistics such as counts, means, medians, standard deviations and quartiles, and measuring class balance and unique value counts. It covers distribution analysis, outlier detection, correlation and relationship exploration, and trend or seasonality checks for time series.

Data validation and quality checks include identifying missing values, anomalies, inconsistent encodings, duplicates, and other data integrity issues. Practical techniques span SQL profiling and aggregation queries using GROUP BY, COUNT and DISTINCT; interactive data exploration with pandas and similar libraries; and visualization with histograms, box plots, scatter plots, heatmaps and time series charts to reveal patterns and issues.

The process also includes feature summary and basic metric computation, sampling strategies, forming and documenting hypotheses, and recommending cleaning or transformation steps. Good Exploratory Data Analysis produces a clear record of findings, assumptions to validate, and next steps for cleaning, feature engineering, or modelling.
Hard · Technical
Design an anomaly-detection prototype (algorithmic steps or a code outline in Python) that, given a numeric time series, uses STL decomposition to strip seasonality, computes residuals, applies MAD-based thresholds to flag point outliers, and uses an adaptive EWMA to detect recent variance shifts. The output should be annotated time ranges of anomalies with severity scores. Describe the computational complexity and the parameters to tune.
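One way to approach this question is the minimal numpy sketch below. It substitutes a simple per-phase seasonal-mean plus moving-average decomposition for full STL (in practice `statsmodels.tsa.seasonal.STL` would replace steps 1–2); the function name `detect_anomalies` and the default thresholds are illustrative, not prescribed by the question.

```python
import numpy as np

def detect_anomalies(series, period=7, mad_k=3.5, ewma_alpha=0.1):
    """Sketch: strip seasonality, flag residual outliers via a robust
    MAD z-score, and track recent variance with an EWMA."""
    x = np.asarray(series, dtype=float)
    n = len(x)
    # 1. Seasonal component: per-phase means (stand-in for STL's seasonal term)
    phases = np.arange(n) % period
    seasonal = np.array([x[phases == p].mean() for p in range(period)])[phases]
    # 2. Trend: centered moving average of the deseasonalized series
    deseason = x - seasonal
    trend = np.convolve(deseason, np.ones(period) / period, mode="same")
    resid = deseason - trend
    # 3. MAD-based robust z-score of residuals (0.6745 scales MAD to sigma)
    med = np.median(resid)
    mad = max(np.median(np.abs(resid - med)), 1e-9)
    robust_z = 0.6745 * (resid - med) / mad
    flags = np.abs(robust_z) > mad_k
    # 4. Adaptive EWMA of squared residuals to catch recent variance shifts
    ewma_var = np.empty(n)
    ewma_var[0] = resid[0] ** 2
    for i in range(1, n):
        ewma_var[i] = ewma_alpha * resid[i] ** 2 + (1 - ewma_alpha) * ewma_var[i - 1]
    variance_shift = ewma_var / max(np.median(ewma_var), 1e-9)
    # Severity: max of the point-outlier score and the variance-shift ratio
    severity = np.maximum(np.abs(robust_z) / mad_k, variance_shift / 3.0)
    return flags, severity
```

Contiguous anomalous ranges can then be read off the boolean mask (e.g. via `np.flatnonzero` and `np.diff`). Everything here is O(n) per pass; the parameters to tune are the seasonal `period`, the MAD multiplier `mad_k`, and the EWMA smoothing factor `ewma_alpha`.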
Hard · Technical
Discuss robust descriptive statistics useful for heavy-tailed financial metrics encountered during EDA: median, trimmed mean, winsorized mean, MAD, and robust covariance estimators. For each, explain advantages, limitations, and how choice impacts downstream model training and evaluation.
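The estimators the question names (other than robust covariance) are small enough to compute directly; a hedged numpy sketch, with the helper name `robust_summary` and the 10% default trim chosen for illustration:

```python
import numpy as np

def robust_summary(x, trim=0.1):
    """Robust location/scale summaries for a heavy-tailed numeric sample."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    k = int(n * trim)
    # Trimmed mean: drop the k smallest and k largest observations
    trimmed_mean = x[k:n - k].mean()
    # Winsorized mean: clamp the tails to the nearest retained value instead
    winsorized = x.copy()
    if k:
        winsorized[:k] = x[k]
        winsorized[-k:] = x[-k - 1]
    med = np.median(x)
    # 1.4826 makes MAD consistent with the standard deviation under normality
    mad = 1.4826 * np.median(np.abs(x - med))
    return {"median": med, "trimmed_mean": trimmed_mean,
            "winsorized_mean": winsorized.mean(), "mad": mad}
```

On a sample with one extreme value, all four summaries stay near the bulk of the data while the plain mean and standard deviation would not; a robust covariance estimator (e.g. scikit-learn's `MinCovDet`) extends the same idea to multivariate scale.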
Hard · Technical
Explain methods to discover non-linear and high-order interactions between features during EDA: mutual information, decision-tree-derived split importance, partial dependence and ICE plots, SHAP interaction values, and binned pairwise analysis. Provide a recommended workflow for prioritizing interactions to attempt in feature engineering for a medium-sized tabular dataset.
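As a starting point for the mutual-information and binned-pairwise parts of this question, here is a hedged numpy sketch of a histogram-based MI estimate between two numeric features (the function name and bin count are illustrative; `sklearn.feature_selection.mutual_info_regression` offers a less biased estimator in practice):

```python
import numpy as np

def binned_mutual_info(x, y, bins=10):
    """Histogram-based mutual information estimate (in nats) between
    two numeric features: MI = sum p(x,y) * log(p(x,y) / (p(x)p(y)))."""
    counts, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = counts / counts.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)   # marginal of y, shape (1, bins)
    nz = pxy > 0                          # skip empty cells (0 * log 0 -> 0)
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())
```

Because the binned estimate is biased upward on small samples, it works best for ranking candidate feature pairs rather than as an absolute dependence measure; the tree-based and SHAP methods in the question then confirm whether a high-MI pair reflects a usable interaction.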
Medium · Technical
Write efficient pandas code to compute per-user aggregates from a large DataFrame transactions (user_id, amount, occurred_at): total_transactions, mean_amount, median_amount, std_amount, last_transaction_date. The code should handle missing amounts (treat as zero or ignore based on a parameter), set appropriate dtypes, and be memory-aware for 50M rows (hint: use chunking or groupby on categorical user_id).
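A possible shape for the answer, using pandas named aggregation; the function name `user_aggregates` and the `missing` parameter spelling are illustrative. Casting `user_id` to categorical keeps the groupby keys compact, and `observed=True` avoids materializing unused categories:

```python
import numpy as np
import pandas as pd

def user_aggregates(transactions, missing="ignore"):
    """Per-user aggregates over (user_id, amount, occurred_at).
    missing='ignore' skips NaN amounts (pandas default);
    missing='zero' treats them as 0 before aggregating."""
    df = transactions.copy()
    df["user_id"] = df["user_id"].astype("category")  # compact groupby keys
    df["amount"] = df["amount"].astype("float64")
    if missing == "zero":
        df["amount"] = df["amount"].fillna(0.0)
    g = df.groupby("user_id", observed=True)
    return g.agg(
        total_transactions=("amount", "size"),   # size counts NaN rows too
        mean_amount=("amount", "mean"),
        median_amount=("amount", "median"),
        std_amount=("amount", "std"),
        last_transaction_date=("occurred_at", "max"),
    )
```

For 50M rows that don't fit comfortably in memory, the same aggregation can be run per chunk (everything except the median composes from partial sums and counts; an exact median needs either a second pass or a sketch such as t-digest).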
Medium · Technical
Write PostgreSQL SQL that computes a 7-day moving average and flags anomaly days in table daily_metrics(date DATE, metric_value DOUBLE PRECISION) where the metric deviates by more than 3 standard deviations from the trailing 30-day mean. Return columns: date, metric_value, moving_avg_7d, trailing_mean_30d, trailing_std_30d, z_score, is_anomaly. Use window functions and explain performance considerations on large tables.
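Before writing the SQL, it can help to prototype the same window logic in pandas; this hedged sketch (the function name `flag_anomalies` is illustrative) mirrors what the PostgreSQL window frames should compute, with the trailing 30-day window excluding the current row so today's value cannot mask itself:

```python
import numpy as np
import pandas as pd

def flag_anomalies(daily):
    """daily: DataFrame with columns (date, metric_value).
    Computes a 7-day moving average, trailing 30-day mean/std
    (excluding the current row), a z-score, and a 3-sigma flag."""
    df = daily.sort_values("date").reset_index(drop=True)
    df["moving_avg_7d"] = df["metric_value"].rolling(7, min_periods=1).mean()
    shifted = df["metric_value"].shift(1)  # trailing window excludes today
    df["trailing_mean_30d"] = shifted.rolling(30, min_periods=5).mean()
    df["trailing_std_30d"] = shifted.rolling(30, min_periods=5).std()
    df["z_score"] = (df["metric_value"] - df["trailing_mean_30d"]) / df["trailing_std_30d"]
    df["is_anomaly"] = df["z_score"].abs() > 3
    return df
```

In SQL the two windows become `ROWS BETWEEN 6 PRECEDING AND CURRENT ROW` and `ROWS BETWEEN 30 PRECEDING AND 1 PRECEDING` over `ORDER BY date`; on large tables the main performance considerations are an index (or sort) on `date` and the single-sort cost shared by window functions with the same ordering.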