InterviewStack.io

Exploratory Data Analysis Questions

Exploratory Data Analysis is the systematic process of investigating and validating a dataset to understand its structure, content, and quality before modelling or reporting. Core activities include examining schema and data types, computing descriptive statistics such as counts, means, medians, standard deviations and quartiles, and measuring class balance and unique value counts.

It also covers distribution analysis, outlier detection, correlation and relationship exploration, and trend or seasonality checks for time series. Data validation and quality checks include identifying missing values, anomalies, inconsistent encodings, duplicates, and other data integrity issues.

Practical techniques span SQL profiling and aggregation queries using GROUP BY, COUNT and DISTINCT; interactive exploration with pandas and similar libraries; and visualization with histograms, box plots, scatter plots, heatmaps and time series charts to reveal patterns and issues. The process also includes feature summaries and basic metric computation, sampling strategies, forming and documenting hypotheses, and recommending cleaning or transformation steps. Good Exploratory Data Analysis produces a clear record of findings, assumptions to validate, and next steps for cleaning, feature engineering, or modelling.
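Many of the profiling steps above (schema and type inspection, descriptive statistics, missing-value and unique-value counts, class balance) can be sketched in a few lines of pandas. The DataFrame below is an invented toy example, not data from any question:

```python
import pandas as pd

# Hypothetical toy dataset for illustration only
df = pd.DataFrame({
    "category": ["a", "b", "a", None, "b", "a"],
    "value": [1.0, 2.5, 3.0, 4.0, 100.0, 2.0],
})

# Schema and data types
print(df.dtypes)

# Descriptive statistics: count, mean, std, quartiles
print(df["value"].describe())

# Missing values and unique-value counts per column
print(df.isna().sum())
print(df.nunique())

# Class balance: the pandas analogue of SQL GROUP BY ... COUNT(*)
print(df["category"].value_counts(dropna=False))
```

The `value_counts(dropna=False)` call keeps missing categories visible, which matters when class balance and missingness interact.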

Medium · Technical
You have two CSV files (gold_labels.csv and annotator_labels.csv) for a text classification task, both with columns ['id','label']. Write Python code (pandas + scikit-learn) to merge the files and compute a confusion matrix and Cohen's kappa score. Explain how to interpret kappa values and what practical thresholds you would use to flag low-quality annotations, and describe next steps if agreement is low (adjudication or retraining annotators).
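A minimal sketch of the merge-and-score part of an answer might look as follows. The function name is invented; it assumes the column layout the question gives. For interpretation, the commonly cited Landis–Koch bands treat kappa below roughly 0.4 as poor-to-fair, 0.4–0.6 as moderate, 0.6–0.8 as substantial, and above 0.8 as near-perfect agreement, though thresholds should be set per task:

```python
import pandas as pd
from sklearn.metrics import confusion_matrix, cohen_kappa_score

def annotation_agreement(gold_path, annot_path):
    """Merge gold and annotator labels on 'id'; return confusion matrix,
    Cohen's kappa, and the sorted label set.

    Assumes both files have columns ['id', 'label'] as in the question.
    """
    gold = pd.read_csv(gold_path)
    annot = pd.read_csv(annot_path)
    # Inner join keeps only ids labelled by both sources
    merged = gold.merge(annot, on="id", suffixes=("_gold", "_annot"))
    labels = sorted(set(merged["label_gold"]) | set(merged["label_annot"]))
    cm = confusion_matrix(merged["label_gold"], merged["label_annot"],
                          labels=labels)
    kappa = cohen_kappa_score(merged["label_gold"], merged["label_annot"])
    return cm, kappa, labels
```

An inner join is a deliberate choice here: ids missing from either file are themselves a data-quality finding worth reporting separately.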
Hard · Technical
You inherit a named-entity recognition (NER) dataset annotated in BIO format but find overlapping spans, inconsistent entity types, and non-canonical whitespace. Outline an EDA plan to detect schema violations and quantify their frequency (e.g., percent of examples with overlapping spans), provide pandas/regex checks to detect common issues, and recommend remediation steps (automatic normalization, rule-based fixes, or re-annotation/adjudication).
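One of the pandas/regex checks the question asks for, sketched for a single tag sequence; the function name is invented, and checks for overlapping spans and non-canonical whitespace would follow the same per-example pattern:

```python
import re

def bio_schema_issues(tags):
    """Flag common BIO-format violations in one example's tag sequence.

    tags: a list like ['B-PER', 'I-PER', 'O'].
    Returns a list of issue strings; an empty list means no violation found.
    """
    issues = []
    # Valid tags: 'O', or B-/I- followed by an upper-case entity type
    pattern = re.compile(r"^(O|[BI]-[A-Z_]+)$")
    prev = "O"
    for i, tag in enumerate(tags):
        if not pattern.match(tag):
            issues.append(f"malformed tag {tag!r} at position {i}")
            prev = "O"
            continue
        if tag.startswith("I-"):
            # I-X must continue a span of the same entity type
            if prev == "O" or prev.split("-", 1)[1] != tag.split("-", 1)[1]:
                issues.append(f"dangling or type-switching {tag} at position {i}")
        prev = tag
    return issues
```

Applying this per row and aggregating (e.g., `df["tags"].map(bio_schema_issues).str.len()`) gives the percent-of-examples-affected numbers the plan should report.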
Medium · Technical
Design an EDA plan for a corpus of 10M text documents intended to fine-tune a Transformer: specify sampling strategies (reservoir sampling, stratified by source/date), token-length distribution analysis, outlier detection for extremely long documents, vocabulary coverage and rare-token analysis, compute/tokenization cost estimates, and storage/compute considerations for running analyses at scale.
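The reservoir-sampling piece of such a plan is standard (Algorithm R) and lets you profile token-length distributions on a bounded sample without holding 10M documents in memory. A minimal sketch, with an invented function name:

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Uniform random sample of k items from an iterable of unknown length.

    Classic Algorithm R: item i (0-based) replaces a reservoir slot with
    probability k / (i + 1), giving every item equal inclusion probability.
    """
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = rng.randint(0, i)
            if j < k:
                sample[j] = item
    return sample
```

Stratification by source or date can be layered on by keeping one reservoir per stratum; length statistics are then computed on the sample as a proxy for the full corpus.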
Hard · Technical
You suspect part of your training set was adversarially poisoned (e.g., label flips, trigger patterns). During EDA, what signatures would you look for to detect poisoning (clusters of high-loss training examples, abnormal label-feature correlations, near-duplicate inputs with conflicting labels), and what mitigations would you propose (filtering by anomaly score, robust training, human inspection, or influence-function analysis)?
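One signature the question names, near-duplicate inputs with conflicting labels, can be screened cheaply with a normalized-text fingerprint. This is a deliberately minimal sketch with an invented function name; a real pipeline would use MinHash or SimHash rather than exact hashing of normalized text:

```python
import hashlib
from collections import defaultdict

def conflicting_near_duplicates(examples):
    """Group (text, label) pairs by a whitespace/case-normalized hash and
    return the label lists of groups whose labels disagree.

    Exact normalized hashing only catches trivial near-duplicates; it is a
    first-pass filter, not a substitute for fuzzy matching.
    """
    groups = defaultdict(list)
    for text, label in examples:
        normalized = " ".join(text.lower().split())
        key = hashlib.sha1(normalized.encode()).hexdigest()
        groups[key].append(label)
    return [labels for labels in groups.values() if len(set(labels)) > 1]
```

Flagged groups are natural candidates for human inspection or for ranking by anomaly score before filtering.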
Hard · Technical
Implement a Python function that computes Mahalanobis distance-based outliers for a numeric DataFrame X. Use a robust covariance estimator (e.g., sklearn.covariance.MinCovDet) to compute center and covariance, compute squared Mahalanobis distances, and flag rows with distance > chi2.ppf(0.975, df=n_features). Explain limitations and alternatives for high-dimensional data.
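The function the question specifies might be sketched as follows; `mahalanobis_outliers` is an invented name, and the 0.975 chi-square cutoff follows the prompt. Note that sklearn's `MinCovDet.mahalanobis` already returns squared distances, so it is compared directly against the chi-square quantile:

```python
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

def mahalanobis_outliers(X, alpha=0.975, random_state=0):
    """Flag rows of a numeric array/DataFrame as outliers using a robust
    (Minimum Covariance Determinant) Mahalanobis distance.

    Returns (flags, squared_distances); flags[i] is True where the squared
    distance exceeds chi2.ppf(alpha, df=n_features).
    """
    X = np.asarray(X, dtype=float)
    mcd = MinCovDet(random_state=random_state).fit(X)
    d2 = mcd.mahalanobis(X)  # squared distances to the robust center
    threshold = chi2.ppf(alpha, df=X.shape[1])
    return d2 > threshold, d2
```

As the question hints, this breaks down in high dimensions: MCD needs more rows than features and covariance estimation degrades, so alternatives like shrinkage estimators, PCA before distancing, or Isolation Forest are worth naming.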
