InterviewStack.io

Model Evaluation and Validation Questions

Comprehensive coverage of how to measure, validate, debug, and monitor machine learning model performance across problem types and throughout the development lifecycle. Candidates should be able to select and justify appropriate evaluation metrics for classification, regression, object detection, and natural language tasks, including accuracy, precision, recall, F1 score, ROC AUC, mean squared error (MSE), mean absolute error (MAE), root mean squared error (RMSE), R², intersection over union (IoU), and mean average precision (mAP), and to describe language-task metrics such as token overlap and perplexity. They should be able to interpret confusion matrices and calibration, perform threshold selection and cost-sensitive decision analysis, and explain the business implications of false positives and false negatives.

Validation and testing strategies include train/test splits, holdout test sets, k-fold cross-validation, stratified sampling, and temporal splits for time series, as well as baseline comparisons, champion/challenger evaluation, offline versus online evaluation, and online randomized experiments.

Candidates should demonstrate techniques to detect and mitigate overfitting and underfitting, including learning curves, validation curves, regularization, early stopping, data augmentation, and class-imbalance handling, and should be able to debug failing models by investigating data quality, label noise, feature engineering, model training dynamics, and evaluation leakage.

The topic also covers model interpretability and limitations, robustness and adversarial considerations, fairness and bias assessment, continuous validation and monitoring in production for concept drift and data drift, practical testing approaches including unit tests for preprocessing and integration tests for pipelines, monitoring and alerting, and producing clear metric reporting tied to business objectives.
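As a warm-up for the validation strategies listed above, here is a minimal sketch of a stratified k-fold split in plain Python, showing why stratification matters on imbalanced labels. The function name and toy data are illustrative, not from any particular library:

```python
# A minimal sketch of a stratified k-fold split in stdlib Python,
# illustrating why stratification matters on imbalanced labels.
import random
from collections import defaultdict

def stratified_kfold(labels, k, seed=0):
    """Yield (train_idx, test_idx) pairs that preserve class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for j, i in enumerate(idxs):
            folds[j % k].append(i)   # deal indices round-robin per class
    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, test

labels = [1] * 10 + [0] * 90          # 10% positive class
for train, test in stratified_kfold(labels, k=5):
    pos = sum(labels[i] for i in test)
    print(f"test size={len(test)}, positives={pos}")
```

With a naive random split, some folds of a 10%-positive dataset could end up with zero positives; the round-robin deal per class guarantees each fold receives its proportional share.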

Medium · Technical
Describe uplift modeling and how it differs conceptually from standard classification. As a BI Analyst running marketing campaigns, explain what evaluation data and metrics (e.g., Qini, uplift at decile, incremental revenue) you need to prove that targeting by uplift increases ROI versus a model predicting conversion probability.
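One way to ground an answer: unlike a conversion classifier, an uplift model is evaluated on the *difference* between treated and control outcomes within each score bucket, which requires randomized campaign data. A minimal sketch of an uplift-at-decile table (field names and the toy data are illustrative):

```python
# A hypothetical sketch of an uplift-at-decile table from randomized
# campaign data: each record is (uplift_score, treated, converted).
def uplift_by_decile(records, n_bins=10):
    """Return the incremental conversion rate per decile of uplift score."""
    ranked = sorted(records, key=lambda r: r[0], reverse=True)
    size = len(ranked) // n_bins
    table = []
    for b in range(n_bins):
        chunk = ranked[b * size:(b + 1) * size]
        t = [r for r in chunk if r[1]]
        c = [r for r in chunk if not r[1]]
        rate_t = sum(r[2] for r in t) / max(len(t), 1)
        rate_c = sum(r[2] for r in c) / max(len(c), 1)
        table.append(rate_t - rate_c)   # treated rate minus control rate
    return table

# Toy data: only the top-scored 20% are genuinely persuadable.
records = []
for i in range(1000):
    score = 1 - i / 1000
    treated = i % 2 == 0               # randomized 50/50 assignment
    converted = treated and score > 0.8
    records.append((score, treated, int(converted)))

print(uplift_by_decile(records))  # uplift concentrated in the top deciles
```

Multiplying each decile's incremental rate by its audience size and margin per conversion gives incremental revenue; a Qini curve is the cumulative version of the same comparison.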
Hard · Technical
A lending model shows disparate denial rates for a protected group. As a BI Analyst, propose a fairness evaluation plan: which fairness metrics (demographic parity, equalized odds, equal opportunity) you would compute, how you would test the statistical significance of the disparities, and at least two mitigation strategies with their business trade-offs.
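For the metric-computation part of such a plan, a minimal sketch (hypothetical field names, toy data) of demographic parity and equal-opportunity gaps from grouped decisions:

```python
# A minimal sketch: demographic parity gap (difference in approval rates)
# and equal opportunity gap (difference in true positive rates) by group.
def fairness_gaps(rows):
    """rows: list of (group, y_true, y_pred) with binary labels/decisions."""
    groups = sorted({g for g, _, _ in rows})
    approval, tpr = {}, {}
    for g in groups:
        sub = [(yt, yp) for gg, yt, yp in rows if gg == g]
        approval[g] = sum(yp for _, yp in sub) / len(sub)
        pos = [yp for yt, yp in sub if yt == 1]
        tpr[g] = sum(pos) / len(pos) if pos else float("nan")
    dp_gap = max(approval.values()) - min(approval.values())
    eo_gap = max(tpr.values()) - min(tpr.values())
    return dp_gap, eo_gap

rows = [("A", 1, 1), ("A", 1, 1), ("A", 0, 0), ("A", 0, 1),
        ("B", 1, 1), ("B", 1, 0), ("B", 0, 0), ("B", 0, 0)]
dp, eo = fairness_gaps(rows)
print(f"demographic parity gap={dp:.2f}, equal opportunity gap={eo:.2f}")
```

In practice these point estimates would be paired with a significance test (e.g. a two-proportion z-test or bootstrap confidence intervals on each gap) before concluding that a disparity is real rather than sampling noise.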
Medium · Technical
You are tuning a fraud detection classifier where only 1% of transactions are fraudulent. A false negative (missed fraud) costs $1,000; a false positive (blocking a legitimate transaction) costs $50. Describe a quantitative process to select a probability threshold that minimizes expected cost and how you'd validate that choice offline using available labeled data.
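The quantitative process can be sketched as a threshold sweep over held-out scores, scoring each candidate threshold by expected dollar cost. The toy score distribution below is illustrative; the costs are the ones stated in the question:

```python
# A minimal sketch of cost-based threshold selection: sweep candidate
# thresholds over held-out scores and pick the one minimizing expected cost.
COST_FN, COST_FP = 1000.0, 50.0

def expected_cost(scores, labels, threshold):
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return fn * COST_FN + fp * COST_FP

def best_threshold(scores, labels):
    candidates = sorted(set(scores)) + [1.01]   # 1.01 means "flag nothing"
    return min(candidates, key=lambda t: expected_cost(scores, labels, t))

# Toy validation set: 1% fraud, frauds tend to score higher.
labels = [1] * 10 + [0] * 990
scores = ([0.7 + 0.02 * i for i in range(10)]
          + [0.3 - 0.0003 * i for i in range(990)])
t = best_threshold(scores, labels)
print(f"chosen threshold={t:.2f}, cost={expected_cost(scores, labels, t):.0f}")
```

Note that if the model's probabilities were perfectly calibrated, the closed-form optimum would be to flag whenever p × $1,000 > (1 − p) × $50, i.e. p > 50/1050 ≈ 0.048; the empirical sweep is the safer choice when calibration is uncertain, and cross-validation of the chosen threshold guards against overfitting it to one split.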
Easy · Technical
Contrast Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and R-squared (R²) for regression problems from both a mathematical and business-impact perspective. For predicting delivery times (minutes), which metric would you recommend and why?
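A worked toy example makes the mathematical contrast concrete: one long-tail delivery inflates RMSE far more than MAE, because RMSE squares each error before averaging. The numbers below are illustrative:

```python
# MAE, RMSE, and R² from scratch for a toy delivery-time example (minutes).
import math

def mae(y, p):
    return sum(abs(a - b) for a, b in zip(y, p)) / len(y)

def rmse(y, p):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, p)) / len(y))

def r2(y, p):
    mean = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, p))
    ss_tot = sum((a - mean) ** 2 for a in y)
    return 1 - ss_res / ss_tot

actual = [30, 25, 40, 35, 90]        # minutes; one long-tail delivery
pred   = [28, 27, 38, 36, 60]        # the model badly misses the outlier
print(f"MAE={mae(actual, pred):.1f} min, "
      f"RMSE={rmse(actual, pred):.1f} min, "
      f"R²={r2(actual, pred):.2f}")   # MAE=7.4, RMSE≈13.5
```

MAE and RMSE are both in minutes, so they communicate directly to operations stakeholders; the gap between them (7.4 vs roughly 13.5 here) signals outlier-dominated errors, while the unitless R² only says what fraction of variance is explained.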
Medium · Technical
You have the following ground-truth and predictions for a single image: GT boxes: [(10,10,50,50), (60,60,100,100)] ; Predictions (box, score): [((12,12,48,48),0.9), ((11,11,51,51),0.6), ((60,60,100,100),0.4)]. Compute IoU for candidate matches and then compute AP for IoU threshold 0.5 (manually). Show your steps for sorting by score, matching, and computing precision/recall points used for AP.
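The manual steps the question asks for (sort by score, greedily match each prediction to the best unmatched ground truth above the IoU threshold, then accumulate precision/recall points) can be sketched as below. This uses the all-point-interpolated AP convention; other conventions (e.g. 11-point interpolation) give slightly different numbers:

```python
# A minimal sketch of greedy matching by score plus all-point
# interpolated AP at IoU 0.5, on the boxes from the question.
def iou(a, b):
    """Boxes as (x1, y1, x2, y2)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def average_precision(gts, preds, thr=0.5):
    preds = sorted(preds, key=lambda p: p[1], reverse=True)
    matched, tps = set(), []
    for box, _ in preds:
        best, best_iou = None, thr
        for i, gt in enumerate(gts):          # best unmatched GT above thr
            v = iou(box, gt)
            if i not in matched and v >= best_iou:
                best, best_iou = i, v
        if best is not None:
            matched.add(best)
            tps.append(1)                      # true positive
        else:
            tps.append(0)                      # duplicate or low-IoU: FP
    points, tp = [], 0
    for k, t in enumerate(tps, 1):
        tp += t
        points.append((tp / len(gts), tp / k))  # (recall, precision)
    ap, prev_r = 0.0, 0.0
    for r, _ in points:                         # precision envelope
        p_interp = max(p for rr, p in points if rr >= r)
        ap += (r - prev_r) * p_interp
        prev_r = r
    return ap

gts = [(10, 10, 50, 50), (60, 60, 100, 100)]
preds = [((12, 12, 48, 48), 0.9), ((11, 11, 51, 51), 0.6),
         ((60, 60, 100, 100), 0.4)]
print(round(average_precision(gts, preds), 4))
```

On this data the 0.9-score box matches GT1 (IoU 1296/1600 = 0.81), the 0.6-score box is a duplicate of an already-matched GT and counts as a false positive, and the 0.4-score box matches GT2 exactly, giving precision/recall points (1.0, 0.5), (0.5, 0.5), (2/3, 1.0) and an interpolated AP of 5/6 ≈ 0.8333.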
