InterviewStack.io

Model Evaluation and Validation Questions

Comprehensive coverage of how to measure, validate, debug, and monitor machine learning model performance across problem types and throughout the development lifecycle. Candidates should be able to select and justify appropriate evaluation metrics for classification, regression, object detection, and natural language tasks, including accuracy, precision, recall, F1 score, ROC AUC, mean squared error (MSE), mean absolute error (MAE), root mean squared error (RMSE), R², intersection over union (IoU), and mean average precision (mAP), and to describe language-task metrics such as token overlap and perplexity. They should be able to interpret confusion matrices and calibration, perform threshold selection and cost-sensitive decision analysis, and explain the business implications of false positives and false negatives.

Validation and testing strategies include train/test splits, holdout test sets, k-fold cross-validation, stratified sampling, and temporal splits for time series, as well as baseline comparisons, champion/challenger evaluation, offline versus online evaluation, and online randomized experiments.

Candidates should demonstrate techniques to detect and mitigate overfitting and underfitting, including learning curves, validation curves, regularization, early stopping, data augmentation, and class-imbalance handling, and should be able to debug failing models by investigating data quality, label noise, feature engineering, model training dynamics, and evaluation leakage.

The topic also covers model interpretability and limitations, robustness and adversarial considerations, fairness and bias assessment, continuous validation and monitoring in production for concept drift and data drift, practical testing approaches including unit tests for preprocessing and integration tests for pipelines, monitoring and alerting, and producing clear metric reporting tied to business objectives.
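As a warm-up for the validation strategies listed above, here is a minimal sketch of a stratified k-fold split in plain Python, showing why stratification matters on imbalanced labels. The function name and toy data are illustrative, not from any particular library:

```python
# A minimal sketch of a stratified k-fold split in stdlib Python,
# illustrating why stratification matters on imbalanced labels.
import random
from collections import defaultdict

def stratified_kfold(labels, k, seed=0):
    """Yield (train_idx, test_idx) pairs that preserve class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for j, i in enumerate(idxs):
            folds[j % k].append(i)   # deal indices round-robin per class
    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, test

labels = [1] * 10 + [0] * 90          # 10% positive class
for train, test in stratified_kfold(labels, k=5):
    pos = sum(labels[i] for i in test)
    print(f"test size={len(test)}, positives={pos}")
```

With a naive random split, some folds of a 10%-positive dataset could end up with zero positives; the round-robin deal per class guarantees each fold receives its proportional share.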

Medium · Technical
Describe uplift modeling and how it differs conceptually from standard classification. As a BI Analyst running marketing campaigns, explain what evaluation data and metrics (e.g., Qini, uplift at decile, incremental revenue) you need to prove that targeting by uplift increases ROI versus a model predicting conversion probability.
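One way to ground an answer: unlike a conversion classifier, an uplift model is evaluated on the *difference* between treated and control outcomes within each score bucket, which requires randomized campaign data. A minimal sketch of an uplift-at-decile table (field names and the toy data are illustrative):

```python
# A hypothetical sketch of an uplift-at-decile table from randomized
# campaign data: each record is (uplift_score, treated, converted).
def uplift_by_decile(records, n_bins=10):
    """Return the incremental conversion rate per decile of uplift score."""
    ranked = sorted(records, key=lambda r: r[0], reverse=True)
    size = len(ranked) // n_bins
    table = []
    for b in range(n_bins):
        chunk = ranked[b * size:(b + 1) * size]
        t = [r for r in chunk if r[1]]
        c = [r for r in chunk if not r[1]]
        rate_t = sum(r[2] for r in t) / max(len(t), 1)
        rate_c = sum(r[2] for r in c) / max(len(c), 1)
        table.append(rate_t - rate_c)   # treated rate minus control rate
    return table

# Toy data: only the top-scored 20% are genuinely persuadable.
records = []
for i in range(1000):
    score = 1 - i / 1000
    treated = i % 2 == 0               # randomized 50/50 assignment
    converted = treated and score > 0.8
    records.append((score, treated, int(converted)))

print(uplift_by_decile(records))  # uplift concentrated in the top deciles
```

Multiplying each decile's incremental rate by its audience size and margin per conversion gives incremental revenue; a Qini curve is the cumulative version of the same comparison.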
Hard · Technical
A lending model shows disparate denial rates for a protected group. As a BI Analyst, propose a fairness evaluation plan: which fairness metrics (demographic parity, equalized odds, equal opportunity) you would compute, how you would test the statistical significance of the disparities, and at least two mitigation strategies with their business trade-offs.
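For the metric-computation part of such a plan, a minimal sketch (hypothetical field names, toy data) of demographic parity and equal-opportunity gaps from grouped decisions:

```python
# A minimal sketch: demographic parity gap (difference in approval rates)
# and equal opportunity gap (difference in true positive rates) by group.
def fairness_gaps(rows):
    """rows: list of (group, y_true, y_pred) with binary labels/decisions."""
    groups = sorted({g for g, _, _ in rows})
    approval, tpr = {}, {}
    for g in groups:
        sub = [(yt, yp) for gg, yt, yp in rows if gg == g]
        approval[g] = sum(yp for _, yp in sub) / len(sub)
        pos = [yp for yt, yp in sub if yt == 1]
        tpr[g] = sum(pos) / len(pos) if pos else float("nan")
    dp_gap = max(approval.values()) - min(approval.values())
    eo_gap = max(tpr.values()) - min(tpr.values())
    return dp_gap, eo_gap

rows = [("A", 1, 1), ("A", 1, 1), ("A", 0, 0), ("A", 0, 1),
        ("B", 1, 1), ("B", 1, 0), ("B", 0, 0), ("B", 0, 0)]
dp, eo = fairness_gaps(rows)
print(f"demographic parity gap={dp:.2f}, equal opportunity gap={eo:.2f}")
```

In practice these point estimates would be paired with a significance test (e.g. a two-proportion z-test or bootstrap confidence intervals on each gap) before concluding that a disparity is real rather than sampling noise.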
Medium · Technical
You are tuning a fraud detection classifier where only 1% of transactions are fraudulent. A false negative (missed fraud) costs $1,000; a false positive (blocking a legitimate transaction) costs $50. Describe a quantitative process to select a probability threshold that minimizes expected cost and how you'd validate that choice offline using available labeled data.
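The quantitative process can be sketched as a threshold sweep over held-out scores, scoring each candidate threshold by expected dollar cost. The toy score distribution below is illustrative; the costs are the ones stated in the question:

```python
# A minimal sketch of cost-based threshold selection: sweep candidate
# thresholds over held-out scores and pick the one minimizing expected cost.
COST_FN, COST_FP = 1000.0, 50.0

def expected_cost(scores, labels, threshold):
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return fn * COST_FN + fp * COST_FP

def best_threshold(scores, labels):
    candidates = sorted(set(scores)) + [1.01]   # 1.01 means "flag nothing"
    return min(candidates, key=lambda t: expected_cost(scores, labels, t))

# Toy validation set: 1% fraud, frauds tend to score higher.
labels = [1] * 10 + [0] * 990
scores = ([0.7 + 0.02 * i for i in range(10)]
          + [0.3 - 0.0003 * i for i in range(990)])
t = best_threshold(scores, labels)
print(f"chosen threshold={t:.2f}, cost={expected_cost(scores, labels, t):.0f}")
```

Note that if the model's probabilities were perfectly calibrated, the closed-form optimum would be to flag whenever p × $1,000 > (1 − p) × $50, i.e. p > 50/1050 ≈ 0.048; the empirical sweep is the safer choice when calibration is uncertain, and cross-validation of the chosen threshold guards against overfitting it to one split.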
Easy · Technical
Contrast Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and R-squared (R²) for regression problems from both a mathematical and business-impact perspective. For predicting delivery times (minutes), which metric would you recommend and why?
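A worked toy example makes the mathematical contrast concrete: one long-tail delivery inflates RMSE far more than MAE, because RMSE squares each error before averaging. The numbers below are illustrative:

```python
# MAE, RMSE, and R² from scratch for a toy delivery-time example (minutes).
import math

def mae(y, p):
    return sum(abs(a - b) for a, b in zip(y, p)) / len(y)

def rmse(y, p):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, p)) / len(y))

def r2(y, p):
    mean = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, p))
    ss_tot = sum((a - mean) ** 2 for a in y)
    return 1 - ss_res / ss_tot

actual = [30, 25, 40, 35, 90]        # minutes; one long-tail delivery
pred   = [28, 27, 38, 36, 60]        # the model badly misses the outlier
print(f"MAE={mae(actual, pred):.1f} min, "
      f"RMSE={rmse(actual, pred):.1f} min, "
      f"R²={r2(actual, pred):.2f}")   # MAE=7.4, RMSE≈13.5
```

MAE and RMSE are both in minutes, so they communicate directly to operations stakeholders; the gap between them (7.4 vs roughly 13.5 here) signals outlier-dominated errors, while the unitless R² only says what fraction of variance is explained.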
Medium · Technical
You have the following ground-truth and predictions for a single image: GT boxes: [(10,10,50,50), (60,60,100,100)] ; Predictions (box, score): [((12,12,48,48),0.9), ((11,11,51,51),0.6), ((60,60,100,100),0.4)]. Compute IoU for candidate matches and then compute AP for IoU threshold 0.5 (manually). Show your steps for sorting by score, matching, and computing precision/recall points used for AP.
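The manual steps the question asks for (sort by score, greedily match each prediction to the best unmatched ground truth above the IoU threshold, then accumulate precision/recall points) can be sketched as below. This uses the all-point-interpolated AP convention; other conventions (e.g. 11-point interpolation) give slightly different numbers:

```python
# A minimal sketch of greedy matching by score plus all-point
# interpolated AP at IoU 0.5, on the boxes from the question.
def iou(a, b):
    """Boxes as (x1, y1, x2, y2)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def average_precision(gts, preds, thr=0.5):
    preds = sorted(preds, key=lambda p: p[1], reverse=True)
    matched, tps = set(), []
    for box, _ in preds:
        best, best_iou = None, thr
        for i, gt in enumerate(gts):          # best unmatched GT above thr
            v = iou(box, gt)
            if i not in matched and v >= best_iou:
                best, best_iou = i, v
        if best is not None:
            matched.add(best)
            tps.append(1)                      # true positive
        else:
            tps.append(0)                      # duplicate or low-IoU: FP
    points, tp = [], 0
    for k, t in enumerate(tps, 1):
        tp += t
        points.append((tp / len(gts), tp / k))  # (recall, precision)
    ap, prev_r = 0.0, 0.0
    for r, _ in points:                         # precision envelope
        p_interp = max(p for rr, p in points if rr >= r)
        ap += (r - prev_r) * p_interp
        prev_r = r
    return ap

gts = [(10, 10, 50, 50), (60, 60, 100, 100)]
preds = [((12, 12, 48, 48), 0.9), ((11, 11, 51, 51), 0.6),
         ((60, 60, 100, 100), 0.4)]
print(round(average_precision(gts, preds), 4))
```

On this data the 0.9-score box matches GT1 (IoU 1296/1600 = 0.81), the 0.6-score box is a duplicate of an already-matched GT and counts as a false positive, and the 0.4-score box matches GT2 exactly, giving precision/recall points (1.0, 0.5), (0.5, 0.5), (2/3, 1.0) and an interpolated AP of 5/6 ≈ 0.8333.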
