InterviewStack.io

Model Evaluation and Quality Assessment Questions

Covers evaluation methods, metrics, and quality assessment approaches for machine learning models, including both predictive and generative models. Topics include selecting appropriate metrics such as accuracy, precision, recall, F1 score, area under the curve (AUC) for ranking, and root mean square error (RMSE) and mean absolute percentage error (MAPE) for regression, along with the rationale for using multiple metrics and baselines. For generative models and large language models, covers automatic metrics such as BLEU, ROUGE, METEOR, and semantic similarity scores; LLM-based evaluation techniques; human evaluation frameworks; factuality and hallucination checking; adversarial and stress testing; error analysis; and designing scalable, cost-effective evaluation pipelines and quality assurance processes.

Easy · Technical
Explain precision, recall, specificity (true negative rate), and F1 score for binary classification. For each metric, state the formula using TP, FP, TN, FN; describe a scenario where it is the most important metric; and give one limitation. Provide a small confusion-matrix example (numbers) and compute all four metrics from it.
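As a worked illustration, the sketch below uses a hypothetical confusion matrix (TP=40, FP=10, TN=45, FN=5; 100 samples total) and computes all four metrics directly from their formulas:

```python
# Hypothetical confusion matrix for a binary classifier (100 samples).
TP, FP, TN, FN = 40, 10, 45, 5

precision = TP / (TP + FP)               # 40/50 = 0.800
recall = TP / (TP + FN)                  # 40/45 ≈ 0.889 (sensitivity / TPR)
specificity = TN / (TN + FP)             # 45/55 ≈ 0.818 (TNR)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean ≈ 0.842

print(f"precision={precision:.3f} recall={recall:.3f} "
      f"specificity={specificity:.3f} f1={f1:.3f}")
```

Note that specificity is the only one of the four that uses TN, which is why precision/recall/F1 alone can look fine on a heavily imbalanced negative class.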
Easy · Technical
For a multi-label classification problem (each sample may have multiple labels), explain micro vs macro averaging for precision and recall and when Hamming loss is useful. Provide a short example with two samples and three possible labels to illustrate how micro/macro differ.
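A minimal sketch of how micro and macro averaging diverge, using a made-up two-sample, three-label example (labels A, B, C); treating an undefined 0/0 ratio as 0 is an assumed convention here:

```python
# Hypothetical multi-label data: 2 samples, label set {A, B, C}.
labels = ["A", "B", "C"]
y_true = [{"A", "B"}, {"B"}]
y_pred = [{"A", "C"}, {"B", "C"}]

# Per-label TP/FP/FN counts pooled across samples.
tp = {l: 0 for l in labels}
fp = {l: 0 for l in labels}
fn = {l: 0 for l in labels}
for t, p in zip(y_true, y_pred):
    for l in labels:
        if l in p and l in t:
            tp[l] += 1
        elif l in p:
            fp[l] += 1
        elif l in t:
            fn[l] += 1

def safe_div(a, b):
    return a / b if b else 0.0  # convention: undefined ratios count as 0

# Macro: average the per-label metrics (each label weighted equally).
macro_p = sum(safe_div(tp[l], tp[l] + fp[l]) for l in labels) / len(labels)
macro_r = sum(safe_div(tp[l], tp[l] + fn[l]) for l in labels) / len(labels)

# Micro: pool the counts first (each individual decision weighted equally).
TP, FP, FN = sum(tp.values()), sum(fp.values()), sum(fn.values())
micro_p, micro_r = TP / (TP + FP), TP / (TP + FN)

# Hamming loss: fraction of wrong per-label decisions (set symmetric diff).
errors = sum(len(t ^ p) for t, p in zip(y_true, y_pred))
hamming = errors / (len(y_true) * len(labels))

print(macro_p, macro_r, micro_p, micro_r, hamming)
```

Here macro precision is 2/3 while micro precision is 0.5: label C is never truly present, so its zero precision drags the macro average down by a full third, while micro averaging weights it only by its two (wrong) predictions.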
Hard · Technical
Design an evaluation framework to measure fairness for a credit-scoring model across protected groups. Specify which fairness metrics you would compute (statistical parity, equalized odds, predictive parity), how to evaluate in the presence of label bias from historical discrimination, and how to present trade-offs and mitigation strategies to stakeholders.
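One way the group-wise gaps behind those three fairness criteria could be computed; all data below are illustrative toy values, not a substitute for a proper fairness audit:

```python
import numpy as np

# Hypothetical data: y = true repayment outcome, yhat = model approval,
# g = protected group membership ("a" / "b").
y    = np.array([1, 0, 1, 1, 0, 1, 0, 1, 0, 0])
yhat = np.array([1, 0, 1, 0, 0, 1, 1, 1, 0, 0])
g    = np.array(list("aaaaabbbbb"))

def rates(mask):
    """Selection rate, TPR, FPR, and PPV restricted to one group."""
    yt, yp = y[mask], yhat[mask]
    sel = yp.mean()                                  # P(yhat=1 | group)
    tpr = yp[yt == 1].mean()                         # P(yhat=1 | y=1, group)
    fpr = yp[yt == 0].mean()                         # P(yhat=1 | y=0, group)
    ppv = yt[yp == 1].mean()                         # P(y=1 | yhat=1, group)
    return sel, tpr, fpr, ppv

sel_a, tpr_a, fpr_a, ppv_a = rates(g == "a")
sel_b, tpr_b, fpr_b, ppv_b = rates(g == "b")

print("statistical parity gap:", abs(sel_a - sel_b))        # selection rates
print("equalized odds gaps:", abs(tpr_a - tpr_b), abs(fpr_a - fpr_b))
print("predictive parity gap:", abs(ppv_a - ppv_b))
```

A key caveat for the label-bias part of the question: TPR, FPR, and PPV all condition on the observed label `y`, so if historical discrimination corrupted the labels, equalized odds and predictive parity inherit that bias, whereas statistical parity does not use `y` at all.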
Medium · Technical
You have a model predicting multiple correlated continuous targets (e.g., demand per region). How would you evaluate joint predictions versus marginal predictions? Discuss suitable metrics for overall performance, tests to check whether modeling correlations yields value, and how to present results to stakeholders.
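One concrete test for whether modeling correlations adds value: compare the log-likelihood of the prediction residuals under a full-covariance fit versus a diagonal (marginals-only) fit. The sketch below assumes Gaussian residuals and uses synthetic correlated data; on real problems the comparison should be run on held-out data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic residuals for 3 correlated regional-demand targets.
cov_true = np.array([[1.0, 0.8, 0.6],
                     [0.8, 1.0, 0.7],
                     [0.6, 0.7, 1.0]])
resid = rng.multivariate_normal(np.zeros(3), cov_true, size=500)

def gauss_loglik(resid, cov):
    """Average zero-mean Gaussian log-density of residual rows under cov."""
    d = resid.shape[1]
    _, logdet = np.linalg.slogdet(cov)
    quad = np.einsum("ni,ij,nj->n", resid, np.linalg.inv(cov), resid)
    return (-0.5 * (d * np.log(2 * np.pi) + logdet + quad)).mean()

cov_full = np.cov(resid, rowvar=False, bias=True)   # models correlations
cov_diag = np.diag(np.diag(cov_full))               # marginals only

ll_joint = gauss_loglik(resid, cov_full)
ll_marg = gauss_loglik(resid, cov_diag)
print(f"joint LL {ll_joint:.3f} vs marginal LL {ll_marg:.3f}")
```

In-sample, the full-covariance fit can never score worse than the diagonal one; the size of the gap, ideally measured out-of-sample, is what quantifies the value of modeling the correlations.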
Medium · Technical
For a search relevance problem, explain nDCG@k and mean Average Precision (mAP). When would you prefer nDCG over mAP? Illustrate with a small example: for a single query with graded relevance scores [3, 2, 0, 1], compute DCG@4 and nDCG@4 (briefly outline steps).
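A sketch of that computation using the linear-gain DCG formulation (the exponential 2^rel − 1 gain variant is also common and gives different numbers):

```python
import math

def dcg_at_k(rels, k):
    """Linear-gain DCG: sum of rel_i / log2(rank_i + 1), ranks start at 1."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))

rels = [3, 2, 0, 1]                    # graded relevance in ranked order
dcg = dcg_at_k(rels, 4)                # 3/log2(2) + 2/log2(3) + 0 + 1/log2(5)
idcg = dcg_at_k(sorted(rels, reverse=True), 4)  # ideal ordering [3, 2, 1, 0]
ndcg = dcg / idcg
print(f"DCG@4 = {dcg:.4f}, IDCG@4 = {idcg:.4f}, nDCG@4 = {ndcg:.4f}")
```

This gives DCG@4 ≈ 4.693 and nDCG@4 ≈ 0.985: the ranking is nearly ideal, penalized only for placing the relevance-1 document below the relevance-0 one.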
