InterviewStack.io

Metrics, Guardrails, and Evaluation Criteria Questions

Design appropriate success metrics for experiments. Understand primary metrics, secondary metrics, and guardrail metrics. Know how to choose metrics that align with business goals while avoiding unintended consequences.

Easy · Technical
Provide a concrete example where optimizing a single metric leads to unintended behavior (reward-hacking) in a production model. Explain how you would detect that reward-hacking is occurring and propose at least two changes to metrics, evaluation protocol, or monitoring to mitigate the issue.
Hard · Technical
Design metrics and experiments to measure knowledge retention and catastrophic forgetting in a continual learning setup. Include formal definitions for forward and backward transfer, average incremental accuracy, and forgetting scores. Describe dataset construction, baselines, and statistical tests to convincingly show an improvement over existing methods.
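As a starting point for the formal definitions this question asks for, here is a minimal sketch of one common accuracy-matrix formulation. It assumes a square matrix `R` where `R[i][j]` is the test accuracy on task `j` after training on tasks `0..i`, and a list `baseline` holding the accuracy of an untrained model on each task. The function name, dictionary keys, and example numbers are illustrative assumptions, not part of the original question.

```python
from statistics import mean


def continual_metrics(R, baseline):
    """Summarise a continual-learning run from its accuracy matrix.

    R[i][j]     : test accuracy on task j after training on tasks 0..i
    baseline[j] : accuracy of an untrained model on task j (assumed given)
    """
    T = len(R)
    if T < 2 or any(len(row) != T for row in R) or len(baseline) != T:
        raise ValueError("need a square T x T matrix with T >= 2 and T baselines")

    last = R[T - 1]  # accuracies after the final task has been learned
    return {
        # Mean accuracy over all tasks at the end of training.
        "final_avg_accuracy": mean(last),
        # Mean, over training steps, of the accuracy on tasks seen so far.
        "avg_incremental_accuracy": mean(mean(R[i][: i + 1]) for i in range(T)),
        # Backward transfer: change on task j between learning it and the end.
        # Negative values indicate forgetting.
        "backward_transfer": mean(last[j] - R[j][j] for j in range(T - 1)),
        # Forward transfer: accuracy on task j just before training on it,
        # relative to the untrained baseline.
        "forward_transfer": mean(R[j - 1][j] - baseline[j] for j in range(1, T)),
        # Forgetting: best earlier accuracy on task j minus its final accuracy.
        "forgetting": mean(
            max(R[l][j] for l in range(T - 1)) - last[j] for j in range(T - 1)
        ),
    }


# Hypothetical three-task run, for illustration only.
example_R = [
    [0.90, 0.10, 0.10],
    [0.80, 0.85, 0.20],
    [0.70, 0.80, 0.90],
]
example_baseline = [0.10, 0.10, 0.10]
example_metrics = continual_metrics(example_R, example_baseline)
```

Reporting these per random seed, rather than from a single run, gives the paired samples that a significance test against baselines would need.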
Medium · Technical
You need to compare models across accuracy, latency, and energy consumption. Discuss approaches to construct composite metrics or use multi-objective evaluation, explain how Pareto frontiers are constructed and interpreted, and describe how you would select a model for deployment from the Pareto-optimal set given stakeholder constraints.
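To make the Pareto-frontier part of this question concrete, the sketch below filters out dominated models and then picks from the remaining set under stakeholder limits. It assumes accuracy is maximized while latency and energy are minimized. The `Model` fields, the candidate numbers, and the "most accurate feasible model" selection rule are illustrative assumptions; a real selection policy would come from the stakeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    accuracy: float    # higher is better
    latency_ms: float  # lower is better
    energy_j: float    # lower is better


def dominates(a, b):
    """True if a is no worse than b on every objective and better on one."""
    no_worse = (
        a.accuracy >= b.accuracy
        and a.latency_ms <= b.latency_ms
        and a.energy_j <= b.energy_j
    )
    strictly_better = (
        a.accuracy > b.accuracy
        or a.latency_ms < b.latency_ms
        or a.energy_j < b.energy_j
    )
    return no_worse and strictly_better


def pareto_front(models):
    """Keep only models that no other model dominates (O(n^2) scan)."""
    return [m for m in models if not any(dominates(o, m) for o in models)]


def select(front, max_latency_ms, max_energy_j):
    """Pick the most accurate model that meets both hard constraints."""
    feasible = [
        m for m in front
        if m.latency_ms <= max_latency_ms and m.energy_j <= max_energy_j
    ]
    return max(feasible, key=lambda m: m.accuracy) if feasible else None


# Hypothetical candidates, for illustration only.
candidates = [
    Model("A", 0.92, 120.0, 5.0),
    Model("B", 0.90, 40.0, 2.0),
    Model("C", 0.88, 45.0, 2.5),  # dominated by B on all three objectives
    Model("D", 0.85, 15.0, 0.8),
]
front = pareto_front(candidates)
chosen = select(front, max_latency_ms=50.0, max_energy_j=3.0)
print([m.name for m in front])  # ['A', 'B', 'D']
print(chosen.name)              # B
```

Treating latency and energy as hard constraints and accuracy as the objective avoids having to choose weights for a composite score, which is one of the trade-offs the question invites you to discuss.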
Easy · Technical
List and contrast common automated evaluation metrics for generative text models (BLEU, ROUGE, METEOR, BERTScore, perplexity) with human evaluation dimensions (helpfulness, coherence, factuality). Propose a hybrid evaluation protocol for a research comparison that balances scalability and reliability, including sampling and statistical aggregation approaches.
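For the statistical aggregation step, one possible approach is a paired bootstrap over the sampled prompts: both systems are rated on the same items, and the per-item score differences are resampled with replacement to get a confidence interval. The sketch below assumes per-item scores are already available (for example, mean human ratings per prompt). The function name, defaults, and example ratings are illustrative assumptions.

```python
import random
from statistics import mean


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-item difference (A minus B).

    scores_a[i] and scores_b[i] must refer to the same evaluation item.
    Returns (observed_mean_difference, ci_low, ci_high).
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")

    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    rng = random.Random(seed)  # fixed seed so the analysis is reproducible
    n = len(diffs)

    # Resample items with replacement and record the mean difference each time.
    boot = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))

    lo = boot[int((alpha / 2) * n_resamples)]
    hi = boot[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(diffs), lo, hi


# Hypothetical 1-5 helpfulness ratings on the same six prompts.
ratings_a = [4, 5, 4, 5, 3, 4]
ratings_b = [3, 4, 2, 4, 2, 3]
observed, ci_low, ci_high = paired_bootstrap_ci(ratings_a, ratings_b)
```

If the interval excludes zero, the difference is unlikely to be a sampling artifact at the chosen level. With only a handful of items, as in this toy example, the interval is not reliable, which is why the sampling plan matters as much as the test.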
Hard · Technical
Propose a novel, partially automatable metric for open-domain dialogue that jointly captures coherence, helpfulness, and safety while scaling to large evaluations. Define the metric formally (component scores and aggregation), describe automatic signals (e.g., entailment checks, toxicity classifiers, dialogue act predictors), and design a validation study tying the metric to human judgments.
