InterviewStack.io

Online Experimentation and Model Validation Questions

Running experiments in production to validate model changes and measure business impact. Topics include splitting traffic across model variants, canary deployments, and champion-challenger testing; selecting metrics that capture both model performance and business outcomes; performing sample size and test duration calculations; accounting for statistical power and multiple testing adjustments; and handling instrumentation and novelty bias. Candidates should be able to analyze heterogeneous treatment effects, monitor experiments in real time, and design ramping plans and rollback guardrails to protect user experience and business metrics. The topic also covers decision rules for when to rely on offline evaluation versus online experiments, and how to interpret differences between offline model metrics and live user outcomes as part of model validation and deployment strategy.

Easy · Technical
51 practiced
You must compute the per-variant sample size required to detect an absolute uplift of 0.5 percentage points from a baseline conversion rate of 5% with 80% power and two-sided alpha=0.05. Show the formula you would use (normal approximation) and compute the numeric sample size per arm. Explain assumptions and limitations of the approximation.
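One way to work this out is sketched below. It assumes equal allocation, an unpooled variance, independent users, and a treatment rate above baseline; the symbols $p_1$, $p_2$, $\delta$, and $n$ are notation introduced here, not part of the question.

```latex
% Normal-approximation sample size per arm for a difference in proportions
% p_1 = 0.05 (baseline), p_2 = 0.055 (treatment), \delta = p_2 - p_1 = 0.005
n = \frac{\left(z_{1-\alpha/2} + z_{1-\beta}\right)^2 \left[p_1(1-p_1) + p_2(1-p_2)\right]}{\delta^2}

% With z_{0.975} \approx 1.960 and z_{0.80} \approx 0.842:
n \approx \frac{(1.960 + 0.842)^2 \,(0.0475 + 0.051975)}{0.005^2}
  \approx \frac{7.849 \times 0.099475}{0.000025}
  \approx 31{,}231 \text{ per arm}
```

The approximation relies on the sampling distribution of the difference being close to normal, which is reasonable at this sample size and baseline rate but weaker for very rare events. It also ignores the negligible power contribution from the opposite tail, assumes one observation per user, and makes no correction for sequential peeking or multiple comparisons.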
Hard · System Design
43 practiced
Design a real-time anomaly detection system that monitors experiments and raises alerts when statistically significant regressions occur. Describe detection algorithms (CUSUM, EWMA, changepoint detection), how to set thresholds to balance false positives/negatives, and integration with escalation playbooks to pause or rollback traffic.
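As an illustration of one of the detection algorithms named above, here is a minimal one-sided CUSUM sketch for catching a downward shift in a metric. The function name `first_cusum_alarm` and its parameters `target`, `slack`, and `threshold` are hypothetical; in practice the inputs would typically be standardized per-interval metric values, and the threshold would be tuned against historical A/A data to hit an acceptable false-alarm rate.

```python
from typing import Iterable, Optional


def first_cusum_alarm(
    values: Iterable[float],
    target: float,
    slack: float,
    threshold: float,
) -> Optional[int]:
    """Return the index of the first lower-side CUSUM alarm, or None.

    The statistic accumulates how far each value falls below
    (target - slack) and resets to zero whenever it would go negative.
    """
    cumulative = 0.0
    for index, value in enumerate(values):
        # Only drops larger than the slack add to the running sum.
        cumulative = max(0.0, cumulative + (target - value - slack))
        if cumulative > threshold:
            return index
    return None


# Example: a stable metric that drops by one unit from index 10 onward.
series = [0.0] * 10 + [-1.0] * 10
alarm_index = first_cusum_alarm(series, target=0.0, slack=0.5, threshold=2.0)
```

A larger `slack` makes the detector ignore small drifts, while a larger `threshold` trades slower detection for fewer false positives; an alarm would then feed whatever pause or rollback playbook the system defines.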
Medium · Technical
45 practiced
Implement a Python function compute_sample_size(p_baseline, uplift_abs, power, alpha, ratio=1.0) that returns the required sample size per group using the normal approximation for a difference in proportions. Include input validation and an example call. Assume a two-sided test and return integer sample sizes.
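A minimal sketch of one possible answer follows. It assumes `ratio` means treatment size divided by control size, uses the unpooled variance, and returns a `(n_control, n_treatment)` tuple; those choices are assumptions, since the question does not pin them down.

```python
import math
from statistics import NormalDist


def compute_sample_size(p_baseline, uplift_abs, power, alpha, ratio=1.0):
    """Return (n_control, n_treatment) for a two-sided two-proportion z-test.

    Assumption: ratio = n_treatment / n_control.
    """
    p_treatment = p_baseline + uplift_abs
    if not 0.0 < p_baseline < 1.0:
        raise ValueError("p_baseline must be strictly between 0 and 1")
    if uplift_abs == 0:
        raise ValueError("uplift_abs must be non-zero")
    if not 0.0 < p_treatment < 1.0:
        raise ValueError("p_baseline + uplift_abs must be strictly between 0 and 1")
    if not 0.0 < power < 1.0:
        raise ValueError("power must be strictly between 0 and 1")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be strictly between 0 and 1")
    if ratio <= 0:
        raise ValueError("ratio must be positive")

    # Standard normal quantiles for the two-sided test and the target power.
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_power = NormalDist().inv_cdf(power)

    # Variance of the difference, scaled to the control group size.
    variance = (
        p_baseline * (1.0 - p_baseline)
        + p_treatment * (1.0 - p_treatment) / ratio
    )
    n_control = math.ceil((z_alpha + z_power) ** 2 * variance / uplift_abs ** 2)
    n_treatment = math.ceil(ratio * n_control)
    return n_control, n_treatment


# Example call: 5% baseline, +0.5 percentage points, 80% power, alpha = 0.05.
n_control, n_treatment = compute_sample_size(0.05, 0.005, power=0.8, alpha=0.05)
```

Rounding up with `math.ceil` keeps the achieved power at or above the target. A pooled-variance variant or a continuity correction would give slightly different numbers.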
Hard · System Design
86 practiced
Design a canary deployment for a heavy deep-learning model that requires GPUs and increases inference latency. Address routing decisions between CPU fallback and GPU model, autoscaling, cost control, latency SLAs, data collection for metrics, and rollback strategies when latency or error budgets are exceeded.
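The ramp-and-rollback part of such a design could be sketched as a small decision function. Everything here is hypothetical: the stage list `RAMP_STAGES`, the function name `next_traffic_percent`, and the default latency and error thresholds are placeholders, not values from the question.

```python
# Hypothetical ramp schedule: percent of traffic routed to the GPU canary.
RAMP_STAGES = (1, 5, 25, 50, 100)


def next_traffic_percent(
    current_percent,
    p99_latency_ms,
    error_rate,
    latency_slo_ms=300.0,
    max_error_rate=0.01,
):
    """Return the next canary traffic percent, or 0 to signal a rollback."""
    # Guardrail breach: send all traffic back to the fallback model.
    if p99_latency_ms > latency_slo_ms or error_rate > max_error_rate:
        return 0
    # Healthy: advance to the next stage above the current share.
    for stage in RAMP_STAGES:
        if stage > current_percent:
            return stage
    # Already at full traffic.
    return RAMP_STAGES[-1]
```

A real controller would probably also require a minimum dwell time and sample count at each stage before advancing, and would check GPU cost and queue depth alongside latency.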
Easy · Technical
48 practiced
For a personalized feature that varies per device, compare the pros and cons of randomizing at the user, session, device, and cookie levels. For each unit, describe likely contamination modes, impact on estimator variance, and product-experience trade-offs. Recommend the best unit for long-lived personalization and justify your choice.
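Whichever unit is chosen, assignment is commonly made deterministic by hashing the unit's identifier, so the same unit always sees the same variant. The sketch below assumes a stable string ID is available for the chosen unit; `assign_variant` and its parameters are hypothetical names.

```python
import hashlib


def assign_variant(unit_id, experiment, variants=("control", "treatment")):
    """Deterministically map a randomization unit to a variant."""
    # Salting with the experiment name keeps assignments independent
    # across experiments that share the same unit IDs.
    key = f"{experiment}:{unit_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]


# The same user ID maps to the same variant on every call and every device.
variant = assign_variant("user-42", "personalized-feed-v2")
```

Passing a user ID rather than a device, session, or cookie ID is what gives a consistent experience across devices; the hashing itself is identical for every unit type.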
