InterviewStack.io

A/B Test Design Questions

Designing and running A/B (split) tests to evaluate product and feature changes. Candidates should be able to form clear null and alternative hypotheses; select primary and guardrail metrics that reflect both product goals and user safety; choose randomization and assignment strategies; and calculate sample size and test duration using power analysis and minimum-detectable-effect reasoning. They should understand applied statistical concepts including p-values, confidence intervals, one-tailed and two-tailed tests, sequential monitoring and stopping rules, and corrections for multiple comparisons. Practical abilities include diagnosing inconclusive or noisy experiments; detecting and mitigating common biases such as peeking, selection bias, novelty effects, seasonality, instrumentation errors, and network interference; and deciding when experiments are appropriate versus alternative evaluation methods. Senior candidates should reason about trade-offs between speed and statistical rigor, plan safe rollouts and ramping, define rollback plans, and communicate uncertainty and business implications to technical and non-technical stakeholders. For developer-facing products, candidates should also consider constraints such as small populations, cross-team effects, ethical concerns, and special instrumentation needs.

Medium · Technical
How would you detect and mitigate selection bias and novelty effects in an experiment where traffic comes from a new marketing campaign (new users more likely to be in treatment)? Describe design choices and statistical checks you would run post-hoc.
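One standard post-hoc check relevant here is a sample-ratio-mismatch (SRM) test: if new-campaign users land disproportionately in treatment, the observed assignment counts will deviate from the intended split. A minimal sketch with hypothetical counts and an intended 50/50 split:

```python
# Hypothetical assignment counts; a 50/50 split was intended
control_n, treatment_n = 49_100, 50_900
expected = (control_n + treatment_n) / 2

# Pearson chi-square statistic with 1 degree of freedom
chi2 = (control_n - expected) ** 2 / expected + (treatment_n - expected) ** 2 / expected

# 10.83 is the chi-square critical value for alpha = 0.001, df = 1;
# SRM checks typically use a strict alpha because any mismatch usually
# means broken randomization or selection bias, not chance
if chi2 > 10.83:
    print(f"SRM detected (chi2={chi2:.1f}); check assignment before trusting results")
```

An SRM flag does not tell you *why* assignment is skewed; segmenting the check by traffic source (campaign vs. organic) is a natural next step for the scenario above.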
Medium · Technical
You're testing 4 prompt templates for a generation model. Describe statistical corrections for multiple comparisons (Bonferroni, Holm, Benjamini-Hochberg FDR). Which would you choose to balance speed and false-positive control in product experiments and why?
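A sketch of two of the corrections named in the question, on hypothetical p-values for the four template comparisons, using only the standard procedures (Bonferroni's per-test threshold and the Benjamini-Hochberg step-up rule):

```python
# Hypothetical p-values from comparing 4 prompt templates against control
p_values = [0.004, 0.019, 0.032, 0.65]
alpha = 0.05
m = len(p_values)

# Bonferroni: reject only if p <= alpha / m (strict family-wise control)
bonferroni = [p <= alpha / m for p in p_values]

# Benjamini-Hochberg: sort p-values, find the largest rank k with
# p_(k) <= (k / m) * alpha, reject the k smallest (controls FDR,
# giving more power than Bonferroni at the same nominal alpha)
ranked = sorted(range(m), key=lambda i: p_values[i])
k_max = 0
for rank, i in enumerate(ranked, start=1):
    if p_values[i] <= rank / m * alpha:
        k_max = rank
bh = [False] * m
for rank, i in enumerate(ranked, start=1):
    bh[i] = rank <= k_max

print("Bonferroni rejections:", bonferroni)
print("BH (FDR) rejections:  ", bh)
```

With these example p-values, Bonferroni rejects only the strongest result while BH rejects three, which illustrates the speed-vs-false-positive trade-off the question asks about.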
Hard · Technical
Instrumentation logs show a 5% difference in event counts between control and treatment due to a logging bug in the treatment. How would you detect, quantify the bias introduced into metrics, and correct the analysis? Discuss when a correction is acceptable vs. when to rerun the experiment.
Medium · Technical
In Python, write code or outline the steps to estimate the required sample size per group for a binary conversion metric. Use baseline conversion p0=0.05, desired MDE (absolute) = 0.005, alpha=0.05, power=0.8. Show formulas or use an appropriate library and state assumptions.
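A minimal stdlib-only sketch of one acceptable answer, using the normal-approximation formula for a two-sided two-proportion test with the stated parameters (assumptions: independent samples, equal group sizes, normal approximation valid at these rates):

```python
from math import ceil
from statistics import NormalDist

# Parameters from the question
p0, mde, alpha, power = 0.05, 0.005, 0.05, 0.8
p1 = p0 + mde

z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value, ~1.96
z_beta = NormalDist().inv_cdf(power)           # power quantile, ~0.84

# Sum of the two groups' Bernoulli variances
var = p0 * (1 - p0) + p1 * (1 - p1)

# n per group = (z_alpha + z_beta)^2 * var / mde^2, rounded up
n_per_group = ceil((z_alpha + z_beta) ** 2 * var / mde ** 2)
print(n_per_group)  # on the order of 31k users per group
```

In practice a library such as statsmodels' power calculators gives the same answer with continuity corrections available; the key interview points are stating the formula, the two-sided test, and the variance assumptions.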
Hard · Technical
Discuss methods to estimate heterogeneous treatment effects (HTE) from A/B tests for personalization: uplift modeling, causal forests (e.g., GRF), and Bayesian hierarchical models. Compare their assumptions, interpretability, and how you'd validate discovered subgroups before personalizing treatments.
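The simplest member of the HTE family the question mentions is a per-segment difference in means from randomized data (effectively a T-learner whose "models" are segment lookup tables; real uplift models replace the lookup with fitted outcome models). A self-contained simulation sketch, with all segment names and effect sizes hypothetical:

```python
import random
from collections import defaultdict

random.seed(0)

# Simulated randomized experiment: treatment lifts conversion for
# "power" users by 10 points and does nothing for "casual" users
def simulate_user(segment: str, treated: int) -> int:
    base = 0.30 if segment == "power" else 0.10
    lift = 0.10 if (treated and segment == "power") else 0.0
    return 1 if random.random() < base + lift else 0

sums = defaultdict(lambda: [0, 0])  # (segment, treated) -> [conversions, n]
for _ in range(20_000):
    seg = random.choice(["power", "casual"])
    t = random.choice([0, 1])
    cell = sums[(seg, t)]
    cell[0] += simulate_user(seg, t)
    cell[1] += 1

# Per-segment treatment effect: difference of conversion means;
# randomization makes this an unbiased HTE estimate within segment
for seg in ["power", "casual"]:
    rate = {t: sums[(seg, t)][0] / sums[(seg, t)][1] for t in (0, 1)}
    print(f"{seg}: estimated uplift = {rate[1] - rate[0]:+.3f}")
```

The validation point the question raises shows up even here: with ~5k users per cell, the casual segment's estimate fluctuates around zero, so discovered subgroups need holdout confirmation before any personalization.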
