InterviewStack.io

Multi Armed Bandits and Experimentation Questions

Covers adaptive experimentation methods that trade off exploration and exploitation to optimize sequential decision making, and how they compare to traditional A/B testing. Core concepts include the exploration-versus-exploitation dilemma, regret minimization, reward modeling, and handling delayed or noisy feedback. Familiar algorithms and families to understand are epsilon-greedy, Upper Confidence Bound (UCB), Thompson sampling, and contextual bandit extensions that incorporate features or user context. Practical considerations include when to choose bandit approaches over fixed randomized experiments, designing reward signals and metrics, dealing with non-stationary environments and concept drift, safety and business constraints on exploration, offline evaluation and simulation, hyperparameter selection and tuning, deployment patterns for online learning, and reporting and interpretability of adaptive experiments. Applications include personalization, recommendation systems, online testing, dynamic pricing, and resource allocation.
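As a concrete illustration of the exploration/exploitation trade-off named above, here is a minimal epsilon-greedy loop. All names and the reward simulation are illustrative assumptions, not part of the question bank:

```python
import random

def epsilon_greedy(n_arms, reward_fn, epsilon=0.1, horizon=1000, seed=0):
    """Run `horizon` rounds: with probability epsilon pick a random arm
    (explore), otherwise pick the arm with the highest empirical mean
    (exploit). Returns per-arm pull counts and reward sums."""
    rng = random.Random(seed)
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for _ in range(horizon):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore uniformly
        else:
            # Unpulled arms get an infinite mean so each is tried at least once.
            means = [s / c if c else float("inf") for s, c in zip(sums, counts)]
            arm = max(range(n_arms), key=means.__getitem__)  # exploit
        counts[arm] += 1
        sums[arm] += reward_fn(arm, rng)
    return counts, sums
```

With two Bernoulli arms of rates 0.2 and 0.8, the loop concentrates pulls on the better arm while still sampling the worse one at roughly the epsilon rate.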

Medium · Technical
45 practiced
Write a SQL snippet that computes an IPS-weighted conversion rate estimate for a candidate policy that deterministically chooses action 'B', given logs with schema logs(user_id, action, propensity, reward). Assume propensity is the probability that the logging policy chose the logged action. Show the weighted estimator and a basic bootstrap variance estimate approach in SQL or pseudocode.
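One way to sanity-check the SQL before writing it is a small Python sketch of the same estimator: the IPS value of the always-'B' policy is the mean over all rows of reward/propensity for rows where 'B' was logged (zero otherwise), and the variance can be bootstrapped by resampling rows. Function names and the bootstrap settings here are illustrative assumptions:

```python
import random

def ips_estimate(logs):
    """IPS-weighted value of the policy that always plays 'B'.
    logs: list of (user_id, action, propensity, reward) tuples."""
    n = len(logs)
    total = sum(r / p for _, a, p, r in logs if a == "B")
    return total / n

def bootstrap_var(logs, n_boot=1000, seed=0):
    """Basic nonparametric bootstrap: resample rows with replacement,
    recompute the IPS estimate, and take the sample variance."""
    rng = random.Random(seed)
    ests = []
    for _ in range(n_boot):
        sample = [rng.choice(logs) for _ in range(len(logs))]
        ests.append(ips_estimate(sample))
    m = sum(ests) / n_boot
    return sum((e - m) ** 2 for e in ests) / (n_boot - 1)
```

The SQL version mirrors this: a SUM of reward/propensity filtered to action = 'B', divided by a COUNT over all rows, with the bootstrap done by repeated sampled aggregation.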
Medium · Technical
44 practiced
You're seeing signs of nonstationarity: an arm's conversion rate drifts downward over weeks. As the data analyst owning experiments, propose a detection and remediation plan that includes statistical tests, windowing strategies, and adaptive algorithm choices to reduce regret under drift.
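A simple building block for such a detection plan is a two-window comparison: test whether the recent window's mean reward differs from the preceding window's. The z-test and window size below are illustrative choices, not the only reasonable ones:

```python
from math import sqrt
from statistics import mean, stdev

def drift_z_score(rewards, window=200):
    """Compare the most recent `window` observations against the
    preceding `window` with a two-sample z-test on means. A large
    negative z suggests downward drift. Returns None if there is
    not enough data."""
    if len(rewards) < 2 * window:
        return None
    old = rewards[-2 * window:-window]
    new = rewards[-window:]
    se = sqrt(stdev(old) ** 2 / window + stdev(new) ** 2 / window)
    return (mean(new) - mean(old)) / se if se > 0 else 0.0
```

A strongly negative score would then trigger remediation such as shrinking the effective history (sliding-window or discounted UCB) or resetting the drifting arm's posterior.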
Easy · Technical
69 practiced
As a data analyst, define regret in the context of multi-armed bandits. Given an arm reward sequence and a known optimal arm reward of 0.6, explain how to compute cumulative regret for a sequence of observed rewards [0.4, 0, 0.6, 0.5]. Show the formula and one interpretation of the result for stakeholders.
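For the question's per-observation version of regret, cumulative regret is the sum of per-round gaps between the optimal reward and what was actually earned. A minimal sketch (function name is illustrative):

```python
def cumulative_regret(rewards, optimal=0.6):
    """Regret_T = sum over t of (optimal - r_t): the total reward lost
    versus always playing the best arm."""
    return sum(optimal - r for r in rewards)
```

For [0.4, 0, 0.6, 0.5] with optimal 0.6 this gives 0.2 + 0.6 + 0.0 + 0.1 = 0.9, i.e. about 0.9 units of reward were left on the table over four rounds of learning.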
Medium · Technical
42 practiced
Implement the UCB1 selection step in Python: given arrays counts (n_i) and sum_rewards (s_i) for k arms and current time t, produce the arm index to pull next using the standard UCB1 formula. Explain numerical stability concerns and tie-breaking behavior.
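A minimal sketch of the selection step, assuming the standard UCB1 bound mean_i + sqrt(2 ln t / n_i); guarding the n_i = 0 case handles both numerical stability (division by zero) and the convention of playing each arm once first, and `max` with a key breaks ties toward the lowest index:

```python
from math import log, sqrt

def ucb1_select(counts, sum_rewards, t):
    """Return the index of the arm maximizing s_i/n_i + sqrt(2 ln t / n_i).
    Unpulled arms (n_i == 0) are selected first; among equal scores the
    lowest index wins because max() returns the first maximum."""
    for i, n in enumerate(counts):
        if n == 0:
            return i  # play each arm once before trusting the bound
    scores = [s / n + sqrt(2.0 * log(t) / n)
              for n, s in zip(counts, sum_rewards)]
    return max(range(len(counts)), key=scores.__getitem__)
```

Note that log(t) assumes t >= 1; with heavily undersampled arms the exploration bonus dominates, so the arm pulled only once is chosen even when its empirical mean is higher elsewhere.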
Hard · Technical
41 practiced
You're responsible for ensuring fairness and regulatory compliance for a live bandit personalization system. As team lead, propose a governance plan that includes fairness metrics, auditing procedures, logging, and remediation pathways. Include how to present findings to legal and product teams.
