
Handling Class Imbalance Questions

Class imbalance arises when one class significantly outnumbers the others, as is common in fraud detection, churn prediction, and disease screening. The core problems: accuracy becomes misleading (a model can reach 95% accuracy trivially when 95% of examples belong to the negative class), and training is biased toward the majority class. Common remedies include resampling (undersampling the majority class, oversampling the minority class, or SMOTE), adjusting class weights in the loss function, choosing metrics that reflect minority-class performance (F1 or precision-recall instead of accuracy), and ensemble methods. At the junior level, you should be able to recognize an imbalance problem, explain why accuracy fails, and describe several ways to handle it.
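
As a quick illustration of why accuracy misleads and how class weights plus precision-recall metrics help, here is a minimal sketch (assuming scikit-learn; the synthetic 95/5 dataset and all numbers are illustrative):

```python
# Minimal sketch: a majority-class baseline scores ~95% accuracy but is useless,
# while class weights plus F1 / PR-AUC give a truer picture. Assumes scikit-learn.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, average_precision_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20_000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Baseline that always predicts the majority class: high accuracy, zero recall.
dummy = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
print("dummy accuracy:", accuracy_score(y_te, dummy.predict(X_te)))
print("dummy F1:      ", f1_score(y_te, dummy.predict(X_te), zero_division=0))

# Re-weight the loss so minority errors cost more, then judge with F1 / PR-AUC.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]
print("weighted F1:   ", f1_score(y_te, clf.predict(X_te)))
print("PR-AUC:        ", average_precision_score(y_te, proba))
```
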

Medium · Technical
Label noise scenario: minority-class labels are noisy due to human annotation errors. Given a limited labeling budget, provide a prioritized list of mitigation strategies: how would you estimate the noise rate, which examples would you relabel first, and which robust loss functions or training strategies would you apply?
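
One hedged sketch of the relabel-prioritization step, assuming scikit-learn: rank examples by how little confidence out-of-fold models place on their recorded label (the intuition behind confident learning), and use out-of-fold disagreement as a crude noise-rate estimate. The function name `rank_suspect_labels` and the gradient-boosting choice are illustrative, not prescribed by the question:

```python
# Flag likely-mislabeled examples for relabeling by ranking out-of-fold model
# confidence in the given label; lowest-confidence examples are relabeled first.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_predict

def rank_suspect_labels(X, y, budget, random_state=0):
    """Return indices of the `budget` most suspect labels plus a rough noise-rate
    estimate. Assumes y holds integer-encoded class labels."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=random_state)
    oof = cross_val_predict(
        GradientBoostingClassifier(random_state=random_state),
        X, y, cv=cv, method="predict_proba",
    )
    # Probability the out-of-fold model assigns to the *recorded* label;
    # low values are the most suspect and go to the front of the relabel queue.
    given_label_conf = oof[np.arange(len(y)), y]
    # Crude noise-rate estimate: fraction of examples whose out-of-fold
    # prediction disagrees with the recorded label.
    est_noise_rate = np.mean(oof.argmax(axis=1) != y)
    suspects = np.argsort(given_label_conf)[:budget]
    return suspects, est_noise_rate
```
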
Easy · Technical
Describe stratified sampling for train/validation/test splits and for cross-validation in the presence of class imbalance. Explain how StratifiedKFold preserves class ratios, what can go wrong if you use random splits, and how to handle extremely rare classes (e.g., <10 positives) when splitting data.
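
A short sketch of stratified splitting with scikit-learn (the 2% positive rate and fold count are illustrative; with fewer than ~10 positives, n_splits must not exceed the number of positives, or folds will contain no positives at all):

```python
# Stratified splits keep the class ratio in every partition, which a plain
# random split does not guarantee when positives are rare.
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

X = np.random.randn(1000, 5)
y = np.r_[np.ones(20, dtype=int), np.zeros(980, dtype=int)]  # 2% positives

# Stratified hold-out split: positives appear in both train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Stratified CV: with only 20 positives, keep n_splits <= number of positives.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (tr_idx, va_idx) in enumerate(skf.split(X, y)):
    print(f"fold {fold}: {y[va_idx].sum()} positives out of {len(va_idx)}")
```
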
Hard · Technical
Case study: You work on fraud detection where a false negative costs $10,000 and a false positive costs $200. The base rate of fraud is 0.05%. Assuming a calibrated probability model, compute the cost-minimizing decision threshold, explain how you would validate that threshold empirically, and describe how you would present the decision to executives, including confidence bounds on expected cost.
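
For reference, the threshold for a calibrated model follows from comparing expected costs per case; a worked computation under the stated costs (a sketch of the arithmetic, not the only way to present it):

```python
# For a calibrated probability p, flagging a case costs C_FP * (1 - p) in
# expectation, while letting it through costs C_FN * p. Flag whenever
# C_FN * p > C_FP * (1 - p), i.e. p > C_FP / (C_FP + C_FN).
C_FN = 10_000   # cost of a missed fraud (false negative)
C_FP = 200      # cost of a false alarm / manual review (false positive)
threshold = C_FP / (C_FP + C_FN)
print(f"flag when p > {threshold:.4f}")   # ~0.0196, far below the default 0.5

# The 0.05% base rate does not move this per-case threshold for a calibrated
# model, but it drives flag volume and total expected cost, which is what you
# would bootstrap on a labeled validation set to get confidence bounds for the
# executive summary.
```
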
Medium · Technical
Practical question: you must pick a decision threshold so that model recall >= 0.90 while minimizing false positives. Describe a reproducible approach using cross-validation and a held-out validation set to select the threshold under heavy class imbalance, how to estimate expected false positives per day given 1M daily events, and how to account for label lag when estimating real recall.
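
One possible sketch, assuming scikit-learn: take the largest out-of-fold threshold that still meets the recall constraint, then convert the implied false-positive rate into a daily count. The function name `pick_threshold`, the logistic-regression choice, and the assumption that daily traffic matches the validation class mix are all illustrative:

```python
# Select a threshold meeting a recall floor from cross-validated scores, then
# estimate expected false positives per day at a given traffic volume.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import StratifiedKFold, cross_val_predict

def pick_threshold(X, y, min_recall=0.90, daily_events=1_000_000, random_state=0):
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=random_state)
    scores = cross_val_predict(
        LogisticRegression(class_weight="balanced", max_iter=1000),
        X, y, cv=cv, method="predict_proba",
    )[:, 1]

    _, recall, thresholds = precision_recall_curve(y, scores)
    # recall[i] is the recall at thresholds[i] and is non-increasing, so take
    # the largest threshold index that still satisfies the constraint.
    ok = np.where(recall[:-1] >= min_recall)[0]
    t = thresholds[ok[-1]]

    # FP rate among negatives at that threshold, scaled to daily traffic under
    # the (stated) assumption that daily events share the validation class mix.
    fp_rate = np.mean(scores[y == 0] >= t)
    neg_share = np.mean(y == 0)
    expected_fp_per_day = fp_rate * neg_share * daily_events

    # Note: label lag means recent frauds may still be labeled negative, which
    # inflates the measured FP rate and overstates recall; validate on a window
    # old enough for labels to have matured.
    return t, expected_fp_per_day
```
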
Medium · Technical
Coding task (medium): implement a simplified SMOTE function in Python for numeric features. Signature: def simple_smote(X_minority, n_samples, k=5, random_state=None), returning a NumPy array of synthetic samples. Use Euclidean distance and nearest neighbors; discuss time and memory complexity and how to scale to large minority sets.
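
One possible implementation sketch using only NumPy: for each synthetic sample, pick a random minority anchor, pick one of its k nearest minority neighbors by Euclidean distance, and interpolate a random fraction of the way between them. The brute-force pairwise-distance step is O(n² · d) time and O(n²) memory, which is the main scaling concern for large minority sets:

```python
import numpy as np

def simple_smote(X_minority, n_samples, k=5, random_state=None):
    """Generate n_samples synthetic minority points by SMOTE-style interpolation.
    Assumes at least two minority examples with numeric features."""
    rng = np.random.default_rng(random_state)
    X = np.asarray(X_minority, dtype=float)
    n = X.shape[0]
    k = min(k, n - 1)  # cannot have more neighbors than other points

    # Brute-force pairwise squared Euclidean distances (quadratic in memory;
    # for large n, switch to chunked or approximate nearest-neighbor search).
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)                  # exclude self-distances
    neighbors = np.argsort(d2, axis=1)[:, :k]     # indices of k nearest points

    base = rng.integers(0, n, size=n_samples)               # random anchors
    nbr = neighbors[base, rng.integers(0, k, size=n_samples)]  # random neighbor per anchor
    gap = rng.random((n_samples, 1))                         # interpolation weights
    return X[base] + gap * (X[nbr] - X[base])
```
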
