InterviewStack.io

Debugging and Recovery Under Pressure Questions

Covers systematic approaches to finding and fixing bugs in time-pressured situations such as interviews, plus techniques for verifying correctness and recovering gracefully when an initial approach fails. Topics include reproducing the failure, isolating the minimal failing case, stepping through logic mentally or with print statements, and using binary search or divide and conquer to narrow down the fault. The emphasis is on careful assumption checking, invariant validation, and common error classes such as off-by-one errors, null or boundary conditions, integer overflow, and index errors. Verification practices include creating and running representative test cases: normal inputs, edge cases, empty and single-element inputs, duplicates, boundary values, large inputs, and randomized or stress tests when feasible. Time management and recovery strategies are also covered: prioritize the smallest fix that restores correctness, preserve working state, revert to a simpler correct solution if necessary, communicate reasoning aloud, avoid blind or random edits, and demonstrate calm, structured troubleshooting rather than panic. The goal is to show rigorous debugging methodology, build trust in the final solution through targeted verification, and display resilience and a clear recovery strategy under interview pressure.
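As an illustration of the randomized stress testing mentioned above, here is a minimal sketch that compares a candidate solution against a slow but obviously correct reference on small random inputs. The maximum-subarray problem, the function names, and the deliberately buggy variant are all assumptions chosen for the example; they do not come from the question bank.

```python
import random


def max_subarray_brute(nums):
    """O(n^2) reference: try every non-empty contiguous subarray."""
    best = nums[0]
    for i in range(len(nums)):
        total = 0
        for j in range(i, len(nums)):
            total += nums[j]
            best = max(best, total)
    return best


def max_subarray_fast(nums):
    """Kadane's algorithm, the candidate we want to trust."""
    best = current = nums[0]
    for x in nums[1:]:
        current = max(x, current + x)
        best = max(best, current)
    return best


def max_subarray_buggy(nums):
    """Hypothetical buggy variant: starting from 0 breaks all-negative inputs."""
    best = current = 0
    for x in nums:
        current = max(0, current + x)
        best = max(best, current)
    return best


def stress_test(candidate, reference, trials=500, max_len=8, seed=0):
    """Return the first random input where the two functions disagree, else None."""
    rng = random.Random(seed)  # fixed seed keeps any failure reproducible
    for _ in range(trials):
        n = rng.randint(1, max_len)
        nums = [rng.randint(-5, 5) for _ in range(n)]
        if candidate(nums) != reference(nums):
            return nums
    return None


if __name__ == "__main__":
    print("fast vs brute:", stress_test(max_subarray_fast, max_subarray_brute))
    print("buggy vs brute:", stress_test(max_subarray_buggy, max_subarray_brute))
```

Keeping the inputs short and the value range narrow is a deliberate choice: any counterexample the loop finds is already small enough to trace by hand.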

Hard · Technical
Implement a Python utility that finds the smallest batch size that causes model outputs to contain NaN values. The function signature is:
def find_nan_batch_size(run_fn, min_bs=1, max_bs=1024, retries=2):
    """run_fn(batch_size) -> True if run succeeded without NaN, False otherwise."""
Return the minimal batch size that causes NaN, or None if no such batch size is found. Consider flaky runs, timeouts, and efficiency (use exponential + binary search). Provide robust code.
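One possible shape for an answer is sketched below; it is not a reference solution. It rests on three assumptions that the question does not state: failures are monotonic (if a batch size produces NaN, every larger one does too), a batch size counts as failing only when every attempt fails, and `run_fn` enforces its own timeout by raising an exception or returning False. The helper `_fails_consistently` is a hypothetical name introduced for the sketch.

```python
def _fails_consistently(run_fn, batch_size, retries):
    """True only if every attempt (1 + retries) fails or raises."""
    for _ in range(retries + 1):
        try:
            if run_fn(batch_size):
                return False  # assumption: one clean run marks this size as good
        except Exception:
            # Assumption: timeouts and crashes surface as exceptions and
            # count as a failed attempt rather than aborting the search.
            pass
    return True


def find_nan_batch_size(run_fn, min_bs=1, max_bs=1024, retries=2):
    """Return the smallest failing batch size in [min_bs, max_bs], or None.

    Assumes monotonic failure: if size b fails, every size above b fails.
    """
    if min_bs < 1 or max_bs < min_bs:
        raise ValueError("require 1 <= min_bs <= max_bs")

    cache = {}  # avoid re-running expensive probes for the same size

    def fails(bs):
        if bs not in cache:
            cache[bs] = _fails_consistently(run_fn, bs, retries)
        return cache[bs]

    # Exponential phase: double until a failing size appears or max_bs passes.
    last_good = None
    bs = min_bs
    while not fails(bs):
        last_good = bs
        if bs == max_bs:
            return None  # even the largest allowed size ran cleanly
        bs = min(bs * 2, max_bs)

    # Binary phase: the smallest failing size lies in (last_good, bs].
    lo = min_bs if last_good is None else last_good + 1
    hi = bs
    while lo < hi:
        mid = (lo + hi) // 2
        if fails(mid):
            hi = mid
        else:
            lo = mid + 1
    return lo
```

If NaNs appear non-deterministically, the "one clean run is good" policy will under-report; flipping the helper to treat any failed attempt as a failure is the opposite trade-off and is worth stating aloud in an interview.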
Hard · Technical
A long 48-hour training job produced corrupted checkpoint files at the end. You have limited compute budget to retrain. Propose a practical recovery strategy to salvage model performance: consider partial checkpoint repair, warm-starting, curriculum retraining, and data augmentation or distillation to reduce retrain time.
Medium · Technical
You just rolled out a new model that causes user-visible regressions. Design a short recovery plan that reverts to a stable model version with minimal downtime and full traceability. Include steps for verification, canarying, rollback, and post-rollback validation.
Medium · Technical
A production model shows an abrupt 10% drop in accuracy. You have 1 hour to identify root cause and propose a mitigation plan. Provide a prioritized timeline with the most important checks, the data and artifacts you would inspect, and short-term mitigations to restore service while investigating.
Medium · Technical
You discover the evaluation pipeline uses a different tokenizer than inference, producing inconsistent metrics. Under time pressure, give a step-by-step plan to align evaluation and inference quickly, verify correctness on a small set, and prevent this class of mismatch from recurring.
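The "verify correctness on a small set" step could look something like the sketch below, which runs both tokenizers over a handful of sample strings and reports where they first diverge. The callables `eval_tokenize` and `infer_tokenize` and the toy whitespace tokenizers are hypothetical stand-ins; a real check would pass in whatever the two pipelines actually call.

```python
def find_tokenizer_mismatches(eval_tokenize, infer_tokenize, samples):
    """Return (text, first_diff_index) for every sample the tokenizers split differently."""
    mismatches = []
    for text in samples:
        eval_tokens = list(eval_tokenize(text))
        infer_tokens = list(infer_tokenize(text))
        if eval_tokens == infer_tokens:
            continue
        # First position where the sequences differ; if one is a prefix of
        # the other, that position is the length of the shorter sequence.
        first_diff = min(len(eval_tokens), len(infer_tokens))
        for i, (a, b) in enumerate(zip(eval_tokens, infer_tokens)):
            if a != b:
                first_diff = i
                break
        mismatches.append((text, first_diff))
    return mismatches


# Toy stand-ins for illustration only: one pipeline lowercases, the other does not.
def toy_eval_tokenize(text):
    return text.lower().split()


def toy_infer_tokenize(text):
    return text.split()


if __name__ == "__main__":
    samples = ["already lowercase", "Mixed Case input", ""]
    for text, index in find_tokenizer_mismatches(
        toy_eval_tokenize, toy_infer_tokenize, samples
    ):
        print(f"mismatch at token {index}: {text!r}")
```

Running the same function as a test in continuous integration, with a fixed sample file, is one way to keep this class of mismatch from returning unnoticed.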
