Debugging and Troubleshooting AI Systems Questions
Covers systematic approaches to finding and fixing failures in machine learning and artificial intelligence systems. Topics include common failure modes such as poor data quality, incorrect preprocessing, label errors, data leakage, training instability, vanishing or exploding gradients, numerical precision issues, overfitting and underfitting, optimizer and hyperparameter problems, model capacity mismatch, implementation bugs, hardware and memory failures, and production environment issues. Skills and techniques include data validation and exploratory data analysis, unit tests and reproducible experiments, sanity checks and simplified models, gradient checks and plotting training dynamics, visualizing predictions and errors, ablation studies and feature importance analysis, logging and instrumentation, profiling for latency and memory, isolating components with canary or shadow deployments, rollback and mitigation strategies, monitoring for concept drift, and applying root cause analysis until the underlying cause is found. Interviewers assess the candidate's debugging process, ability to isolate issues, use of tools and metrics for diagnosis, understanding of the trade-offs between candidate fixes, and approach to preventing similar failures in future iterations.
Hard | System Design
66 practiced
Design a centralized observability system that collects and aggregates signals from thousands of distributed training jobs (loss curves, gradient norms, NaN counters, GPU metrics) so SREs can detect training instability patterns and root causes. Describe core components, data schemas, retention strategy, rollup/aggregation logic for high-cardinality job labels, and how to balance fidelity and storage cost.
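One possible shape for the rollup logic is sketched below. It is an illustration under stated assumptions, not a reference design: the `MetricPoint` schema, the `team` label, and the 100-step window are all hypothetical. The idea shown is that raw points keep the high-cardinality `job_id`, while the rollup keeps only a low-cardinality label plus a distinct-job count, which is one way to trade fidelity for storage cost.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricPoint:
    """Hypothetical raw sample emitted by one training job at one step."""
    job_id: str       # high-cardinality label, dropped in the rollup
    team: str         # low-cardinality label, kept in the rollup
    step: int
    loss: float
    grad_norm: float


def rollup(points, window=100):
    """Aggregate raw points into (team, step window) buckets."""
    buckets = defaultdict(lambda: {
        "count": 0, "non_finite": 0, "loss_sum": 0.0,
        "max_grad_norm": 0.0, "jobs": set(),
    })
    for p in points:
        b = buckets[(p.team, p.step // window)]
        b["count"] += 1
        b["jobs"].add(p.job_id)
        if math.isfinite(p.loss):
            b["loss_sum"] += p.loss
        else:
            b["non_finite"] += 1  # NaN or inf loss
        if math.isfinite(p.grad_norm):
            b["max_grad_norm"] = max(b["max_grad_norm"], p.grad_norm)

    summary = {}
    for key, b in buckets.items():
        finite = b["count"] - b["non_finite"]
        summary[key] = {
            "count": b["count"],
            "non_finite_count": b["non_finite"],
            "mean_loss": b["loss_sum"] / finite if finite else None,
            "max_grad_norm": b["max_grad_norm"],
            "distinct_jobs": len(b["jobs"]),
        }
    return summary


points = [
    MetricPoint("job-a", "vision", 10, 2.0, 1.0),
    MetricPoint("job-b", "vision", 50, 4.0, 3.0),
    MetricPoint("job-b", "vision", 60, float("nan"), float("inf")),
    MetricPoint("job-a", "vision", 150, 1.0, 0.5),
]
summary = rollup(points)
```

In a design like this, raw per-job points would typically be retained for a short period and the rollups for much longer; the exact retention tiers are a judgment call that the answer should justify.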
Hard | Technical
45 practiced
One hour before a critical release the main training pipeline fails because the dataset snapshot is corrupted. As acting SRE lead, outline triage steps, options (delay release, retrain on older snapshot, remove problematic features), trade-offs for each option, and your communication strategy to stakeholders including what to document for the postmortem.
Medium | Technical
42 practiced
A large backfill updated historical features, and a subsequent batch of predictions shows degraded performance. How would you validate the backfill to determine whether partitioning bugs or timezone conversions introduced incorrect values? Provide concrete checks, sample queries, and quick rollback steps to limit user impact.
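Two of the concrete checks could look like the sketch below. It assumes, hypothetically, that each row carries a timezone-aware event timestamp and a `YYYY-MM-DD` partition string, and that partitions are meant to be keyed by UTC date; the function names and the 1% tolerance are illustrative only. The first check flags rows whose stored partition disagrees with the UTC date of the event, the typical signature of a local-time conversion bug. The second compares per-partition row counts before and after the backfill.

```python
from collections import Counter
from datetime import datetime, timezone


def partition_mismatches(rows):
    """Count rows whose stored partition differs from the UTC event date.

    rows: iterable of (timezone-aware datetime, 'YYYY-MM-DD' partition string).
    Returns a Counter keyed by (expected_partition, stored_partition).
    """
    mismatches = Counter()
    for event_ts, stored_partition in rows:
        expected = event_ts.astimezone(timezone.utc).date().isoformat()
        if expected != stored_partition:
            mismatches[(expected, stored_partition)] += 1
    return mismatches


def row_count_deltas(before, after, tolerance=0.01):
    """Return partitions whose row count changed by more than `tolerance`.

    before/after: dicts mapping partition string to row count.
    """
    flagged = {}
    for part in sorted(set(before) | set(after)):
        b, a = before.get(part, 0), after.get(part, 0)
        if abs(a - b) / max(b, 1) > tolerance:
            flagged[part] = (b, a)
    return flagged


rows = [
    # 02:30 UTC on March 1 written into the previous day's partition,
    # as would happen if a US local date had been used for partitioning.
    (datetime(2024, 3, 1, 2, 30, tzinfo=timezone.utc), "2024-02-29"),
    (datetime(2024, 3, 1, 12, 0, tzinfo=timezone.utc), "2024-03-01"),
]
mismatches = partition_mismatches(rows)
deltas = row_count_deltas(
    {"2024-02-29": 1000, "2024-03-01": 1000},
    {"2024-02-29": 1100, "2024-03-01": 900},
)
```

Rows migrating from one day's partition into its neighbor, with the total roughly unchanged, point toward a timezone shift rather than data loss.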
Easy | Technical
36 practiced
Describe three practical techniques (automated and manual) to detect label errors in a supervised dataset of 10 million examples intended for a binary classifier. For each technique explain why it surfaces label errors, how you prioritize samples for human review given limited labeling budget, and an example metric that would guide triage.
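One automated technique can be sketched as follows, assuming out-of-fold predicted probabilities are already available from cross-validation (the function name and inputs are hypothetical). Examples are ranked by the probability the model assigns to their given label; the lowest-scoring examples are where a held-out model most confidently disagrees with the annotation, so they go to human review first.

```python
def rank_label_suspects(labels, oof_probs, budget):
    """Return indices of the `budget` examples most likely to be mislabeled.

    labels:    list of 0/1 given labels.
    oof_probs: out-of-fold predicted P(y = 1) for each example, produced by
               a model that never trained on that example.
    """
    scored = []
    for index, (label, prob_positive) in enumerate(zip(labels, oof_probs)):
        # Probability the model assigns to the label the annotator gave.
        prob_of_given_label = prob_positive if label == 1 else 1.0 - prob_positive
        scored.append((prob_of_given_label, index))
    scored.sort()  # lowest self-confidence first
    return [index for _, index in scored[:budget]]


labels = [1, 0, 1, 0]
oof_probs = [0.95, 0.90, 0.10, 0.20]
suspects = rank_label_suspects(labels, oof_probs, budget=2)
```

A useful triage metric under this scheme is the fraction of reviewed suspects that reviewers actually relabel: if it stays high as the review goes deeper into the ranking, more budget is likely justified.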
Medium | Technical
40 practiced
Inference pods for a model show memory usage slowly drifting upward until the container is OOM-killed and restarted. Describe a stepwise approach to determine whether the leak is in model code, a third-party library, the Python runtime or garbage collector, or in external system resources. Include the commands, profiling techniques (e.g., tracemalloc, pmap), and quick mitigations you would apply during the incident.
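As a minimal sketch of the tracemalloc step, the snippet below takes two snapshots and reports the source lines whose allocations grew the most. The `_leak` list is a stand-in for a leaking request handler, and `top_growth` is a hypothetical helper. Note that tracemalloc only sees memory allocated through Python's allocator; if process RSS grows while these numbers stay flat, the leak is more likely in native code, which is where tools such as pmap come in.

```python
import tracemalloc


def top_growth(before, after, limit=5):
    """Return (location, bytes_grown) for the lines with the largest growth."""
    diffs = after.compare_to(before, "lineno")
    return [
        (str(diff.traceback[0]), diff.size_diff)
        for diff in diffs[:limit]
        if diff.size_diff > 0
    ]


tracemalloc.start()
before = tracemalloc.take_snapshot()

# Stand-in for a leak: objects that are kept alive across requests.
_leak = [bytes(1024) for _ in range(2000)]

after = tracemalloc.take_snapshot()
growth = top_growth(before, after)
tracemalloc.stop()

for location, grown in growth:
    print(f"{location}: +{grown / 1024:.0f} KiB")
```

In a live pod the two snapshots would be taken minutes apart under real traffic, so that steady growth stands out from normal allocation churn.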