InterviewStack.io

Systematic Troubleshooting and Debugging Questions

Covers structured methods for diagnosing and resolving software defects and technical problems at the code and system level. Candidates should demonstrate methodical debugging practices such as reading and reasoning about code, tracing execution paths, reproducing issues, collecting and interpreting logs, metrics, and error messages, forming and testing hypotheses, and iterating toward the root cause. The topic includes use of diagnostic tools and commands, isolation strategies, instrumentation and logging best practices, regression testing and validation, trade-offs between quick fixes and long-term robust solutions, rollback and safe testing approaches, and clear documentation of investigative steps and outcomes.

Medium · Technical
Given two snapshots of a table:
snapshot_prev(date DATE, id STRING, value FLOAT)
snapshot_curr(date DATE, id STRING, value FLOAT)
Write an ANSI SQL query that outputs rows categorized as 'added', 'removed', or 'changed', showing id, prev_value, curr_value. Also include total counts per category. Then describe strategies to scale this diff for 100M rows efficiently (partitioning, hashing, checksums, incremental comparisons) and how to validate diffs in a cost-effective way.
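As a starting point, the core of such a diff can be sketched with two LEFT JOINs combined by UNION ALL, which avoids relying on FULL OUTER JOIN support. The sketch below runs the query against an in-memory SQLite database from Python purely for illustration. The function name `diff_snapshots` and the sample rows are hypothetical, `value` is assumed to be non-NULL, and the per-category counts are computed in Python rather than in SQL to keep the example short.

```python
import sqlite3
from collections import Counter

# Rows present in prev but missing or different in curr, plus rows only in curr.
# Assumes id is unique within each snapshot and value is never NULL.
DIFF_SQL = """
SELECT p.id AS id,
       p.value AS prev_value,
       c.value AS curr_value,
       CASE WHEN c.id IS NULL THEN 'removed' ELSE 'changed' END AS category
FROM snapshot_prev p
LEFT JOIN snapshot_curr c ON c.id = p.id
WHERE c.id IS NULL OR c.value <> p.value
UNION ALL
SELECT c.id, NULL, c.value, 'added'
FROM snapshot_curr c
LEFT JOIN snapshot_prev p ON p.id = c.id
WHERE p.id IS NULL
ORDER BY id
"""


def diff_snapshots(prev_rows, curr_rows):
    """Load both snapshots into SQLite and return (diff_rows, counts_per_category)."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE snapshot_prev(date TEXT, id TEXT, value REAL)")
        conn.execute("CREATE TABLE snapshot_curr(date TEXT, id TEXT, value REAL)")
        conn.executemany("INSERT INTO snapshot_prev VALUES (?, ?, ?)", prev_rows)
        conn.executemany("INSERT INTO snapshot_curr VALUES (?, ?, ?)", curr_rows)
        rows = conn.execute(DIFF_SQL).fetchall()
    finally:
        conn.close()
    counts = Counter(row[3] for row in rows)
    return rows, dict(counts)


if __name__ == "__main__":
    prev = [("2024-01-01", "a", 1.0), ("2024-01-01", "b", 2.0), ("2024-01-01", "c", 3.0)]
    curr = [("2024-01-02", "a", 1.0), ("2024-01-02", "b", 2.5), ("2024-01-02", "d", 4.0)]
    diff_rows, totals = diff_snapshots(prev, curr)
    for row in diff_rows:
        print(row)
    print(totals)
```

At 100M rows the same join shape still applies, but comparing a per-row hash or per-partition checksum first can limit the join to partitions that actually differ.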
Hard · System Design
Design a strategy to correlate telemetry and traces across a data pipeline where some components (legacy systems or third-party services) cannot be instrumented. Explain how you'd propagate correlation IDs, create deterministic fingerprints for records, enrich logs at ingress/egress points, and store minimal contextual metadata to reconstruct end-to-end flows. Address storage overhead, privacy considerations, and how to query for root cause during an investigation.
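One way to approach the deterministic fingerprint part is to hash a canonical serialization of the fields that identify a record, so that the same record produces the same fingerprint at ingress and egress even when an uninstrumented component sits in between. The sketch below is a minimal illustration under assumptions: the function name `record_fingerprint`, the field names, and the choice of SHA-256 over canonical JSON are all hypothetical, not a prescribed design.

```python
import hashlib
import json


def record_fingerprint(record, key_fields):
    """Return a stable hex fingerprint built from the chosen identifying fields.

    Only key_fields are hashed, so fields that a downstream system rewrites
    (timestamps, enrichment columns) do not change the fingerprint.
    """
    subset = {field: record.get(field) for field in key_fields}
    # Sorted keys and fixed separators make the serialization independent of
    # dict ordering; default=str handles simple non-JSON types such as dates.
    canonical = json.dumps(subset, sort_keys=True, separators=(",", ":"), default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    at_ingress = {"order_id": "o-17", "customer": "c-9", "amount": 42.5}
    at_egress = {"amount": 42.5, "customer": "c-9", "order_id": "o-17", "processed_by": "legacy"}
    keys = ["order_id", "customer", "amount"]
    print(record_fingerprint(at_ingress, keys) == record_fingerprint(at_egress, keys))
```

Storing only the fingerprint plus a timestamp and stage name at each boundary keeps the overhead small. If the key fields contain personal data, a keyed hash (for example HMAC) may be preferable to a plain digest.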
Hard · Technical
Discuss trade-offs between capturing full traces for all requests versus sampling traces in large-scale data systems. Cover storage and ingestion costs, overhead on production services, ability to detect tail-latency or rare failures, tail-sampling and adaptive sampling techniques, and how to stitch partial trace samples with logs and metrics to aid debugging.
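To make the tail-sampling idea concrete, the sketch below shows one possible keep-or-drop decision made after a trace has completed: always keep traces that errored or exceeded a latency threshold, and keep a small deterministic fraction of the rest. The function name `keep_trace`, the threshold, and the baseline rate are illustrative assumptions rather than values from any particular tracing system.

```python
import hashlib


def keep_trace(trace_id, duration_ms, has_error,
               latency_threshold_ms=1000.0, baseline_rate=0.01):
    """Decide whether a completed trace should be retained.

    Errors and slow traces are always kept so rare failures and tail latency
    stay visible. Other traces are kept at baseline_rate, using a hash of the
    trace ID so every collector makes the same decision for the same trace.
    """
    if has_error or duration_ms >= latency_threshold_ms:
        return True
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return bucket < baseline_rate * 2**64


if __name__ == "__main__":
    print(keep_trace("trace-slow", duration_ms=4200.0, has_error=False))
    print(keep_trace("trace-error", duration_ms=12.0, has_error=True))
    kept = sum(keep_trace(f"trace-{i}", 20.0, False) for i in range(10_000))
    print(f"kept {kept} of 10000 ordinary traces")
```

Because the decision depends only on the trace ID, logs that carry the same ID can be checked against the retained set later without any coordination between services.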
Hard · System Design
Design a self-healing observability system for data pipelines that, when an alert fires, automatically runs a set of diagnostic checks, produces a ranked list of likely root causes, and optionally triggers low-risk mitigations (for example, restart a worker or throttle producers). Requirements: support 100k pipeline runs/day, identify top 3 probable causes within 2 minutes, keep privacy constraints, and limit auto-actions to low-risk fixes. Describe architecture, choice of rule-based vs ML-based diagnosis, data retention for historical incidents, safety guardrails, and monitoring for false positives of auto-actions.
Medium · Technical
A Spark ETL job that used to finish in 1 hour now takes 10 hours after a recent code change. Walk through a structured approach to diagnose the regression. Specify which Spark UI tabs and metrics you would inspect (stages, tasks, shuffle read/write, memory and disk spill, GC times), which executor and OS-level metrics to check, and what targeted experiments you would run to isolate the change (for example, compare DAG plans, run smaller subsets, check data cardinalities).
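A useful first step in this kind of investigation is to line up per-stage durations from a known-good run against the regressed run and rank the stages by slowdown, which narrows down where to look in the Spark UI. The sketch below assumes the durations have already been collected by hand into plain dictionaries. The function name `rank_stage_regressions`, the stage labels, and the numbers are hypothetical, and nothing here calls a Spark API.

```python
def rank_stage_regressions(baseline, regressed, min_ratio=2.0):
    """Return stages whose duration grew by at least min_ratio, worst first.

    baseline and regressed map a stage label to its duration in seconds.
    Stages missing from the baseline are skipped here; in practice a brand-new
    stage is itself worth investigating.
    """
    findings = []
    for stage, new_secs in regressed.items():
        old_secs = baseline.get(stage)
        if old_secs is None or old_secs <= 0:
            continue
        ratio = new_secs / old_secs
        if ratio >= min_ratio:
            findings.append((stage, old_secs, new_secs, ratio))
    findings.sort(key=lambda item: item[3], reverse=True)
    return findings


if __name__ == "__main__":
    # Hypothetical durations in seconds for a 1-hour run and a 10-hour run.
    before = {"read": 600, "join": 1800, "write": 1200}
    after = {"read": 630, "join": 32400, "write": 2880}
    for stage, old, new, ratio in rank_stage_regressions(before, after):
        print(f"{stage}: {old}s -> {new}s ({ratio:.1f}x)")
```

The same comparison can be repeated for shuffle bytes, spill, or GC time per stage to distinguish a plan change from data skew or memory pressure.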
