InterviewStack.io

Systematic Troubleshooting and Debugging Questions

Covers structured methods for diagnosing and resolving software defects and technical problems at the code and system level. Candidates should demonstrate methodical debugging practices such as reading and reasoning about code, tracing execution paths, reproducing issues, collecting and interpreting logs, metrics, and error messages, forming and testing hypotheses, and iterating toward root cause. The topic includes use of diagnostic tools and commands, isolation strategies, instrumentation and logging best practices, regression testing and validation, trade-offs between quick fixes and long-term robust solutions, rollback and safe-testing approaches, and clear documentation of investigative steps and outcomes.

Medium · Technical
31 practiced
A scheduled Airflow task fails with ModuleNotFoundError: No module named 'analytics.transforms' even though the DAG parses. Upstream tasks succeeded. Describe how you would debug packaging and dependency issues in Airflow: inspect the worker's Python environment, container image tags, volume mounts, requirements files, and DAG import behavior. Explain fixes in a KubernetesExecutor environment and describe how to ensure a consistent runtime across all workers.
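One answer to this question usually starts by confirming what the worker interpreter can actually import. A minimal sketch of that check, which could be run inside a worker pod (for example via kubectl exec into the pod and invoking the same Python the task uses); the module name comes from the question, everything else is generic stdlib:

```python
import importlib.util
import sys

def locate_module(name):
    """Report where an import of `name` would resolve in this interpreter,
    or None if it is not importable (packaging / image / mount problem)."""
    try:
        spec = importlib.util.find_spec(name)
    except ModuleNotFoundError:
        return None  # the parent package itself is missing
    if spec is None:
        return None  # not anywhere on sys.path
    return spec.origin  # filesystem path the import would load from

# Confirm which interpreter is running and what it can see.
print(sys.executable)
print(locate_module("json"))                  # stdlib: always resolvable
print(locate_module("analytics.transforms"))  # None => module absent in this env
```

If the module resolves on the scheduler (so the DAG parses) but returns None on the worker, the two environments are built from different images or mounts, which is the core of the question.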
Easy · Technical
39 practiced
You are alerted that the nightly job produced zero rows for date '2025-11-29' in table analytics.events. Outline the steps you would take to reproduce the issue locally and to validate raw source data, connector health, ingestion offsets, transformation logic, and sink behavior. List specific commands or tools you would run (for example: aws s3 ls, gsutil ls, kafka-consumer-groups.sh, spark-submit with local mode, Airflow task test) and how you'd isolate whether the failure is in the source, ingestion connector, transformation, or sink.
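The commands above each measure one stage of the pipeline; the isolation step is simply walking those measurements from source to sink and finding where the data first disappears. A small, hypothetical Python sketch of that reasoning (stage names and counts are illustrative, not tied to any tool):

```python
def first_empty_stage(stage_counts):
    """Given ordered (stage, row_count) pairs from source to sink,
    return the first stage that produced zero rows, or None if all
    stages are non-empty (i.e., the pipeline looks healthy)."""
    for stage, count in stage_counts:
        if count == 0:
            return stage
    return None

# Example: counts gathered from source listing, consumer-group offsets,
# a local transform run, and a sink query for the affected date.
counts = [
    ("source", 120_000),
    ("ingest", 120_000),
    ("transform", 0),   # first zero: the bug is here or just upstream of here
    ("sink", 0),
]
print(first_empty_stage(counts))
```

The value of the exercise is that every downstream zero is explained by the first one, so you stop debugging the sink and focus on the transform.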
Medium · Technical
36 practiced
Explain how you'd implement runtime data assertions in a pipeline. Example assertions: 'user_id NOT NULL', 'email matches regex', 'timestamp within last 30 days'. Where in the pipeline would you run these validations (ingest, transform, sink), how would you handle failing records (reject, quarantine, alert), and how would you surface assertion failures to developers with actionable diagnostics?
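The three example assertions can be expressed as plain predicates over a record, with failing records quarantined alongside the names of the checks they failed. A minimal sketch, assuming dict-shaped records with hypothetical field names user_id, email, and ts (a timezone-aware datetime):

```python
import re
from datetime import datetime, timedelta, timezone

# Each assertion maps a name to a predicate over one record.
ASSERTIONS = {
    "user_id_not_null": lambda r: r.get("user_id") is not None,
    "email_format": lambda r: bool(
        re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", r.get("email") or "")
    ),
    "timestamp_recent": lambda r: r.get("ts") is not None
        and datetime.now(timezone.utc) - r["ts"] <= timedelta(days=30),
}

def validate(record):
    """Return the names of every assertion this record fails."""
    return [name for name, check in ASSERTIONS.items() if not check(record)]

def partition(records):
    """Split records into (valid, quarantined); quarantined entries carry
    their failure reasons so developers get actionable diagnostics."""
    valid, quarantined = [], []
    for r in records:
        failures = validate(r)
        if failures:
            quarantined.append({"record": r, "failed": failures})
        else:
            valid.append(r)
    return valid, quarantined
```

Recording the failed assertion names (rather than just dropping the row) is what makes quarantine actionable: the alert can say which contract broke and on how many records.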
Easy · Technical
36 practiced
Explain common causes of schema drift in production data sources and describe detection and prevention techniques. Include schema validation, consumer-driven contracts, versioning strategies (e.g., Avro/Confluent schema registry), feature flagging for schema changes, and safe migration patterns like dual-read, backfill, and graceful evolution for optional fields.
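Detection in its simplest form is a diff between the schema a consumer expects and the schema the source is now emitting. A minimal sketch, assuming schemas are available as column-name-to-type maps (a registry such as Confluent's exposes richer compatibility checks than this):

```python
def diff_schemas(expected, observed):
    """Compare two {column: type} maps and classify the drift.
    Added optional columns are usually safe; removed or type-changed
    columns are the breaking cases that should block deployment."""
    added = sorted(set(observed) - set(expected))
    removed = sorted(set(expected) - set(observed))
    type_changed = sorted(
        col for col in set(expected) & set(observed)
        if expected[col] != observed[col]
    )
    return {"added": added, "removed": removed, "type_changed": type_changed}
```

Running a check like this in CI against the current production schema is one way to turn drift from a runtime surprise into a reviewable change.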
Medium · Technical
36 practiced
You're getting java.lang.OutOfMemoryError on Spark executors during an aggregation. Given this PySpark snippet:
    rdd = sc.textFile('s3://bucket/largefile')
    pairs = rdd.map(parse).map(lambda x: (x.key, x.value))
    result = pairs.groupByKey().mapValues(lambda vals: sum(vals)).collect()
Explain why this code can cause OOM and rewrite it to be memory-efficient. Discuss configuration changes (spark.executor.memory, spark.memory.fraction) and the trade-offs between reduceByKey, aggregateByKey, combineByKey, and groupByKey.
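The answer this question is looking for is typically to replace groupByKey().mapValues(sum) with reduceByKey (e.g. pairs.reduceByKey(operator.add)), because groupByKey materializes every value for a key in executor memory while reduceByKey combines map-side first. Since the difference is an algorithmic one, it can be illustrated without a cluster; a pure-Python sketch of the two strategies over a list of partitions (not PySpark itself):

```python
from collections import defaultdict
from operator import add

def group_then_sum(partitions):
    """groupByKey-style: gather every value per key before summing.
    A hot key's entire value list is held in memory at once, which is
    what drives executor OOM at scale."""
    grouped = defaultdict(list)
    for part in partitions:
        for k, v in part:
            grouped[k].append(v)
    return {k: sum(vs) for k, vs in grouped.items()}

def reduce_by_key(partitions, combine=add):
    """reduceByKey-style: combine within each partition first (map-side
    combine), then merge the per-partition partials, so only one running
    value per key is ever held. Assumes 0 is the identity for `combine`."""
    partials = []
    for part in partitions:
        acc = defaultdict(int)
        for k, v in part:
            acc[k] = combine(acc[k], v)
        partials.append(acc)
    merged = defaultdict(int)
    for acc in partials:
        for k, v in acc.items():
            merged[k] = combine(merged[k], v)
    return dict(merged)
```

Both produce identical totals; the difference is that the reduce-style version ships and holds one partial sum per key per partition instead of every raw value, which is also why less data crosses the shuffle.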
