Systematic Troubleshooting and Debugging Questions

Covers structured methods for diagnosing and resolving software defects and technical problems at the code and system level. Candidates should demonstrate methodical debugging practices such as reading and reasoning about code, tracing execution paths, reproducing issues, collecting and interpreting logs, metrics, and error messages, forming and testing hypotheses, and iterating toward root cause. The topic includes use of diagnostic tools and commands, isolation strategies, instrumentation and logging best practices, regression testing and validation, trade-offs between quick fixes and long-term robust solutions, rollback and safe testing approaches, and clear documentation of investigative steps and outcomes.

Hard: System Design

Design a safe migration and rollback plan for changing the partitioning scheme and primary key of a petabyte-scale analytics table in a data warehouse (for example, BigQuery or Redshift). Requirements: minimize downtime, support reads during migration, allow validation of backfilled data, and enable rollback to the previous format within 24 hours if issues are found. Describe dual-writing, shadow tables, incremental backfills, reconciliation queries, and cost considerations.
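
One way to picture the reconciliation step is a per-partition comparison of row counts and order-independent checksums between the old table and the shadow table. The sketch below is plain Python over in-memory rows, not warehouse SQL; the function names, the `day` bucket column, and the sample rows are all hypothetical. In a real warehouse the same idea would be expressed as a grouped aggregate query on each table.

```python
import hashlib
from collections import defaultdict

def row_fingerprint(row):
    # Stable 64-bit hash of a row, independent of column order in the dict.
    payload = repr(sorted(row.items())).encode("utf-8")
    return int.from_bytes(hashlib.sha256(payload).digest()[:8], "big")

def bucket_summaries(rows, bucket_key):
    # bucket value -> [row_count, sum of row fingerprints modulo 2**64]
    summary = defaultdict(lambda: [0, 0])
    for row in rows:
        entry = summary[row[bucket_key]]
        entry[0] += 1
        entry[1] = (entry[1] + row_fingerprint(row)) % 2**64
    return summary

def mismatched_buckets(old_rows, new_rows, bucket_key):
    # Buckets whose count or checksum differs, or that exist on one side only.
    old = bucket_summaries(old_rows, bucket_key)
    new = bucket_summaries(new_rows, bucket_key)
    return sorted(b for b in old.keys() | new.keys() if old.get(b) != new.get(b))

# Hypothetical sample data: the second day was backfilled with a wrong amount.
old_table = [
    {"id": 1, "day": "2024-01-01", "amount": 10},
    {"id": 2, "day": "2024-01-02", "amount": 5},
]
shadow_table = [
    {"day": "2024-01-01", "id": 1, "amount": 10},
    {"id": 2, "day": "2024-01-02", "amount": 6},
]
print(mismatched_buckets(old_table, shadow_table, "day"))
```

Summing fingerprints rather than XOR-ing them keeps duplicate rows from cancelling each other out, and only the mismatched buckets need a more expensive row-level diff.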

Medium: Technical

A Spark ETL job that used to finish in 1 hour now takes 10 hours after a recent code change. Walk through a structured approach to diagnose the regression. Specify which Spark UI tabs and metrics you would inspect (stages, tasks, shuffle read/write, memory and disk spill, GC time), which executor and OS-level metrics to check, and what targeted experiments you would run to isolate the change (for example, comparing DAG plans, running on smaller subsets, checking data cardinalities).
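
A first pass at the comparison can be as simple as ranking stages by added task time and checking each stage for skew. The sketch below assumes per-task durations have already been exported from the Spark UI or event logs into plain dictionaries; the stage names, the numbers, and both helper functions are hypothetical.

```python
from statistics import median

def skew_ratio(task_durations):
    # Slowest task relative to the median task; values far above 1 suggest skew.
    # Assumes a non-empty list of positive durations.
    return max(task_durations) / median(task_durations)

def rank_regressions(baseline, current):
    # baseline/current: stage name -> list of task durations in seconds.
    # Returns (stage, added_task_seconds, skew_ratio) sorted by added time.
    rows = []
    for stage, tasks in current.items():
        before = sum(baseline.get(stage, []))
        after = sum(tasks)
        rows.append((stage, after - before, round(skew_ratio(tasks), 1)))
    return sorted(rows, key=lambda r: r[1], reverse=True)

# Hypothetical numbers: one straggler task dominates the join stage.
baseline_run = {"scan": [10, 10, 10], "join": [20, 20, 20]}
current_run = {"scan": [10, 11, 10], "join": [20, 20, 600]}
for stage, added, skew in rank_regressions(baseline_run, current_run):
    print(stage, added, skew)
```

A stage with a large added time and a high skew ratio points toward a hot key or a changed join strategy, while uniformly slower tasks point toward more data, more spill, or GC pressure.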

Hard: Technical

Given this PySpark code that performs a join and then collects results:

left = spark.read.parquet('s3://data/large_left')
right = spark.read.parquet('s3://data/small_right')
joined = left.join(right, on='user_id')
result = joined.groupBy('country').agg({'amount':'sum'}).collect()

Assume 'small_right' was mislabeled and is actually 100M rows, which causes executor OOMs. Explain why this triggers memory issues and provide a rewritten approach to compute the aggregation safely, considering broadcast hints, repartitioning, map-side aggregation, salting for skew, and avoiding collect(). Provide code or pseudocode and justify your changes.
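
One ingredient of a safer rewrite is to shrink the large side before the join by aggregating it per join key. The sketch below checks that logic in plain Python rather than Spark, and it assumes that `amount` lives in the left dataset and `country` in the right one, which the question does not state. The function names and sample rows are hypothetical. In PySpark the same shape would be a `groupBy('user_id')` aggregation on `left` before the `join`, with the final result written out instead of collected.

```python
from collections import defaultdict

def join_then_aggregate(left, right):
    # Mirrors the original shape: join every matching pair, then sum by country.
    totals = defaultdict(int)
    for user_id, amount in left:
        for right_user_id, country in right:
            if right_user_id == user_id:
                totals[country] += amount
    return dict(totals)

def aggregate_then_join(left, right):
    # Step 1: collapse the large side to one row per user_id.
    per_user = defaultdict(int)
    for user_id, amount in left:
        per_user[user_id] += amount
    # Step 2: join the reduced rows and sum by country.
    totals = defaultdict(int)
    for user_id, country in right:
        if user_id in per_user:
            totals[country] += per_user[user_id]
    return dict(totals)

# Hypothetical rows: (user_id, amount) and (user_id, country).
left_rows = [(1, 10), (1, 5), (2, 7), (3, 1)]
right_rows = [(1, "US"), (2, "DE"), (4, "FR")]
print(aggregate_then_join(left_rows, right_rows))
```

Both functions return the same totals, but the second one joins at most one row per user from the large side, which is what reduces shuffle volume. It does not remove key skew on the right side, so salting may still be needed there.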

Hard: Technical

You detect silent data corruption (bit flips) in Parquet files stored in cloud object storage. Describe a plan to detect and verify corruption across datasets, repair corrupted data, and prevent recurrence. Include the role of checksums, file footers, cross-region replication checks, versioned objects, reprocessing from source, and the trade-offs between storage/compute cost and data integrity.
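
A detection pass could combine a cheap structural check with a content checksum compared against a manifest recorded at write time. The sketch below works on local file paths for simplicity; the manifest format and function names are hypothetical, and with real object storage the checksum would more likely come from the store's own object metadata. The only Parquet fact it relies on is that a valid file starts and ends with the 4-byte magic `PAR1`.

```python
import hashlib
import os

PARQUET_MAGIC = b"PAR1"

def sha256_of(path, chunk_size=1 << 20):
    # Stream the file so large objects do not need to fit in memory.
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def has_parquet_magic(path):
    # Smallest possible file: magic + 4-byte footer length + magic = 12 bytes.
    if os.path.getsize(path) < 12:
        return False
    with open(path, "rb") as fh:
        head = fh.read(4)
        fh.seek(-4, os.SEEK_END)
        tail = fh.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC

def find_corrupt(manifest):
    # manifest: path -> SHA-256 hex digest recorded when the file was written.
    return sorted(
        path
        for path, expected in manifest.items()
        if not has_parquet_magic(path) or sha256_of(path) != expected
    )
```

The magic-byte check only catches truncation or damage at the file edges; a bit flip inside a data page is caught by the checksum comparison, which is why both are applied.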

Medium: Technical

A scheduled Airflow task fails with ModuleNotFoundError: No module named 'analytics.transforms' even though the DAG parses. Upstream tasks succeeded. Describe how you would debug packaging and dependency issues in Airflow: inspect the worker Python environment, container image tags, volume mounts, requirements files, and DAG import behavior. Explain fixes in a KubernetesExecutor environment and describe how to ensure a consistent runtime across all workers.
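
A small probe run inside the failing worker (and again inside the scheduler) makes the environment difference visible. The sketch below uses only the standard library; `import_report` is a hypothetical helper, not an Airflow API, and how it gets executed on the worker (a throwaway task, `kubectl exec`, and so on) is left open.

```python
import importlib.util
import sys

def import_report(module_name):
    # Describe whether this interpreter can locate the module, and from where.
    try:
        spec = importlib.util.find_spec(module_name)
    except ModuleNotFoundError:
        # Raised when a parent package (e.g. 'analytics') is itself missing.
        spec = None
    return {
        "python": sys.executable,
        "found": spec is not None,
        "origin": getattr(spec, "origin", None),
        "sys_path": list(sys.path),
    }

if __name__ == "__main__":
    report = import_report("analytics.transforms")
    for key, value in report.items():
        print(key, "=", value)
```

Comparing the `python`, `origin`, and `sys_path` values between the scheduler and the worker pod usually shows whether the cause is a different image tag, a missing volume mount, or a package that was never installed in the worker image.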
