InterviewStack.io LogoInterviewStack.io

System Monitoring and Performance Tuning Questions

Operational monitoring and continuous tuning of system and infrastructure resources to maintain performance and reliability. Topics include key system health and performance metrics such as central processing unit usage memory utilization disk input output and latency network bandwidth process counts system load latency and throughput and queries per second, establishing baselines and normal ranges, anomaly detection and root cause triage, instrumentation and metric collection for system health, reading monitoring dashboards and recognizing common failure patterns, interpreting system logs and using diagnostic commands and tools, setting alert thresholds and prioritization and escalation pathways, capacity planning and remediation steps, resource tuning to remove bottlenecks, and knowing when to escalate to deeper engineering investigation. Candidates should be able to connect observed symptoms to likely causes describe basic troubleshooting workflows and propose mitigation and prevention measures.

HardTechnical
85 practiced
Network troubleshooting: you see 1% packet loss and excessive TCP retransmits between datacenters. Describe a set of tests and captures you would perform (including tcpdump, iperf, mtr, and host-side counters), exactly what signatures you would look for in packet captures, and how you would determine if the problem is congestion, faulty hardware, or a misconfigured load balancer.
EasyTechnical
63 practiced
Explain the difference between paging and swapping on modern operating systems, and discuss how each affects performance. Describe two metrics or commands you would use to detect excessive paging or swapping on Linux and what remediation steps you might take.
MediumTechnical
47 practiced
Technical-coding: Write a simple Python 3 script that reads /proc/stat and /proc/meminfo, computes instant CPU utilization percentage and memory used percentage, and prints two Prometheus-style metric lines named node_cpu_percent and node_memory_percent with labels 'host' set to the machine hostname. Use only the Python standard library.
HardTechnical
55 practiced
Technical-coding/runbook: Provide a safe, idempotent bash script template (pseudocode acceptable) that an on-call engineer can run to triage a high disk I/O incident. The script should perform read-only checks: gather iostat/xstat, identify top io-consuming processes, capture recent dmesg lines, and output a prioritized short report of findings. The script must avoid writing to disks and must include comments describing each step.
EasyBehavioral
61 practiced
Behavioral: Tell me about a time you were on-call and resolved a performance incident. Use the STAR format: briefly describe the Situation, your Task, the Actions you took to triage and resolve the incident, and the Results including how you prevented recurrence. Focus on technical decisions and communication with stakeholders.

Unlock Full Question Bank

Get access to hundreds of System Monitoring and Performance Tuning interview questions and detailed answers.

Sign in to Continue

Join thousands of developers preparing for their dream job.