InterviewStack.io

Monitoring, Logging, and Operational Visibility Questions

Running systems need constant visibility. Know the basic monitoring concepts: metrics (numerical measurements such as CPU usage, memory usage, and request count), logs (detailed event records), and alerts (notifications sent when issues occur). Know the main cloud monitoring tools: CloudWatch (AWS), Azure Monitor (Azure), and Cloud Operations, formerly Stackdriver (GCP). Understand what should be monitored: application health (uptime, error rates), infrastructure health (CPU, memory, disk), and security events (access logs, permission denials). Know that proper monitoring shortens the time needed to detect and troubleshoot issues. Be familiar with dashboard creation (visualizing metrics) and alert configuration (notifying on problems). Understand log aggregation: collecting logs from multiple sources into one place for centralized analysis.
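As a rough illustration of how metrics, logs, and alerts relate, here is a minimal Python sketch. The function names and the 5% threshold are assumptions made for this example and do not come from any particular monitoring tool.

```python
import json

# Hypothetical threshold for this example: alert when more than 5% of requests fail.
ERROR_RATE_THRESHOLD = 0.05


def structured_log(level, message, **fields):
    """Build one JSON log line (a detailed event record)."""
    return json.dumps({"level": level, "message": message, **fields}, sort_keys=True)


def error_rate(request_count, error_count):
    """Metric derived from two counts; 0.0 when there is no traffic."""
    if request_count == 0:
        return 0.0
    return error_count / request_count


def should_alert(request_count, error_count, threshold=ERROR_RATE_THRESHOLD):
    """Alert rule: fire when the error rate exceeds the threshold."""
    return error_rate(request_count, error_count) > threshold
```

In a real system the counts would come from a metrics backend and the log line would be shipped to an aggregator; the sketch only shows the division of roles.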

Easy · Technical
72 practiced
Describe a centralized log aggregation pipeline for microservices that collects application JSON logs, node logs, and Kubernetes events. Include recommended agents (Fluentd/Fluent Bit/Logstash), buffering and backpressure handling, indexing basics, and retention strategies.
Medium · Technical
59 practiced
You receive many duplicate and low-importance alerts each week. Propose concrete strategies to reduce alert fatigue: include alert grouping and deduplication, symptom-first alerting, suppression during deploys, dynamic thresholds, and on-call ergonomics with examples and trade-offs.
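One possible starting point for the grouping and deduplication part of an answer is sketched below in Python. The alert shape (`service`, `name`, `ts` fields) and the `group_alerts` helper are hypothetical, and a production system would typically also apply a time window, which is omitted here.

```python
from collections import defaultdict


def group_alerts(alerts, keys=("service", "name")):
    """Collapse alerts that share the same grouping key into one summary each.

    Each alert is assumed to be a dict with the grouping fields plus a
    numeric "ts" timestamp (hypothetical shape for this sketch).
    """
    groups = defaultdict(list)
    for alert in alerts:
        groups[tuple(alert[k] for k in keys)].append(alert)
    return [
        {
            "key": key,
            "count": len(items),
            "first_seen": min(a["ts"] for a in items),
        }
        for key, items in groups.items()
    ]
```

Paging once per group instead of once per alert is the trade-off to discuss: it cuts noise, but too coarse a key can hide a second, unrelated failure in the same service.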
Easy · Technical
70 practiced
Describe the main metric types used in modern monitoring systems: counter, gauge, histogram, and summary. For each type give a real SRE example of when to use it, discuss aggregation semantics and resetting behavior, and mention pitfalls to avoid.
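To make the resetting behavior of counters concrete, the Python sketch below computes the total increase of a counter across a series of samples, treating any drop as a restart from zero. This is similar in spirit to how Prometheus-style systems handle counter resets, but it is a simplified assumption, not the exact algorithm of any tool; `counter_increase` is a hypothetical helper.

```python
def counter_increase(samples):
    """Total increase of a monotonically increasing counter across samples.

    A drop between consecutive samples is treated as a reset to zero
    (for example, a process restart), so the new value counts in full.
    """
    total = 0.0
    for prev, curr in zip(samples, samples[1:]):
        if curr >= prev:
            total += curr - prev
        else:
            total += curr  # reset: counter restarted from 0 and reached curr
    return total
```

A gauge needs no such handling, because its value is allowed to go down; applying reset logic to a gauge is one of the pitfalls worth mentioning in an answer.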
Hard · Technical
61 practiced
As a platform SRE, define a policy for allocating and enforcing shared error budgets across 50 microservices owned by different teams. Include how you'd measure burn rate, automate release gating when a budget is exhausted, handle dispute resolution, and create incentives for reliability improvements.
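For the burn-rate and release-gating parts of an answer, a minimal Python sketch might look like the following. It uses the common definition of burn rate as the observed error ratio divided by the error budget (1 minus the SLO target); the function names and the gating rule are assumptions for illustration.

```python
def burn_rate(errors, total, slo):
    """Observed error ratio divided by the allowed error ratio (1 - slo).

    A value of 1.0 means the budget is being spent exactly at the pace
    that would exhaust it at the end of the SLO window.
    """
    if total == 0:
        return 0.0
    return (errors / total) / (1.0 - slo)


def budget_remaining(errors, total, slo):
    """Fraction of the error budget left for the requests seen so far."""
    allowed_errors = total * (1.0 - slo)
    if allowed_errors == 0:
        return 0.0
    return 1.0 - errors / allowed_errors


def release_allowed(errors, total, slo):
    """Hypothetical gate: block releases once the budget is exhausted."""
    return budget_remaining(errors, total, slo) > 0.0
```

With a 99.9% SLO, 10 errors in 1,000 requests gives a burn rate of about 10, meaning the budget is being spent roughly ten times faster than the sustainable pace.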
Easy · Technical
58 practiced
Define an 'alert' in an SRE context and explain how alerts differ from incidents and notifications. Describe a simple severity mapping (P0..P3) and what on-call actions each severity should trigger.
