Site Reliability Engineering Fundamentals Questions

Covers the foundational site reliability engineering concepts that interviewers expect every candidate to understand. Topics include: Service Level Objectives (SLOs) and Service Level Indicators (SLIs), and how they relate to availability targets and measurable system health; error budgets and the trade-off between velocity and reliability; incident management, including detection, escalation, on-call rotations, and blameless postmortems; monitoring and observability for alerting and root cause analysis; basic deployment and rollback strategies; and an automation mindset for reducing toil. Candidates should be able to explain these ideas at a conceptual level, discuss how they influence decision-making, and reference common practices used to improve reliability.

Hard | Technical

You lead an SRE team. Product wants to increase deployment frequency, but the service frequently exhausts its error budget. Propose a balanced policy and process that preserves product velocity while improving reliability. Include short-term mitigations, medium-term engineering investments, and organizational changes.
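An answer to this question usually rests on error budget arithmetic, so a small sketch may help make the policy concrete. The block below is only an illustration under stated assumptions: the request-based SLI, the `ErrorBudget` and `release_policy` names, the three tiers, and the 75% caution threshold are all hypothetical choices, not an established standard.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    slo_target: float      # e.g. 0.999 for a 99.9% success-rate SLO
    total_requests: int    # requests served so far in the SLO window
    failed_requests: int   # requests that violated the SLI

    @property
    def allowed_failures(self) -> float:
        # The budget is the share of requests the SLO permits to fail.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def consumed_fraction(self) -> float:
        # Fraction of the budget already spent (can exceed 1.0).
        if self.allowed_failures == 0:
            return float("inf") if self.failed_requests else 0.0
        return self.failed_requests / self.allowed_failures


def release_policy(budget: ErrorBudget, caution_at: float = 0.75) -> str:
    """Map budget consumption to a hypothetical three-tier release policy."""
    consumed = budget.consumed_fraction
    if consumed >= 1.0:
        return "freeze"       # only reliability fixes ship
    if consumed >= caution_at:
        return "restricted"   # e.g. smaller batches, mandatory canary
    return "normal"           # ship at full velocity


if __name__ == "__main__":
    budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=800)
    print(f"budget consumed: {budget.consumed_fraction:.0%}")
    print(f"policy tier: {release_policy(budget)}")
```

Under these assumptions, a service with a 99.9% SLO that has failed 800 of 1,000,000 requests has spent roughly 80% of its budget and lands in the restricted tier. Tying the tiers to a number both teams agree on in advance is what lets the policy gate velocity without a case-by-case negotiation.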

Hard | System Design

Design an automated canary analysis pipeline. Describe components: deployment orchestration, traffic split, metric collection, statistical tests (e.g., t-test, Mann-Whitney, Bayesian), decision logic for pass/fail, and automated rollback. Explain how you avoid false positives in noisy metrics.
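The statistical test and decision logic named in the question can be sketched in a few lines. The following is a minimal illustration, assuming latency samples (in milliseconds) have already been collected for baseline and canary; it hand-rolls a two-sided Mann-Whitney U test with the normal approximation and tie correction so that it needs only the standard library. The function names, the `alpha` level, and the 5% minimum regression are hypothetical parameters chosen for the example.

```python
import math
from statistics import NormalDist, median


def mann_whitney_u(baseline, canary):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected).

    Returns (U statistic for the canary sample, p-value).
    Assumes both samples are non-empty.
    """
    n1, n2 = len(baseline), len(canary)
    combined = sorted([(v, 0) for v in baseline] + [(v, 1) for v in canary])
    ranks = [0.0] * len(combined)
    tie_term = 0.0
    i = 0
    while i < len(combined):
        j = i
        # Extend j over the run of values tied with combined[i].
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based; ties share the mean rank
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1
    rank_sum_canary = sum(r for r, (_, group) in zip(ranks, combined) if group == 1)
    u_canary = rank_sum_canary - n2 * (n2 + 1) / 2
    n = n1 + n2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u_canary, 1.0  # all values identical: no evidence of a difference
    z = (u_canary - mean_u) / math.sqrt(var_u)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return u_canary, p_value


def canary_verdict(baseline, canary, alpha=0.01, min_relative_regression=0.05):
    """Fail only if the shift is both statistically and practically significant."""
    _, p_value = mann_whitney_u(baseline, canary)
    base_median = median(baseline)
    regression = (median(canary) - base_median) / base_median if base_median else 0.0
    if p_value < alpha and regression > min_relative_regression:
        return "fail"
    return "pass"


if __name__ == "__main__":
    baseline = [100 + (i % 7) for i in range(50)]
    slow_canary = [150 + (i % 7) for i in range(50)]
    print(canary_verdict(baseline, slow_canary))
```

Requiring a minimum effect size in addition to a low p-value is one common way to limit false positives on noisy metrics: with enough samples, a tiny and harmless shift can be statistically significant. A real pipeline would likely also require the failure to persist across several consecutive analysis intervals before triggering a rollback.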

Medium | Technical

Draft a concise post-incident review template that includes timeline, impact, root cause analysis, contributing factors, actions, owner, and verification steps. Explain which parts must be completed within 24 hours and which can be deferred, and why.

Medium | Technical

Write a Prometheus alert rule (PromQL) that fires when a service's 5-minute error rate (errors/total) exceeds 1%, with the result grouped by service and environment. Also describe a recording rule that simplifies the alerting query.
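One possible shape for the two rules is sketched below as a Python structure that mirrors the layout of a Prometheus rule file (`groups`, `rules`, `record`, `alert`, `expr`, `for`, `labels`, `annotations`). The metric name `http_requests_total`, the `status` label with 5xx codes counted as errors, the recording rule name, the alert name, and the `for` duration are all assumptions; the real names depend on what the service exports.

```python
import json

# Hypothetical recording rule name, loosely following level:metric:operations.
RECORD_NAME = "service_environment:http_requests_errors:ratio_rate5m"

# Assumed metric and labels: http_requests_total with a `status` label.
ERROR_RATIO_EXPR = (
    'sum by (service, environment) (rate(http_requests_total{status=~"5.."}[5m]))'
    " / "
    "sum by (service, environment) (rate(http_requests_total[5m]))"
)


def build_rule_file(threshold=0.01, hold="5m"):
    """Return a dict shaped like a Prometheus rule file."""
    return {
        "groups": [
            {
                "name": "error-rate",
                "rules": [
                    # Recording rule: precompute the per-service error ratio.
                    {"record": RECORD_NAME, "expr": ERROR_RATIO_EXPR},
                    # Alert rule: compare the recorded series to the threshold.
                    {
                        "alert": "HighErrorRate",
                        "expr": f"{RECORD_NAME} > {threshold}",
                        "for": hold,
                        "labels": {"severity": "page"},
                        "annotations": {
                            "summary": (
                                "Error ratio above threshold for "
                                "{{ $labels.service }} in {{ $labels.environment }}"
                            ),
                        },
                    },
                ],
            }
        ]
    }


if __name__ == "__main__":
    print(json.dumps(build_rule_file(), indent=2))
```

Because both sides of the division aggregate by the same labels, the resulting series keeps `service` and `environment`, which is what lets the alert fire per service and environment. The printed JSON would still need converting to the YAML layout a rule file uses and validating, for example with `promtool check rules`, before use.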

Hard | Technical

Given this simplified trace snippet (JSON-like), identify the probable bottleneck and describe follow-up instrumentation you would add:

{ "trace_id": "t1", "spans": [ {"id":"s1","service":"frontend","duration_ms":50}, {"id":"s2","service":"auth","duration_ms":10}, {"id":"s3","service":"search","duration_ms":800}, {"id":"s4","service":"db","duration_ms":100} ]}

Explain how sampling or aggregation might hide this issue and how to improve trace fidelity.
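As a starting point for the analysis, the spans in the snippet can be ranked by duration. This is a minimal sketch with a hypothetical helper name (`rank_spans`). Because the snippet carries no parent IDs or timestamps, the code treats each span's duration as independent and reports its share of the summed durations, which is not necessarily its share of wall-clock time.

```python
import json

TRACE_JSON = """
{ "trace_id": "t1", "spans": [
  {"id":"s1","service":"frontend","duration_ms":50},
  {"id":"s2","service":"auth","duration_ms":10},
  {"id":"s3","service":"search","duration_ms":800},
  {"id":"s4","service":"db","duration_ms":100}
]}
"""


def rank_spans(trace):
    """Return (service, duration_ms, share_of_total) tuples, slowest first."""
    total = sum(span["duration_ms"] for span in trace["spans"])
    ranked = sorted(trace["spans"], key=lambda s: s["duration_ms"], reverse=True)
    return [(s["service"], s["duration_ms"], s["duration_ms"] / total) for s in ranked]


if __name__ == "__main__":
    trace = json.loads(TRACE_JSON)
    for service, duration_ms, share in rank_spans(trace):
        print(f"{service:10s} {duration_ms:5d} ms  {share:6.1%}")
```

The `search` span dominates at 800 ms of the 960 ms total. Whether that time is spent inside `search` itself or waiting on a downstream call such as `db` cannot be determined without parent-child links, which is one reason to add span parent IDs, start timestamps, and finer-grained child spans inside `search` as follow-up instrumentation.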
