Meta Site Reliability Engineer (Junior Level) Interview Preparation Guide

Site Reliability Engineer (SRE)

Interview Rounds

Recruiter Screening

20 min | 3 focus topics | culture fit

What to Expect

Initial screening call with a Meta recruiter to assess fit, understand your background, verify interest in the SRE role, and provide details about the position. This is a conversational round focused on your career trajectory, motivation for joining Meta, and logistical details about the interview process. The recruiter will also answer your questions about the role and team. This is your opportunity to demonstrate enthusiasm for reliability engineering and your understanding of what the role entails.

Tips & Advice

Be authentic about your interest in SRE. Prepare 2-3 questions about the role, team structure, and current challenges they're facing with system reliability. Research Meta's infrastructure and mention relevant products or technical achievements that excite you. Have your elevator pitch ready: why you're transitioning to/pursuing SRE, what aspects excite you most, and why Meta specifically. Keep answers concise—this is not a deep technical round.

Focus Topics

Understanding of SRE Role at Meta

Demonstrate awareness of what Meta's SRE team does, the scale of problems they solve, and how SRE contributes to Meta's mission. Show you've researched Meta's engineering culture and reliability challenges.

Professional Background and Relevant Experience

Summarize your work experience, focusing on projects and roles that demonstrate systems thinking, operational work, or infrastructure contributions. Highlight any experience with automation, monitoring, incident response, or supporting production systems.

Career Motivation and SRE Interest

Articulate why you're interested in Site Reliability Engineering as a career path, what aspects of the role appeal to you, and why you believe Meta is the right next step. Discuss specific experiences that sparked your interest in reliability, scalability, or operations work.

Technical Phone Screen 1: Fundamentals and Tools

45 min | 4 focus topics | technical

What to Expect

First technical screening call with an SRE engineer from Meta. This round evaluates your foundational knowledge of SRE concepts, familiarity with monitoring and observability tools, and understanding of basic operational practices. Expect questions about how monitoring systems work, what metrics matter for reliability, and your hands-on experience with infrastructure tools. This is not a deep coding round but may involve discussing shell scripting or automation approaches at a high level.

Tips & Advice

Focus on explaining concepts clearly using real examples from your past work. When discussing tools, be specific about your hands-on experience: which tools have you used, in what context, and what problems did they help you solve? For junior-level, demonstrate practical understanding rather than theoretical perfection. If asked about a concept you're not familiar with, acknowledge it honestly and explain your approach to learning new tools. Prepare a concise explanation of a past incident you dealt with and how you debugged it.

Focus Topics

Automation and Infrastructure Tools

Discuss your experience with automation: shell scripting, configuration management tools (Ansible, Chef), infrastructure-as-code tools (Terraform), or CI/CD tools. Explain a repetitive operational task you've automated and the impact it had. For junior level, focus on practical examples rather than advanced optimization.

Monitoring and Alerting Fundamentals

Understand the core principles of monitoring: which metrics matter (the four golden signals of latency, traffic, errors, and saturation), types of alerts (threshold-based, anomaly detection, composite alerts), and how to avoid alert fatigue. Be prepared to discuss tools like Prometheus, Grafana, Datadog, or New Relic. Explain the difference between monitoring and observability.

Basic System Observability and Debugging

Explain the three pillars of observability: metrics, logs, and traces. Discuss tools you've used for troubleshooting (command-line tools, log aggregation, APM tools). Describe your approach to debugging a performance issue or production incident: how you'd gather data, form hypotheses, and narrow down root causes.

SLOs, SLIs, and Error Budgets

Define Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Service Level Agreements (SLAs). Explain how error budgets work and how they guide decision-making about reliability improvements versus new feature development. Provide an example of how you'd define an SLO for a service.
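
The error-budget arithmetic can be sketched in a few lines. This is a minimal illustration that assumes a request-based availability SLI; the function names and the sample numbers are hypothetical, not taken from any Meta tooling.

```python
def error_budget(slo_target: float, total_requests: int) -> float:
    """Number of requests allowed to fail while still meeting the SLO."""
    return (1.0 - slo_target) * total_requests


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative means the SLO is breached)."""
    budget = error_budget(slo_target, total_requests)
    if budget == 0:
        raise ValueError("a 100% SLO leaves no error budget")
    return 1.0 - failed_requests / budget


if __name__ == "__main__":
    # Hypothetical month: 99.9% availability SLO, 10M requests, 4,000 failures.
    print(round(error_budget(0.999, 10_000_000)))
    print(round(budget_remaining(0.999, 10_000_000, 4_000), 2))
```

When the remaining fraction approaches zero, the usual argument is to slow feature releases and spend time on reliability work instead.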

Technical Phone Screen 2: Incident Response and Problem-Solving

45 min | 4 focus topics | technical

What to Expect

Second technical screening call typically with a different SRE engineer. This round assesses your incident response capabilities, troubleshooting mindset, and how you approach production problems. Expect scenario-based questions like 'How would you debug a slow database?' or 'What would you do if a service started returning 500 errors?' You may be asked to walk through a past incident, discuss how you'd set up monitoring for a given scenario, or explain your approach to capacity planning.

Tips & Advice

Prepare 2-3 detailed incident stories from your past work using the STAR method. Focus on what you learned and how you improved the system afterward, not just on fixing the immediate problem. When given a scenario, think out loud: ask clarifying questions, explain your debugging approach step-by-step, and discuss trade-offs. For junior level, demonstrating a logical troubleshooting process is more important than having the perfect answer. Show you understand that incident response involves collaboration and communication, not just technical problem-solving.

Focus Topics

On-Call Responsibilities and Toil Management

Discuss your understanding of on-call rotations, escalation procedures, and runbooks. Explain how you'd balance responding to incidents with reducing toil through automation. Discuss the challenge of context-switching and how you'd minimize alert fatigue.

Performance Optimization and Capacity Planning

Discuss how you identify performance bottlenecks. Explain concepts like resource utilization, headroom, and when to scale. Describe how you've optimized a system for performance: database queries, caching, infrastructure scaling, etc. Discuss the trade-offs between performance and cost.
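
Headroom and "when to scale" can be made concrete with simple arithmetic. The sketch below assumes load is measured in requests per second and that each instance has a known sustainable throughput; both function names and all numbers are hypothetical.

```python
import math


def headroom(capacity: float, peak_load: float) -> float:
    """Fraction of total capacity still free at peak load."""
    return (capacity - peak_load) / capacity


def instances_needed(peak_rps: float, rps_per_instance: float, target_utilization: float) -> int:
    """Instances required so peak load stays at or below the target utilization."""
    return math.ceil(peak_rps / (rps_per_instance * target_utilization))


if __name__ == "__main__":
    # Hypothetical service: 1,000 rps of capacity, 700 rps at peak.
    print(headroom(1000, 700))
    # Hypothetical plan: 4,500 rps peak, 500 rps per instance, run at 50% utilization.
    print(instances_needed(4500, 500, 0.5))
```

A lower target utilization buys more headroom for spikes and failover at the price of more idle capacity, which is the performance-versus-cost trade-off in miniature.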

Root Cause Analysis and Post-Incident Reviews

Explain how you identify root causes versus symptoms. Discuss the concept of blameless post-mortems and why they're valuable. Describe how you'd document an incident and extract learnings. Discuss preventive measures: how you'd ensure the same issue doesn't recur.

Incident Response and Troubleshooting Methodology

Understand the incident response process: detection, triage, mitigation, resolution, and post-incident review. Explain your approach to troubleshooting: gathering data, forming hypotheses, testing them, and implementing fixes. Discuss the importance of communication during incidents and how you'd coordinate with other teams.

Onsite Round 1: Technical Depth - System Reliability Concepts

60 min | 4 focus topics | technical

What to Expect

First onsite round focused on deeper technical understanding of system reliability, distributed systems basics, and architectural concepts. The interviewer will ask questions about how systems fail, redundancy, consistency models, and how to design for reliability. You may be asked to discuss a system you've worked with, identify potential failure modes, and explain how you'd mitigate them. This round bridges toward system design thinking but remains grounded in reliability principles rather than full system design.

Tips & Advice

Come prepared with a real system you know well—explain its architecture, dependencies, and potential failure points. For a junior-level candidate, focus on demonstrating understanding of reliability principles (redundancy, failover, circuit breakers, graceful degradation) and applying them to real systems. Don't worry about perfect system design patterns; focus on explaining your thinking clearly and showing how reliability concerns influence architectural decisions. Ask clarifying questions if the interviewer introduces a hypothetical system.

Focus Topics

Scalability and Resource Management

Discuss vertical versus horizontal scaling. Explain how you'd identify scalability bottlenecks and plan for growth. Discuss container orchestration basics (Kubernetes concepts like pods, services, deployments) and how they support reliability and scalability. Explain resource limits, autoscaling policies, and capacity planning.
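
For autoscaling policies, it helps to know the core calculation the Kubernetes Horizontal Pod Autoscaler documents: desired replicas = ceil(current replicas × current metric / target metric). The sketch below reproduces only that formula plus min/max clamping; the real controller also applies a tolerance band and stabilization windows, which are omitted here, and the function name is an illustrative choice.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
) -> int:
    """Scale replica count in proportion to how far the metric is from its target."""
    raw = math.ceil(current_replicas * (current_metric / target_metric))
    # Clamp to the configured bounds, as an autoscaling policy would.
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # Hypothetical: 4 pods at 90% average CPU with a 60% target.
    print(desired_replicas(4, 90, 60))
```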

Disaster Recovery and Business Continuity

Explain disaster recovery concepts: Recovery Time Objective (RTO) and Recovery Point Objective (RPO). Discuss backup strategies, multi-region redundancy, and data replication. Explain how you'd test recovery procedures. Discuss the trade-offs between disaster recovery investment and risk tolerance.
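
One way to reason about RPO and RTO is to compare them against what the backup setup can actually deliver. The sketch below assumes periodic backups, where the worst-case data loss equals one full backup interval, and a restore time measured in a recovery drill; the function name and sample values are hypothetical.

```python
from datetime import timedelta


def meets_objectives(
    backup_interval: timedelta,
    measured_restore_time: timedelta,
    rpo: timedelta,
    rto: timedelta,
) -> bool:
    """True if worst-case data loss fits the RPO and the drilled restore time fits the RTO."""
    # A failure just before the next backup loses up to one full interval of data.
    worst_case_data_loss = backup_interval
    return worst_case_data_loss <= rpo and measured_restore_time <= rto


if __name__ == "__main__":
    # Hypothetical: hourly backups, 45-minute restore drill, RPO 4h, RTO 1h.
    print(meets_objectives(
        timedelta(hours=1), timedelta(minutes=45),
        rpo=timedelta(hours=4), rto=timedelta(hours=1),
    ))
```

The restore time is an input rather than a constant because it is only trustworthy if it comes from a tested recovery procedure.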

Deployment and Rollback Strategies

Discuss deployment approaches: blue-green deployments, canary releases, rolling updates. Explain how you'd manage rollbacks and detect bad deployments. Discuss monitoring during deployments and the role of SREs in ensuring safe, reliable deployments. Explain trade-offs between deployment speed and safety.

Distributed Systems and Failure Modes

Understand common failure modes in distributed systems: network partitions, cascading failures, data consistency issues, resource exhaustion. Discuss how you'd design systems to tolerate these failures: redundancy, isolation, timeouts, circuit breakers, bulkheads. Explain concepts like eventual consistency, quorum-based replication, and health checks.
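
A circuit breaker is easier to explain with a small example in hand. The class below is a deliberately minimal sketch, not a production library: the class name, thresholds, and the injectable `clock` parameter are all illustrative assumptions.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: fail fast instead of piling load on a sick dependency.
                raise CircuitOpenError("dependency marked unhealthy; failing fast")
            # Timeout elapsed: half-open, so let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Failing fast while the circuit is open is what stops one slow dependency from exhausting threads or connections upstream, which is how cascading failures typically start.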

Onsite Round 2: Observability and Monitoring Architecture

60 min | 4 focus topics | technical

What to Expect

This round focuses on observability, monitoring system design, and instrumentation strategy. The interviewer will discuss how to design monitoring and alerting for a service, what metrics and logs to collect, how to structure dashboards for different audiences, and how to detect problems early. You may be asked to design monitoring for a hypothetical service or discuss how you'd improve monitoring for a system you've worked with. This demonstrates practical understanding of observability as a reliability tool.

Tips & Advice

Come prepared with specific examples of monitoring you've implemented or improved. Discuss what you've learned about which metrics matter most and which alerts actually fire meaningfully versus generating noise. For junior level, focus on practical monitoring decisions: why you chose certain metrics, how you structured alerts, and how you collaborated with dev teams on instrumentation. If you haven't worked extensively with monitoring, discuss your approach to learning and implementing a monitoring system from scratch.

Focus Topics

Dashboard Design and Observability for Different Audiences

Discuss how to design dashboards for different purposes: on-call engineer dashboards for quick triage, business dashboards for stakeholders, SLO dashboards for tracking reliability. Explain how you'd structure dashboards for usability. Discuss the balance between too much information and not enough.

Monitoring Tools and Infrastructure

Discuss experience with monitoring stacks like Prometheus/Grafana, Datadog, New Relic, or others. Explain architectures for collecting metrics at scale. Discuss time-series databases and their characteristics. Explain how you'd set up monitoring for containers and microservices. Discuss monitoring in cloud environments.

Metrics, Logs, and Traces Strategy

Explain the three pillars of observability and their purposes. Discuss which metrics are most important for reliability (error rate, latency, saturation, availability). Explain how to structure logs for searchability and debugging. Discuss distributed tracing and why it matters in microservices environments. Explain trade-offs in data collection, retention, and cost.
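
Structuring logs for searchability usually means emitting one machine-parseable record per event. The sketch below uses Python's standard `logging` module with a small JSON formatter; the `request_id` field and the `checkout` logger name are hypothetical examples of a correlation field and a service name.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Hypothetical correlation field, attached via `extra=`.
            "request_id": getattr(record, "request_id", None),
        }
        return json.dumps(payload)


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("checkout")
    logger.addHandler(handler)
    logger.error("payment failed", extra={"request_id": "req-123"})
```

A shared identifier such as `request_id` is what lets a log line be joined to the matching trace and metrics during debugging.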

Alerting Strategy and Preventing Alert Fatigue

Discuss how to design meaningful alerts that catch real problems without creating noise. Explain alert thresholds, composite alerts, and anomaly detection. Discuss alert routing, escalation policies, and on-call workflows. Explain how you'd measure alert quality: did it catch something important, or was it a false positive?
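
One simple way to measure alert quality is precision: of the times an alert fired, how often did the responder actually need to act? The sketch below assumes each page is reviewed and labelled after the fact; the `AlertEvent` type and the alert names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AlertEvent:
    name: str
    actionable: bool  # did the responder need to take action?


def alert_precision(events):
    """Share of fired alerts that were actionable, keyed by alert name."""
    totals, useful = {}, {}
    for event in events:
        totals[event.name] = totals.get(event.name, 0) + 1
        useful[event.name] = useful.get(event.name, 0) + int(event.actionable)
    return {name: useful[name] / totals[name] for name in totals}


if __name__ == "__main__":
    history = [
        AlertEvent("HighErrorRate", True),
        AlertEvent("HighErrorRate", True),
        AlertEvent("DiskUsage80Percent", False),
        AlertEvent("DiskUsage80Percent", False),
        AlertEvent("DiskUsage80Percent", True),
    ]
    for name, precision in alert_precision(history).items():
        print(name, round(precision, 2))
```

Alerts with persistently low precision are candidates for retuning, demotion to a ticket, or deletion.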

Onsite Round 3: Automation and Infrastructure-as-Code

60 min | 4 focus topics | technical

What to Expect

This round focuses on automation, infrastructure-as-code, and tooling for operational efficiency. The interviewer will discuss how to automate repetitive tasks, infrastructure provisioning, configuration management, and CI/CD pipelines. You may be asked to discuss a repetitive process you've automated, explain infrastructure-as-code concepts, or design automation for a given scenario. This round tests your ability to reduce toil and scale operational work.

Tips & Advice

Prepare concrete examples of automation you've implemented: a script you wrote, a configuration management setup, or a CI/CD improvement. Explain the business value: what did this automation achieve in terms of time saved, error reduction, or reliability improvement? For junior level, focus on practical, impactful automation rather than overly complex systems. Discuss your approach to learning automation tools; most interviewers value problem-solving and learning ability as much as existing expertise. Be honest about what you haven't done, but show curiosity about these areas.

Focus Topics

Scripting and Programming for Operational Tasks

Discuss your programming or scripting experience relevant to operations: Python, Go, Bash, or others. Explain how you approach writing scripts for operational tasks. Discuss maintainability, error handling, and logging in operational code. Explain your approach to learning new languages for operations work.
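
Error handling and logging in operational code can be shown with a small retry helper. This is a sketch under simple assumptions (retry on any exception, exponential backoff with no jitter); the `retry` name and its parameters are illustrative, and the `sleep` argument is injectable so the helper can be tested without waiting.

```python
import logging
import time

log = logging.getLogger("ops")


def retry(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func, retrying on any exception with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception as exc:
            if attempt == attempts:
                # Out of attempts: log and surface the original error.
                log.error("giving up after %d attempts: %s", attempt, exc)
                raise
            delay = base_delay * 2 ** (attempt - 1)
            log.warning("attempt %d failed (%s); retrying in %.1fs", attempt, exc, delay)
            sleep(delay)
```

In a real script you would typically narrow the caught exception types and add jitter so many clients do not retry in lockstep.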

CI/CD Pipelines and Deployment Automation

Explain CI/CD concepts: continuous integration, continuous deployment/delivery. Discuss tools like Jenkins, GitHub Actions, GitLab CI, or similar. Explain how to safely automate deployments: testing, deployment gates, rollback mechanisms. Discuss the role of SREs in ensuring deployment reliability and speed.

Infrastructure-as-Code and Configuration Management

Understand principles of infrastructure-as-code: versioning infrastructure, reproducibility, idempotency. Discuss tools like Terraform, Ansible, CloudFormation, or similar. Explain how to manage infrastructure changes safely and audit who made changes. Discuss the benefits: faster provisioning, disaster recovery, consistent environments. Discuss trade-offs in complexity and learning curve.
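
Idempotency is the property interviewers most often probe here: applying the same change twice must leave the system in the same state as applying it once. The sketch below illustrates the idea on a plain text file; `ensure_line` is a hypothetical helper written for this example, not the API of any of the tools named above.

```python
from pathlib import Path


def ensure_line(path: Path, line: str) -> bool:
    """Make sure `line` is present in the file; return True only if a change was made."""
    existing = path.read_text().splitlines() if path.exists() else []
    if line in existing:
        # Desired state already holds, so do nothing.
        return False
    existing.append(line)
    path.write_text("\n".join(existing) + "\n")
    return True
```

Describing the desired end state ("this line is present") rather than an action ("append this line") is what makes repeated runs safe.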

Automation for Toil Reduction

Define toil: repetitive, manual, operational work. Discuss how you identify toil and prioritize automation efforts. Provide examples of toil you've reduced through automation. Explain the cost-benefit analysis of automation: when is it worth automating versus accepting manual work? Discuss the impact on on-call experience and team productivity.
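
The cost-benefit question can be framed as a break-even calculation. The sketch below assumes the only costs are engineering time to build the automation and time spent per run; it ignores maintenance and error reduction, and the function name and numbers are hypothetical.

```python
import math


def break_even_runs(
    build_hours: float,
    manual_minutes_per_run: float,
    automated_minutes_per_run: float = 0.0,
) -> int:
    """Number of runs after which the automation has paid back its build time."""
    saved_per_run = manual_minutes_per_run - automated_minutes_per_run
    if saved_per_run <= 0:
        raise ValueError("automation does not save time per run")
    return math.ceil(build_hours * 60 / saved_per_run)


if __name__ == "__main__":
    # Hypothetical: 16 hours to build, replaces a 20-minute manual task.
    print(break_even_runs(16, 20))
```

If the task runs daily, that break-even arrives within a couple of months; if it runs twice a year, a documented runbook may be the better investment.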

Onsite Round 4: Behavioral and Team Collaboration

45 min | 4 focus topics | behavioral

What to Expect

Final onsite round focused on behavioral assessment, communication skills, teamwork, and cultural fit with Meta. The interviewer will use behavioral questions to understand how you handle challenges, collaborate with teams, learn from failures, and approach responsibilities. Expect questions about past experiences with conflict resolution, working across teams, learning new systems, and your approach to continuous improvement. This round assesses whether you're a good team member and aligned with Meta's engineering culture.

Tips & Advice

Prepare 5-6 detailed stories using the STAR method (Situation, Task, Action, Result) that showcase collaboration, learning, problem-solving, and handling adversity. For junior-level candidates, focus on stories that demonstrate: willingness to learn, asking for help appropriately, collaborating with teammates, taking ownership within your scope, and learning from mistakes. Emphasize team success over individual accomplishment. Be authentic and honest—interview conversations should feel like natural discussion, not recited answers. Ask thoughtful questions about the team's culture, how they handle incidents, and what support junior members receive.

Focus Topics

Taking Ownership and Accountability

Provide examples of projects or responsibilities you took ownership of as a junior team member. Discuss how you ensured quality and communicated progress. Explain your approach to asking for help when needed versus trying to solve everything alone. Describe how you handle situations where you don't know the answer.

Handling Failure and Incident Response Communication

Discuss a significant incident or failure you experienced: how you handled it, what you learned, and how you prevented recurrence. Emphasize blameless post-mortem approach and psychological safety. Discuss how you communicate during stressful situations. Explain your approach to taking responsibility without making excuses.

Collaboration and Cross-Functional Teamwork

Discuss your experience working with development teams, operations teams, and other functions. Explain how you communicate technical concepts to non-technical stakeholders. Describe a situation where you resolved conflict or misalignment between teams. For SREs, emphasize partnership with developers on reliability: how you collaborate on SLOs, incident response, and improving system design for reliability.

Learning and Growth Mindset

Discuss your approach to learning new systems, tools, and domains. Provide an example of a challenging concept you learned and how you approached it. Discuss what you don't know and how you identify and fill knowledge gaps. Explain your approach to staying current with infrastructure and reliability trends.

Frequently Asked Site Reliability Engineer (SRE) Interview Questions

Alert Design and Fatigue Management | Hard | Behavioral

Tell me about a time you led a cross-organizational change to alerting standards that initially faced resistance. Describe how you aligned stakeholders, overcame objections, implemented the change, and measured adoption and impact. If you don't have a direct example, outline a detailed plan you would follow to run such a change.

Sample Answer

Situation: At my previous company the alerting landscape was chaotic — each team had its own thresholds and noisy alerts. Incidents were missed because signal was drowned in noise, and paging burnout was high. Leadership wanted a unified alerting standard tied to SLOs, but many teams resisted, worried about losing control and increasing time-to-detect.

Task: As the SRE lead for platform reliability, I needed to create cross-organizational alerting standards, get buy-in from product and engineering teams, implement them across services, and demonstrate improved incident signal-to-noise and SLO compliance.

Action:
- Built credibility first: collected baseline metrics (avg alerts/day, % actionable, MTTR) across 25 services to show the problem.
- Convened a working group of reps from SRE, dev teams, on-call, and product to co-create principles: SLO-driven alerts, severity definitions, alert ownership, and runbook linkage.
- Proposed a minimum viable standard document and an opt-in pilot with 5 services (varied criticality).
- Addressed objections by mapping impacts: for devs concerned about lost control, we added a documented exception process and retained per-service tuning within standard guardrails.
- Implemented enforcement via CI linting for alert config and dashboards that surfaced alert-to-SLO mappings; provided templates and a two-week pairing program where SREs helped teams migrate rules.
- Ran a 4-week pilot, iterated based on feedback, then rolled out in waves with training, office hours, and an FAQ.

Result:
- Within 8 weeks, alert volume dropped 48% company-wide; actionable alerts as a percentage of total rose from 22% to 66%.
- MTTR improved 35%, and the paging burnout survey score improved 1.4 points (5→6.4/7).
- All critical services had SLO-linked alerts and documented runbooks within 10 weeks.
- Teams reported faster incident triage and fewer disrupted on-call shifts.

Learnings:
- Co-creation and early pilots win trust faster than mandates.
- Combine technology (CI checks, dashboards) with human processes (pairing, exceptions).
- Measure and communicate concrete metrics frequently to sustain adoption.

Incident Leadership and Postmortems | Hard | Technical

Case study: A major ecommerce outage during peak shopping causes high revenue loss and public attention. Walk through the incident lifecycle end-to-end: detection, immediate mitigations, trade-offs you would consider that affect revenue and customer trust, communication with stakeholders, legal and compensation considerations, and how you would structure the postmortem to drive business-aligned fixes.

Sample Answer

Situation: During peak shopping (Black Friday-equivalent) our primary checkout service went down for 45 minutes causing failed transactions, degraded site performance, and major revenue loss + social media attention.

Detection:
- Multiple alerts triggered: elevated 5xx rates from the API gateway, payment gateway timeouts, failing user-journey synthetic checks, and a spike in error budget burn rate.
- I correlated signals (APM traces, metrics, logs) to identify the breakpoint: a cache eviction storm caused by a misconfigured TTL after a deployment, which cascaded into DB connection exhaustion.
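
The "error budget burn rate" signal mentioned in the detection step can be sketched as follows. Burn rate is the observed error rate divided by the error rate the SLO allows; a burn rate of 1 spends the budget exactly over the SLO period. The function names, the 30-day period, and the sample numbers are assumptions for illustration.

```python
def burn_rate(observed_error_rate: float, slo_target: float) -> float:
    """How many times faster than 'sustainable' the error budget is being spent."""
    allowed_error_rate = 1.0 - slo_target
    return observed_error_rate / allowed_error_rate


def budget_consumed(rate: float, window_hours: float, slo_period_hours: float = 720.0) -> float:
    """Fraction of the whole period's budget spent if `rate` persists for the window."""
    return rate * window_hours / slo_period_hours


if __name__ == "__main__":
    # Hypothetical: 99.9% SLO with 1.44% of requests failing right now.
    rate = burn_rate(0.0144, 0.999)
    print(round(rate, 1))
    print(round(budget_consumed(rate, window_hours=1), 3))
```

Alerting on burn rate rather than raw error count ties the page directly to how quickly the SLO is being put at risk.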

Immediate mitigations (first 0–30 minutes):
- Execute pre-approved runbook steps to reduce blast radius: roll back the problematic deployment, disable the feature flag, and redirect traffic to a healthy region via load balancer failover.
- Throttle non-essential background jobs and bulk API consumers to free DB connections.
- Open an incident channel and assign roles (incident commander, communications lead, SRE squads, dev on-call).
- Trade-offs made: prioritise availability over freshness. We served slightly stale product/pricing caches to keep checkout functional. This carried a minor price-staleness risk but prevented a total outage; we chose it because the revenue impact of downtime far outweighed the risk of a small pricing discrepancy.

Trade-offs affecting revenue & trust:
- Speed vs correctness: rollback vs hotfix. Rollback safer and faster to restore revenue; may reintroduce a prior bug but limited risk.
- Transparency vs legal exposure: early public acknowledgment builds trust but may invite scrutiny; coordinate messaging with legal/PR to balance candor and controlled details.
- Compensation vs precedent: offering blanket refunds/credits reduces immediate customer anger but sets expectation. Prefer targeted compensation for affected sessions plus public apology.

Communication with stakeholders:
- Internal: hourly executive brief with current impact metrics (transactions/minute, error rate, estimated lost revenue), mitigation steps, and ETA for restoration. Real-time updates in incident channel for engineers.
- External: within first hour publish a brief status on status page and social channels: acknowledge outage, confirm teams working on fix, and promise updates. After stabilization, publish root-cause summary and remediation plan.
- Customers: targeted email to affected customers with explanation and compensation options once we have scope.

Legal and compensation considerations:
- Loop in Legal and Finance early to assess contractual obligations (SLAs, payment processor constraints), regulatory reporting requirements (payments, data breaches), and potential chargeback risk.
- Define compensation tiers: refund for failed payments, promo credit for abandoned carts > threshold, expedited support for high-value customers. Ensure compensation is tracked and reconciled.

Postmortem structure to drive business-aligned fixes:
- Executive summary: timeline, customer & revenue impact, and top-line root cause.
- Timeline: second-by-second incident timeline with detection, decisions, mitigations, and communications.
- Root cause analysis: technical cause with evidence (traces, logs), contributing factors (deploy process gap, missing chaos tests, insufficient connection pooling limits).
- Impact analysis: number of affected sessions, estimated revenue loss, customer sentiment (social metrics), legal exposure.
- Corrective actions (short-, medium-, long-term) mapped to owner, priority, SLO impact, and business value. Examples:
  - Short: improve runbook coverage for cache-related rollbacks; increase DB connection pool limits; add circuit-breakers.
  - Medium: automated canary + traffic shaping for deploys; synthetic checkout tests from multiple regions; emergency compensation automation.
  - Long: resilient architecture changes (connection pooling libraries, isolated checkout service instances), stronger SLOs and error budgets.
- Prevent recurrence metrics: define measurable KPIs (reduced MTTR by X, no repeat cache-eviction incidents in 6 months), deadlines, and quarterly review.
- Blameless retrospective and follow-up: assign owners, track actions to completion, and present progress to execs in 30/60/90 days.

This approach balances rapid recovery, minimizing revenue loss, preserving customer trust through transparent communication, and driving prioritized, measurable fixes aligned to business risk.

Infrastructure as Code and Configuration Management | Easy | Behavioral

Tell me about a time you automated an operational task using IaC or configuration management. Describe the original problem, the automation you built, how you validated it (tests and rollout), and the quantitative or qualitative impact on reliability, availability, or team toil.

Deployment and Release Strategies | Medium | System Design

Design a CI/CD pipeline for a multi-service monorepo that supports feature branches, automated tests, artifact promotion, gated deployments, and emergency rollback. Specify how you would store artifacts, ensure reproducible builds, and support both scheduled and on-demand canary rollouts.

Sample Answer

Requirements & constraints:
- Multi-service monorepo, support feature branches, automated tests, artifact promotion, gated deployments, canary (scheduled + on-demand), emergency rollback, reproducible builds, secure pipeline.

High-level architecture:
- Git monorepo (feature/* branches) → CI server (e.g., GitHub Actions/GitLab CI/Jenkins) → Artifact repository (e.g., Nexus/Artifactory/OCI registry) → CD controller (Argo CD/Spinnaker) → Kubernetes clusters (canary and prod) → Observability (Prometheus/Grafana, Loki, Jaeger) and policy engine (OPA/Gatekeeper).

Pipeline components & flow:

1. Branch CI:
   - On push to feature branch, run deterministic builds in immutable build image (buildkite/docker-in-docker or Kaniko for containers).
   - Run unit tests and linters, generate an SBOM, and pin reproducible build inputs (lockfiles, commit hash, build args).
   - Produce signed artifacts: container images with content-addressable tags (sha256) and metadata (commit, branch, build id).
   - Push to artifact repo into a staging namespace: images/{service}/{commit-sha}.

2. PR gating:
   - Run integration tests in ephemeral environment (namespaced k8s), security scans, and performance smoke tests.
   - Block merge unless checks pass via status checks.

3. Promotion & release:
   - Promotion job moves artifact from staging to candidate registry tag (e.g., vX.Y.Z-candidate) using artifact repo metadata and record provenance in a manifest store (GitOps repo or database).
   - Create a deployment manifest in GitOps repo referencing exact image digests.

4. Canary deployments:
   - CD (Argo CD or Spinnaker) applies manifests to cluster canary namespace or uses traffic-splitting (Istio/Contour/Ingress) to route a % of traffic.
   - Support scheduled canaries: cron-driven promotion pipelines trigger canary rollout using the same manifests.
   - Support on-demand canaries: manual trigger in CD UI or via API.

5. Gated full rollout:
   - Automated health checks and SLO-based analysis (latency, error rate) run for a configurable validation window.
   - If metrics pass, automated promotion to full production gradually (increase traffic ramp).
   - If signals fail, pipeline automatically gates and triggers rollback.

6. Emergency rollback:
   - Maintain immutable history of previous production digests; rollback job in CD can restore prior manifests to 100% traffic within minutes.
   - Provide a one-click emergency rollback and automated circuit-breaker to cut traffic to a safe endpoint.

Reproducible builds & artifact storage:
- Use content-addressable images (digest tags), lock dependencies, build from hermetic build images, store build metadata (commit, build args, SBOM, provenance) in artifact repo and a manifest DB (or GitOps repo).
- Sign images (cosign) and verify signatures in CD before deploy.
- Use immutable retention policies and GC for artifact repo; quarantine unpromoted artifacts.
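
Content addressing is the reason a digest reference is safer than a mutable tag: the identifier is derived from the bytes themselves. The sketch below shows only that idea using SHA-256 from the standard library; it is not how a container registry computes image manifest digests in full, and `content_digest` is a hypothetical helper.

```python
import hashlib


def content_digest(artifact: bytes) -> str:
    """Return a digest-style identifier derived purely from the artifact's bytes."""
    return "sha256:" + hashlib.sha256(artifact).hexdigest()


if __name__ == "__main__":
    build_a = b"binary produced from commit abc123"
    build_b = b"binary produced from commit abc124"
    # Identical bytes always map to the same digest; any change yields a new one.
    print(content_digest(build_a) == content_digest(build_a))
    print(content_digest(build_a) == content_digest(build_b))
```

Because a reproducible build emits identical bytes for identical inputs, two independent builds of the same commit can be checked against each other by comparing digests.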

Security & pipeline hardening:
- Least-privilege service accounts for CI/CD, secrets kept in a secrets manager (HashiCorp Vault or Sealed Secrets), image vulnerability scanning, OPA policies to enforce approved registries and signature verification.
- Audit logs for promotions/rollbacks.

Observability & verification:
- Canary analysis using Prometheus alerts + lightweight anomaly detection (e.g., Kayenta or custom): automatically compare baseline vs canary on errors, p50/p95 latency, saturation.
- Alert SRE on gating failures and provide automated remediation playbooks.
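
The baseline-versus-canary comparison can be sketched as a small decision function. This is a simplified illustration, not Kayenta's algorithm: the `Snapshot` type, the thresholds, and the three verdict strings are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    requests: int
    errors: int
    p95_latency_ms: float

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def canary_verdict(
    baseline: Snapshot,
    canary: Snapshot,
    max_error_delta: float = 0.005,
    max_latency_ratio: float = 1.2,
    min_requests: int = 500,
) -> str:
    """Return 'promote', 'rollback', or 'inconclusive' for one validation window."""
    if canary.requests < min_requests:
        # Too little traffic to judge; extend the window rather than guess.
        return "inconclusive"
    if canary.error_rate - baseline.error_rate > max_error_delta:
        return "rollback"
    if canary.p95_latency_ms > baseline.p95_latency_ms * max_latency_ratio:
        return "rollback"
    return "promote"


if __name__ == "__main__":
    baseline = Snapshot(requests=10_000, errors=20, p95_latency_ms=180.0)
    canary = Snapshot(requests=1_000, errors=3, p95_latency_ms=190.0)
    print(canary_verdict(baseline, canary))
```

Comparing against a live baseline rather than a fixed threshold keeps the gate meaningful when overall traffic or latency shifts during the rollout.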

Trade-offs:
- GitOps + declarative manifests gives strong auditability but needs discipline around manifest updates.
- Argo CD for k8s-native, Spinnaker for complex multi-cloud strategies.

This design ensures reproducible, auditable artifacts, safe gated rollouts, scheduled and manual canaries, and quick emergency rollback—aligned with SRE reliability goals.

Collaboration With Engineering and Product Teams | Medium | Technical

A release introduced intermittent errors after deployment. Walk through how you'd coordinate a cross-functional incident review with product and engineering, define remediation steps, and translate findings into prioritized backlog items and possible SLO changes. Include how you'd track completion.

Sample Answer

Situation: After a production release we began seeing intermittent user-facing errors (5xx spikes and increased latency) affecting ~3% of requests; errors were intermittent so impact and root cause were unclear.

Coordination (first 24–48h):
- Triage: convene a short incident call with on-call SRE, release owner (engineer), product manager, QA lead, and a representative from infra/CI. Share dashboards, error samples, logs, and timeline.
- Scope & rollback decision: assess severity vs. business impact and decide immediate mitigation (hotfix, config change, canary rollback). If high risk, execute rollback while preserving logs/traces.
- Assign roles: Incident commander, scribe, communications owner (status updates), and owners for data, code, and deploy.

Remediation steps:- Short-term: enable circuit-breakers, increase retries/backoff, rollout targeted rollback or patch to the failing subset, add temporary traffic routing to healthy instances.- Medium-term: patch root cause in code/config, add automated tests reproducing failure pattern, add guards in deployment pipeline (schema checks, feature flags).

Post-incident review (within 72h):- Blameless RCA meeting with product + engineering: present timeline, hypothesis testing, evidence (traces, flame graphs), and contributing factors (e.g., untested config combination, gradual rollout threshold).- Produce an action log with owner, description, acceptance criteria, and ETA.

Translating into backlog & SLO changes:- Create prioritized JIRA items: 1) Critical bug fix (P0) — owner, code PR, QA, deploy window 2) Monitoring/alerting improvements — instrument failing code paths, add fine-grained alerts to detect early signals 3) E2E/chaos tests to cover scenario (P1) 4) Deployment pipeline guardrails (P1)- For each item include test plans and rollback strategies.- SLO evaluation: analyze error budget impact; if recurring root cause shows gaps in observability or too-tight SLO, propose SLI refinements (split availability vs. latency), and SLO adjustment only if business/usage pattern changed. Usually prefer adding better monitoring and controls before relaxing SLOs.
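The "analyze error budget impact" step can be made concrete with a short calculation. This is a sketch under assumed numbers: the 99.9% target, the 10M-request window, and the `error_budget_report` helper are illustrative, not taken from the incident.

```python
def error_budget_report(slo_target: float, total_requests: int,
                        failed_requests: int) -> dict:
    """Summarize error budget usage for a request-based availability SLO."""
    # The budget is the number of requests allowed to fail in the window.
    allowed_failures = (1.0 - slo_target) * total_requests
    consumed = (failed_requests / allowed_failures
                if allowed_failures > 0 else float("inf"))
    return {
        "allowed_failures": allowed_failures,
        "budget_consumed": consumed,          # 1.0 means the budget is exhausted
        "budget_remaining": 1.0 - consumed,   # negative means the SLO is breached
    }

# Assumed example: 99.9% SLO over a window of 10M requests; the incident
# affected 100k requests at a ~3% failure rate, i.e. about 3,000 failures.
report = error_budget_report(slo_target=0.999,
                             total_requests=10_000_000,
                             failed_requests=3_000)
print(f"budget consumed: {report['budget_consumed']:.0%}")
```

In this assumed scenario a single incident burns roughly 30% of the window's budget, which supports prioritizing the guardrail and monitoring items rather than relaxing the SLO.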

Tracking completion:
- Track tasks in a dedicated incident epic with status, owners, due dates, and verification steps. Use CI/CD tags to link PRs to the epic.
- Verification: require production canary metrics to show error rate back to baseline for two rolling windows (e.g., 48h) before closure.
- Weekly sync until all actions are closed; close the incident only after RCA action verification and retrospective notes are published to the team and stakeholders.

Learnings: document clear deployment guardrails, update runbooks, and run a short post-mortem readout to engineering and product so the release process improves.

Automation and Scripting (Hard, Technical)

75 practiced

Propose a secure architecture for managing secrets and credentials at scale across multiple CI/CD systems and ephemeral agents. Cover short-lived certificates, OIDC, HSM/KMS usage, secret rotation, auditing and alerting for secret access, and how to provision minimal privileges to ephemeral agents.

Sample Answer

Requirements:
- Centralized secret control across multiple CI/CD systems and ephemeral agents
- Short-lived credentials, strong audit trail, automatic rotation, least privilege
- Hardware-backed root of trust for key material

High-level architecture:
- Root HSM (on-prem HSM or cloud HSM/KMS) holds the master key; used only for signing/unwrap
- Central Secret Broker Service (SBS) backed by KMS/HSM that issues short-lived secrets/certs on demand
- OIDC identity provider (IdP) for workload identities (CI/CD systems, agents)
- Audit & SIEM pipeline ingesting broker/KMS logs, agent activity, alerts

Core flows:
1) Authentication: the ephemeral agent obtains an OIDC token from the IdP (federated identity backed by platform attestation, e.g., GKE Workload Identity or cloud instance identity).
2) Authorization: the agent calls SBS presenting the OIDC token and a signed attestation (optionally hardware attestation such as TPM).
3) Issuance: SBS validates the token, checks policy, then requests KMS/HSM to generate or sign a short-lived certificate/credential (TTL of minutes to hours). Credentials are returned to the agent over mTLS.
4) Use & revocation: services accept certs validated against the SBS-issued CA or via OCSP/CRL; SBS publishes revocation events.

Key components & choices:
- HSM/KMS: root keys isolated in HSM; rotate root keys with KMS key versioning; use envelope encryption for stored secrets.
- Short-lived certs: issue X.509 certs or JWTs with TTL ≤ 1 hour; prefer mutual TLS for service auth.
- OIDC: use OIDC token claims for audience, job id, build id; enforce token exchange with minimal token lifetime.
- Ephemeral agents: never store long-term secrets. Request secrets at runtime, keep them in memory only, and destroy them on exit. Use ephemeral containers with immutable images.
- Privilege minimization: use a policy engine (Rego) in SBS to compute minimal scope: scoped roles, reduced privileges, time-bound access, single-use tokens.
- Secret rotation: automated rotation for stored secrets via SBS and KMS; for issued creds, rely on short TTL + automatic re-issue. Rotate long-lived KMS keys on a policy schedule.
- Auditing & alerting: emit structured logs from SBS and KMS to SIEM (include subject, token ID but never the token itself, job id, source IP, requested scope). Alert on anomalous patterns: unusual requester, high frequency, credential reuse, failed attestation, access outside business hours.
- Secrets in CI/CD: integrate with runners via OIDC (no static secrets). For legacy integrations, provision scoped service accounts with HSM-protected keys and monitor usage.
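The issuance flow (validate claims, check policy, return a short-lived credential) can be sketched with the standard library alone. Everything here is an assumption for illustration: the `POLICY` table, the claim names, and the HMAC signing key stand in for a real policy engine and an HSM/KMS-held key, and the OIDC token's own signature is assumed to have been verified upstream.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional

SIGNING_KEY = b"demo-only-key"   # assumption: in the design this never leaves the HSM/KMS

# Hypothetical policy: which scopes a job from a given repository may request.
POLICY = {"org/payments": {"deploy:staging"}}

def issue_credential(claims: dict, scope: str, ttl_s: int = 900,
                     now: Optional[float] = None) -> str:
    """Issue a signed, time-bound credential for already-verified OIDC claims."""
    now = time.time() if now is None else now
    if claims.get("aud") != "secret-broker":
        raise PermissionError("wrong audience")
    if scope not in POLICY.get(claims.get("repository", ""), set()):
        raise PermissionError("scope not allowed by policy")
    body = json.dumps({"sub": claims["sub"], "scope": scope,
                       "exp": now + ttl_s}, sort_keys=True).encode()
    sig = hmac.new(SIGNING_KEY, body, hashlib.sha256).digest()
    return (base64.urlsafe_b64encode(body).decode() + "." +
            base64.urlsafe_b64encode(sig).decode())

def verify_credential(token: str, now: Optional[float] = None) -> dict:
    """Check signature and expiry; return the payload if both are valid."""
    now = time.time() if now is None else now
    body_b64, sig_b64 = token.split(".")
    body = base64.urlsafe_b64decode(body_b64)
    expected = hmac.new(SIGNING_KEY, body, hashlib.sha256).digest()
    if not hmac.compare_digest(expected, base64.urlsafe_b64decode(sig_b64)):
        raise PermissionError("bad signature")
    payload = json.loads(body)
    if payload["exp"] <= now:
        raise PermissionError("credential expired")
    return payload

claims = {"aud": "secret-broker", "sub": "job-42", "repository": "org/payments"}
token = issue_credential(claims, "deploy:staging", ttl_s=900, now=1000.0)
print(verify_credential(token, now=1500.0)["scope"])  # deploy:staging
```

The short TTL does the work that revocation would otherwise need to do: a leaked credential in this sketch is useless 15 minutes after issuance.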

Operational controls:
- Rate-limit issuance, enforce MFA for high-scope requests, require hardware attestation for privileged credentials.
- Regular pentests, key compromise drills, and runbooks for rapid revocation and CA rotation.
- Compliance: retain audit logs in immutable (WORM) storage and enforce separation of duties.

This design provides strong root-of-trust, minimizes blast radius via short-lived credentials and least privilege, centralizes policy, and ensures full auditability for SRE operations.

Alert Design and Fatigue Management (Easy, Technical)

37 practiced

You have an alert that pages when CPU usage on any web server exceeds 90% for 1 minute. Traffic spikes every evening due to scheduled batch jobs that do not affect user-facing latency. Describe how you would decide whether to keep, modify, or retire this alert. What alternative alert definitions would you propose that focus on user impact rather than raw CPU?

Sample Answer

Situation: We have an alert that pages when any web server CPU > 90% for 1 minute, but nightly scheduled batch jobs cause predictable spikes that don’t impact user-facing latency.

Decision approach (how I’d decide keep/modify/retire)
- Verify actionability: check historical alerts and on-call toil. If nightly alerts repeatedly produce no remediation, they’re noisy and non-actionable.
- Correlate with user impact: analyze metrics (p95 latency, error rate, request throughput) during CPU spikes. If user-facing metrics are unaffected, the alert is misaligned with customer impact.
- Consider responsibility and runbook: who would respond and what would they do? If there’s no immediate remediation (spikes are expected), paging is unnecessary.
- Determine alternatives and thresholds: decide whether a modified alert could be helpful (e.g., detect unexpected spikes outside scheduled windows or sustained cluster-wide saturation).

Decision options
- Retire: if spikes are scheduled, isolated to batch jobs, never affect users, and there’s no feasible on-call action.
- Modify: preferable when CPU can sometimes indicate real incidents. Reduce noise by adding context (duration, scope, correlation).
- Keep (rare): only if CPU > 90% historically correlates with incidents that need immediate mitigation.

Concrete alternative alert definitions (focus on user impact)

1) Composite user-impact alert (preferred):
- Trigger when CPU > 90% AND (p95 latency for the service > X ms OR the 5xx error rate rises by > Y% over baseline), sustained for 5m.
- Rationale: pages only when high CPU coincides with customer-visible degradation.

2) Sustained cluster saturation:
- Trigger when more than N instances have CPU > 80% for > 10 minutes (indicates a capacity issue).
- Rationale: single-host scheduled jobs shouldn’t page; multi-node sustained saturation is actionable.

3) Request-queue / thread saturation:
- Trigger when request queue length or in-flight requests per instance exceed a threshold, or when worker pool exhaustion occurs.
- Rationale: directly maps to throughput/latency problems.

4) Anomaly / out-of-hours detection:
- Suppress/auto-snooze during known batch windows, but alert on spikes outside scheduled windows or on deviations from the historical baseline (z-score).
- Rationale: catches unexpected behavior while avoiding planned noise.

5) Process-level/tenant-aware CPU:
- Alert when CPU used by the webserver process (not the batch job user) exceeds a threshold, or when batch jobs run on the same hosts unsafely.
- Rationale: isolates the component serving users.
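Option 4 (suppress during the batch window, flag deviations from baseline elsewhere) can be sketched as follows. The window times, the z-score threshold of 3, and the function names are assumptions for illustration; a real deployment would take the window from the batch scheduler.

```python
from datetime import time as dtime
from statistics import mean, pstdev
from typing import Sequence

# Assumed nightly batch window; CPU spikes inside it are expected.
BATCH_START, BATCH_END = dtime(22, 0), dtime(23, 30)

def in_batch_window(at: dtime) -> bool:
    return BATCH_START <= at <= BATCH_END

def cpu_anomaly(history: Sequence[float], current: float, at: dtime,
                z_threshold: float = 3.0) -> bool:
    """True if `current` CPU is anomalously high relative to `history`,
    unless the sample falls inside the known batch window."""
    if in_batch_window(at):
        return False                      # planned load: suppress
    mu, sigma = mean(history), pstdev(history)
    if sigma == 0:
        return current > mu               # flat baseline: any rise is a deviation
    return (current - mu) / sigma > z_threshold

history = [0.40, 0.42, 0.38, 0.41, 0.39]  # typical off-window CPU utilization
print(cpu_anomaly(history, 0.95, dtime(14, 0)))    # True
print(cpu_anomaly(history, 0.95, dtime(22, 30)))   # False
print(cpu_anomaly(history, 0.42, dtime(14, 0)))    # False
```

The baseline should be built from samples outside the batch window, otherwise the nightly spikes inflate the standard deviation and mask real anomalies.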

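Example pseudo-alert rule (PromQL-style; X is the p95 latency threshold in seconds)
- Page if:
  (1 - avg by (instance) (rate(node_cpu_seconds_total{job="web", mode="idle"}[5m])) > 0.9)
  and on() (
    histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > X
    or
    sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.01
  )
- Notes: node_cpu_seconds_total is a counter, so utilization is derived from the idle rate rather than from avg_over_time; latency uses histogram_quantile because le is a bucket bound, not a quantile; the 5xx condition is a ratio of error requests to all requests.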
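The same composite logic, including the "sustained for the whole window" requirement, can be expressed as a small evaluator for testing thresholds offline. The `Sample` fields and default thresholds are assumptions for this sketch.

```python
from dataclasses import dataclass
from typing import Sequence

@dataclass
class Sample:
    cpu_util: float        # 0.0 to 1.0
    p95_latency_s: float
    error_ratio: float     # 5xx responses / all responses

def should_page(window: Sequence[Sample], cpu_max: float = 0.9,
                p95_max_s: float = 0.5, error_max: float = 0.01) -> bool:
    """Page only if EVERY sample in the window shows high CPU together with
    user impact (slow p95 or elevated error ratio)."""
    if not window:
        return False
    return all(
        s.cpu_util > cpu_max and
        (s.p95_latency_s > p95_max_s or s.error_ratio > error_max)
        for s in window
    )

batch_spike = [Sample(0.97, 0.12, 0.001)] * 5   # hot CPU, users unaffected
incident = [Sample(0.95, 0.80, 0.030)] * 5      # hot CPU and degraded service
print(should_page(batch_spike))  # False
print(should_page(incident))     # True
```

Replaying historical samples through such a function is one way to estimate how many of the old CPU pages would have been suppressed before changing the production rule.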

Operational follow-up
- Add runbook entries (what to check: batch schedule, process list, latency, pod placement).
- If modified, implement suppression during batch windows or tag instances running batches.
- Monitor for a few weeks and re-evaluate: check reduction in noise and incident correlation.

Why this approach
- Alerts should be actionable and tied to customer impact (SLOs). Raw CPU is a noisy signal; combining resource metrics with user-facing metrics or adding context reduces toil while preserving detection of real incidents.

Incident Leadership and Postmortems (Hard, Technical)

25 practiced

Provide a high-level design (pseudocode or Python skeleton) for a safe rollback orchestrator that performs transactional rollbacks across microservices respecting dependency ordering, supports dry-run, and handles partial failures with compensating actions. Focus on APIs, concurrency control, and failure handling rather than full implementation detail.

Sample Answer

Approach: model services as nodes in a dependency DAG, compute reverse topological order for rollback, and orchestrate per-service rollback transactions with concurrency where dependencies allow. Provide dry-run, idempotent APIs, optimistic concurrency control (version/checksum), timeouts, retries, and compensating actions logged and optionally auto-applied on partial failure.

python

from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Optional
import threading

class ServiceNode:
    def __init__(self, name: str, rollback_api: str,
                 comp_action: Optional[Callable[[], None]] = None):
        self.name = name
        self.rollback_api = rollback_api  # endpoint to call for rollback
        self.compensate = comp_action     # local compensating function if needed
        self.version = None               # for optimistic concurrency

class RollbackResult:
    def __init__(self):
        self.success = []
        self.failed = []
        self.skipped = []

class Orchestrator:
    def __init__(self, graph: Dict[str, List[str]], nodes: Dict[str, ServiceNode], max_workers=10):
        self.graph = graph            # adjacency: service -> [dependencies]; every name must be in nodes
        self.nodes = nodes
        self.max_workers = max_workers

    def _rollback_waves(self, targets: List[str]) -> List[List[str]]:
        # Kahn's algorithm over the whole graph, dependents first: a service is
        # released only after every service that depends on it. Services in the
        # same wave have no dependency path between them, so they can run in parallel.
        pending = {n: 0 for n in self.nodes}      # unprocessed dependents per service
        for n in self.nodes:
            for dep in self.graph.get(n, []):
                pending[dep] += 1
        wanted, waves, seen = set(targets), [], 0
        ready = [n for n in self.nodes if pending[n] == 0]
        while ready:
            seen += len(ready)
            wave = [n for n in ready if n in wanted]
            if wave:
                waves.append(wave)
            released = []
            for n in ready:
                for dep in self.graph.get(n, []):
                    pending[dep] -= 1
                    if pending[dep] == 0:
                        released.append(dep)
            ready = released
        if seen != len(self.nodes):
            raise ValueError("dependency cycle detected")
        return waves

    def rollback(self, target_services: List[str], dry_run=False, timeout=30) -> RollbackResult:
        result = RollbackResult()
        lock = threading.Lock()       # guards the shared result lists

        def worker(svc_name: str) -> bool:
            """Roll back one service; return False if later waves must not proceed."""
            node = self.nodes[svc_name]
            try:
                # Pre-check (optimistic concurrency)
                if not self._check_version(node):
                    with lock:
                        result.skipped.append((svc_name, "version_mismatch"))
                    return False
                if dry_run:
                    with lock:
                        result.skipped.append((svc_name, "dry_run"))
                    return True
                resp = self._call_rollback_api(node, timeout)
            except Exception as exc:
                resp = {"status": "error", "error": repr(exc)}
            if resp.get("status") == "ok":
                with lock:
                    result.success.append(svc_name)
                return True
            # Rollback failed: attempt the compensating action and record its outcome
            if node.compensate:
                try:
                    node.compensate()
                except Exception as exc:
                    resp = {**resp, "compensation_error": repr(exc)}
            with lock:
                result.failed.append((svc_name, resp))
            return False

        halted = False
        with ThreadPoolExecutor(max_workers=self.max_workers) as pool:
            for wave in self._rollback_waves(target_services):
                if halted:
                    # A dependent was not rolled back, so leave its dependencies alone
                    result.skipped.extend((s, "blocked_by_earlier_wave") for s in wave)
                    continue
                outcomes = list(pool.map(worker, wave))   # waits for the whole wave
                halted = not all(outcomes)
        return result

    # Helpers (skeletons)
    def _check_version(self, node: ServiceNode) -> bool:
        # call service status API to get version/checksum and ensure it's rollbackable
        return True

    def _call_rollback_api(self, node: ServiceNode, timeout: int) -> Dict:
        # HTTP call with retries, circuit-breaker, idempotency token; must enforce `timeout` itself
        return {"status": "ok"}

# Usage hints:
# - Waves follow reverse topological order, so dependents roll back before their dependencies.
# - Concurrency: bounded worker pool; only services within one wave run in parallel.
# - Failure handling: retries with backoff, compensating actions, and an audit log to allow manual remediation.
# - Idempotency: supply idempotency keys and optimistic version checks to avoid races.
# - Dry-run: validate ordering and version checks without making changes.

Key points:
- Dependency-aware ordering via reverse topological sort, so dependents are rolled back before their dependencies.
- Concurrency bounded by max_workers; safe parallelism only for independent services.
- Optimistic concurrency/version checks + idempotent rollback endpoints reduce race conditions.
- Dry-run mode and compensating actions for partial failures; audit logs and alerts for manual follow-up.
- Extend with leader election (distributed orchestrator), persistent coordinator state, and observability (tracing, metrics).
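The "retries with backoff" and "idempotency keys" points can be sketched as a standalone helper. This is a hypothetical illustration: `call_with_retries` and the `send` callable's signature are assumptions, and the rollback endpoint is assumed to deduplicate requests by key.

```python
import random
import time
import uuid
from typing import Callable, Dict

def call_with_retries(send: Callable[[str], Dict], attempts: int = 4,
                      base_delay_s: float = 0.2,
                      sleep: Callable[[float], None] = time.sleep) -> Dict:
    """Call `send(idempotency_key)` until it reports ok or attempts run out.

    The same key is reused on every attempt so the server can recognize a
    retry of an operation it may already have applied.
    """
    key = str(uuid.uuid4())
    last_error = None
    for attempt in range(attempts):
        try:
            resp = send(key)
            if resp.get("status") == "ok":
                return resp
            last_error = resp
        except Exception as exc:          # network errors, timeouts, etc.
            last_error = {"status": "error", "error": repr(exc)}
        if attempt < attempts - 1:
            # Exponential backoff plus jitter to avoid synchronized retries.
            sleep(base_delay_s * (2 ** attempt) + random.uniform(0, base_delay_s))
    return {"status": "failed", "idempotency_key": key, "last_error": last_error}

# Usage: a fake endpoint that fails twice, then succeeds.
seen_keys = []
def flaky_send(key: str) -> Dict:
    seen_keys.append(key)
    if len(seen_keys) < 3:
        raise TimeoutError("simulated timeout")
    return {"status": "ok"}

outcome = call_with_retries(flaky_send, sleep=lambda s: None)
print(outcome["status"], len(seen_keys), len(set(seen_keys)))  # ok 3 1
```

Passing `sleep` as a parameter keeps the helper testable without real delays.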

Infrastructure as Code and Configuration Management (Easy, Technical)

29 practiced

List common approaches for managing secrets in infrastructure-as-code workflows (examples: environment variables, encrypted state, HashiCorp Vault dynamic secrets, SOPS/SealedSecrets). For each approach, state pros, cons, and suitability for an SRE team managing production infrastructure.

Sample Answer

Common approaches for managing secrets in IaC workflows, with pros/cons and SRE suitability:

1) Environment variables (local CI/CD or container env)
Pros: Simple, supported everywhere, easy for short-lived values.
Cons: Risk of leakage in logs, process lists, build artifacts; hard to audit/rotate.
Suitability: OK for ephemeral build-time values or non-prod; avoid for long-lived production secrets.

2) Encrypted state (Terraform remote state with encryption + access controls)
Pros: Keeps IaC outputs/secrets centralized; integrates with Terraform workflows.
Cons: If state contains secrets, any state compromise leaks them; rotation and fine-grained access are harder.
Suitability: Use with server-side encryption + strict ACLs; avoid storing plaintext secrets and prefer references.

3) HashiCorp Vault (static & dynamic secrets)
Pros: Dynamic credentials, strong audit logging, TTLs/auto-rotation, granular policies.
Cons: Operational overhead (HA, backup), learning curve; needs network availability for deployments.
Suitability: Strongly recommended for production SRE; best balance of security, rotation, and auditability.

4) SOPS / git-crypt (encrypted files in repo)
Pros: Secrets versioned with code; SOPS supports cloud KMS (AWS/GCP/Azure) or PGP keys, while git-crypt uses GPG or a shared symmetric key; CI-friendly decryption.
Cons: Decrypted artifacts can leak in CI; key management complexity; rotation requires re-encrypting.
Suitability: Good for storing config secrets tied to a repo with strict CI handling; pair with ephemeral keys and limited access.

5) SealedSecrets / External Secrets (Kubernetes patterns)
Pros: GitOps friendly; a controller decrypts (SealedSecrets) or syncs from an external store (External Secrets) into K8s Secrets; sealed secrets are safe to store in Git.
Cons: The resulting K8s Secrets are only base64-encoded, not encrypted, unless encryption at rest is enabled; the controller is an attack surface.
Suitability: Appropriate when Kubernetes is the primary platform and GitOps is used; combine with Vault for production-grade secret injection.

Recommendation for SRE teams: Prefer Vault (or cloud secrets manager with IAM) for production dynamic secrets; use SOPS/SealedSecrets for repo-backed configs; avoid plaintext in state or envs; enforce RBAC, audit logging, automated rotation and CI-safe decryption patterns.
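To back the "avoid plaintext in state" recommendation with a check, a CI step could scan a parsed state file for secret-looking attributes. This sketch assumes the Terraform state JSON layout of `resources[].instances[].attributes`; the name pattern and the `find_plaintext_secrets` helper are illustrative, and a name-based heuristic will miss secrets stored under innocuous keys.

```python
import re
from typing import List

# Heuristic: attribute names that usually hold credentials.
SENSITIVE_NAME = re.compile(r"(password|secret|token|private_key)", re.IGNORECASE)

def find_plaintext_secrets(state: dict) -> List[str]:
    """Return addresses of non-empty string attributes with secret-like names."""
    findings = []
    for res in state.get("resources", []):
        for inst in res.get("instances", []):
            attrs = inst.get("attributes") or {}
            for key, value in attrs.items():
                if SENSITIVE_NAME.search(key) and isinstance(value, str) and value:
                    findings.append(f"{res.get('type')}.{res.get('name')}.{key}")
    return findings

# Hand-written example in the assumed layout (not real state output).
example_state = {
    "resources": [
        {"type": "aws_db_instance", "name": "main",
         "instances": [{"attributes": {"username": "app",
                                       "password": "hunter2"}}]},
        {"type": "aws_s3_bucket", "name": "logs",
         "instances": [{"attributes": {"bucket": "logs-bucket"}}]},
    ]
}
print(find_plaintext_secrets(example_state))  # ['aws_db_instance.main.password']
```

A non-empty result is a prompt to move the value into Vault or a cloud secrets manager and reference it, not proof that the state is otherwise clean.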

Deployment and Release Strategies (Easy, Technical)

71 practiced

Explain the blue-green deployment pattern in detail. Describe the architecture, a step-by-step rollout and rollback process, the benefits and common trade-offs. Specifically explain how you would handle session affinity, database state, and DNS/load-balancer switching in a production environment that has stateless web frontends and a single relational database.

Sample Answer

Blue–green deployment is a release pattern where two identical production environments exist: Blue (live) and Green (new). Traffic switches from Blue to Green when Green is validated, allowing instant rollback by flipping back.

Architecture:
- Two identical stacks (app servers, configs) in front of a single relational DB.
- A shared load balancer or DNS controls which environment receives traffic.
- Stateless web frontends simplify switching; the DB remains central.

Rollout steps:
1. Build and smoke-test Green in a staging-like environment.
2. Deploy code to Green; run automated tests and health checks.
3. Warm caches, run migration dry-runs if needed.
4. Switch traffic: update the load balancer target group (preferred), or change DNS to Green after lowering the DNS TTL well in advance.
5. Monitor metrics, logs, and SLOs closely for a defined observation window.
6. If healthy, decommission Blue or keep it as a fallback.

Rollback:
- If issues appear, immediately switch the load balancer back to Blue (instant). If DNS was used, revert the DNS change and wait for cached records to expire (up to the TTL).
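The switch and rollback steps reduce to moving a single "live" pointer, gated on a health check. The sketch below models that in memory; `BlueGreenRouter` and its health-check callables are hypothetical stand-ins for a load balancer API, not any vendor's interface.

```python
from typing import Callable, Dict, Optional

class BlueGreenRouter:
    """In-memory model of a load balancer pointing at one of two environments."""

    def __init__(self, health_checks: Dict[str, Callable[[], bool]],
                 live: str = "blue"):
        self.health_checks = health_checks
        self.live = live
        self.previous: Optional[str] = None

    def switch_to(self, target: str) -> bool:
        """Cut traffic over to `target` only if it passes its health check."""
        if target == self.live or not self.health_checks[target]():
            return False
        self.previous, self.live = self.live, target
        return True

    def rollback(self) -> bool:
        """Flip back to the environment that was live before the last switch."""
        if self.previous is None:
            return False
        self.live, self.previous = self.previous, None
        return True

router = BlueGreenRouter({"blue": lambda: True, "green": lambda: True})
print(router.switch_to("green"), router.live)  # True green
print(router.rollback(), router.live)          # True blue
```

Keeping `previous` around is what makes rollback instant; it only stays safe while Blue is still deployed and the database schema remains compatible with it.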

Handling specifics:
- Session affinity: prefer sticky sessions only if necessary. Better: keep frontends stateless and store sessions in a shared store (Redis) so switches are transparent. If stickiness is used, use load-balancer-level sticky cookies that persist across the flip, or re-associate sessions through the session store.
- Database state: avoid incompatible schema changes. Use backward-compatible migrations (expand-then-contract):
  - Deploy additive schema changes first.
  - Deploy the app that uses the new fields.
  - Migrate data asynchronously.
  - Remove old columns in a later release.
  For destructive migrations, coordinate a maintenance window and consider dual-write strategies with feature flags.
- DNS vs load-balancer switching: use load-balancer target swaps (health-checked, immediate) for faster, reliable cutover. DNS is acceptable only with a low TTL and when load-balancer control isn't available; it has propagation delay and caching risks.
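The expand-then-contract sequence can be walked through on an in-memory SQLite database. The `users` table and the `full_name` to `display_name` rename are invented for this sketch; the point is that each step leaves both Blue (old code) and Green (new code) working against the same schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, full_name TEXT)")
conn.execute("INSERT INTO users (full_name) VALUES ('Ada Lovelace')")

# 1. Expand: add a nullable column. Blue ignores it, so nothing breaks.
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")

# Blue is still live and keeps writing only the old column.
conn.execute("INSERT INTO users (full_name) VALUES ('Grace Hopper')")

# 2. Green reads the new column but falls back to the old one, so it works
#    before the backfill has finished.
def read_display_name(db: sqlite3.Connection, user_id: int):
    row = db.execute(
        "SELECT COALESCE(display_name, full_name) FROM users WHERE id = ?",
        (user_id,),
    ).fetchone()
    return row[0] if row else None

before_backfill = read_display_name(conn, 2)

# 3. Backfill asynchronously (done inline here for brevity).
conn.execute("UPDATE users SET display_name = full_name WHERE display_name IS NULL")
remaining = conn.execute(
    "SELECT COUNT(*) FROM users WHERE display_name IS NULL").fetchone()[0]

# 4. Contract (a later release): drop full_name only once no deployed
#    version, including the Blue fallback, still reads or writes it.
print(before_backfill, remaining)  # Grace Hopper 0
```

Because step 4 is deferred, flipping traffic back to Blue at any point before it remains safe.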

Benefits:
- Fast, low-risk rollback
- Minimal downtime
- Safe validation of the new release in production

Trade-offs:
- Doubled infra cost temporarily
- Complexity in DB migrations and traffic management
- Need for robust automated testing and monitoring

Best practices:
- Keep frontends stateless, use feature flags, automated health checks, runbooks for rollback, and gradual traffic shifting with canary checks when riskier changes are included.

