Meta Staff Site Reliability Engineer Interview Preparation Guide

Site Reliability Engineer (SRE)

Interview Rounds

Recruiter Screening

45 min | 5 focus topics | behavioral

What to Expect

The initial conversation with Meta's recruiting team to validate your background, assess role fit, and determine if your experience aligns with the Staff-level SRE position. This combines initial phone screening and recruiter follow-up into a single call. The recruiter will discuss your career trajectory, understanding of the SRE discipline, and motivation for joining Meta. This round covers logistics, compensation expectations, and timeline. For Staff-level candidates, expect deeper questions about your impact in previous roles, your approach to scaling teams and systems, and your vision for reliability engineering.

Tips & Advice

Be prepared to articulate your career story with emphasis on the progression to Staff level and the key technical and leadership milestones. Clearly explain what excites you about the SRE role at Meta and how your background specifically prepares you for this position. Have specific examples ready showing how you've improved reliability at scale, led infrastructure initiatives, and mentored engineers. Ask thoughtful questions about Meta's SRE organization, their current reliability challenges, and how this role contributes to Meta's engineering organization. Be honest about compensation expectations but avoid anchoring too early. Show enthusiasm for Meta's mission and technical challenges.

Focus Topics

Motivation & Fit for Meta

Clearly articulate why you're interested in Meta specifically and why this Staff SRE role aligns with your career goals. Research Meta's scale, their infrastructure challenges, and where your skills can create impact. Mention specific aspects of Meta's technology or mission that attract you.

Understanding of SRE Discipline & Philosophy

Demonstrate understanding of SRE as a discipline—the philosophy of treating operations as an engineering problem, the importance of reliability versus feature velocity balance, error budgets, and the SRE toolkit. Show familiarity with industry concepts like SLI/SLO/SLA, toil automation, incident response, and postmortems.

Mentorship & Team Development Approach

Discuss your approach to developing engineers—how you mentor senior engineers, develop talent, create learning opportunities, and contribute to team culture. For Staff level, this is critical to the role.

Examples of Technical Leadership & Impact

Prepare 3-4 concrete examples of times you've made significant technical decisions at scale, led cross-functional initiatives, influenced architecture direction, or improved reliability metrics substantially. For Staff level, focus on examples that affected multiple teams or had organization-wide impact.

Career Narrative & Staff-Level Progression

Articulate your career journey emphasizing how you reached Staff level expertise in SRE. Highlight key inflection points where your impact scaled beyond individual contributions to team and organizational level. Prepare to discuss the progression from IC to senior IC roles, growth in technical depth, and increasing scope of responsibility.

Technical Phone Screen 1: Infrastructure & Systems Knowledge

60 min | 5 focus topics | technical

What to Expect

A focused technical discussion with a Meta SRE engineer exploring your deep knowledge of large-scale infrastructure, systems architecture, and reliability practices. This 50-60 minute interview assesses your ability to think about complex infrastructure problems, design scalable systems, and make informed decisions about reliability trade-offs. The interviewer will present scenarios related to Meta's scale and ask you to discuss architectural choices, tradeoffs, and implementation considerations. Expect questions covering distributed systems concepts, infrastructure automation, monitoring, and capacity planning—all grounded in real-world challenges at Meta's scale.

Tips & Advice

Think out loud and explain your reasoning. For infrastructure problems, start by clarifying requirements and constraints. Discuss tradeoffs explicitly—reliability versus cost, consistency versus availability, automation complexity versus manual effort. Draw diagrams if helpful using a whiteboard or shared document. Show familiarity with monitoring and observability—discuss how you'd measure whether your solution works. Be specific about tools and technologies you'd use, such as Kubernetes, service mesh, and monitoring tools. Don't just describe what you would do; explain why. If you're unsure about something, say so and explain how you'd approach learning it. Show that you understand Meta's scale challenges—billions of users, real-time requirements, global infrastructure.

Focus Topics

Reliability Engineering Principles & SLOs

Deep understanding of SRE philosophy—error budgets, the tension between reliability and velocity, SLI/SLO/SLA definitions, and using these concepts to make engineering decisions. How to instrument systems to track reliability, how to use error budget to guide prioritization.
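As a rough illustration of how an error budget turns an SLO into a number you can prioritize against, here is a minimal sketch. The function name and the request-based (rather than time-based) SLI are assumptions made for this example, not a prescribed Meta practice.

```python
def error_budget_remaining(slo_target: float, total_requests: int,
                           failed_requests: int) -> float:
    """Return the fraction of the error budget still unspent.

    1.0 means no budget used, 0.0 means fully spent, negative means overspent.
    Assumes a request-based availability SLI: good requests / total requests.
    """
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures == 0:
        # A 100% target (or zero traffic) leaves no budget to spend.
        return 0.0 if failed_requests == 0 else float("-inf")
    return 1.0 - failed_requests / allowed_failures


# Example: a 99.9% SLO over 1,000,000 requests allows about 1,000 failures.
remaining = error_budget_remaining(0.999, 1_000_000, 250)
```

With 250 failures against roughly 1,000 allowed, about three quarters of the budget remains, which would typically argue for continuing feature rollouts rather than freezing them.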

Capacity Planning & Scalability

Predicting infrastructure needs based on growth, designing systems to scale linearly or better, understanding resource constraints, and planning expansions. Discuss how to forecast demand, identify bottlenecks, plan database scaling, network scaling, etc. Understanding growth metrics and their implications.
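A simple way to make demand forecasting concrete is to ask how long current capacity lasts under steady compound growth. The sketch below assumes a constant monthly growth rate and a fixed headroom reserve; both are simplifications, and the parameter names are illustrative.

```python
import math


def months_until_exhausted(current_peak: float, capacity: float,
                           monthly_growth: float, headroom: float = 0.3) -> float:
    """Estimate months until peak load reaches usable capacity.

    usable capacity = capacity * (1 - headroom); growth compounds monthly.
    """
    usable = capacity * (1.0 - headroom)
    if current_peak >= usable:
        return 0.0            # already inside the reserved headroom
    if monthly_growth <= 0:
        return math.inf       # flat or shrinking demand never exhausts capacity
    return math.log(usable / current_peak) / math.log(1.0 + monthly_growth)


# Example: peak of 50 units, 100 units of capacity, 20% headroom, 10% growth.
runway = months_until_exhausted(50, 100, 0.10, headroom=0.2)
```

In this example the runway is a little under five months, which is the kind of figure you would compare against hardware lead times when deciding when to order.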

Large-Scale System Architecture Design

Design systems that operate at Meta scale—billions of daily active users, global infrastructure, real-time requirements. Discuss multi-region architecture, failover strategies, consistency models under failure conditions, and architectural patterns for resilience. For Staff level, go deeper into architecture decisions, tradeoffs between consistency and availability, geographic distribution, and handling partial failures.

Infrastructure Automation Frameworks & Tooling

Deep understanding of infrastructure automation—infrastructure as code using tools like Terraform or CloudFormation, configuration management, deployment automation, and rollback strategies. Discuss how to automate infrastructure changes safely, versioning approaches, testing automation, and handling state.

Performance Monitoring & Observability Design

Designing comprehensive monitoring systems for large-scale distributed systems. Understanding metrics, logs, traces, and their tradeoffs. SLI/SLO instrumentation, alerting strategies that minimize false positives, and designing dashboards for different audiences. Knowledge of observability tools and how to architect monitoring for reliability.

Technical Phone Screen 2: Incident Response & Troubleshooting

60 min | 5 focus topics | technical

What to Expect

A technical interview focused on incident response, troubleshooting methodology, root cause analysis, and operational decision-making under pressure. The interviewer will present realistic incident scenarios at Meta's scale and ask how you'd respond: diagnostics, decision-making, communication, and resolution. This assesses your ability to remain calm under pressure, apply systematic troubleshooting, identify root causes, and drive incidents to resolution. For a Staff-level candidate, the focus is also on how you'd lead the response, mentor others during incidents, and conduct effective post-mortems.

Tips & Advice

Treat this like a real incident—ask clarifying questions about what's happening, what symptoms you're seeing, and what services/systems are affected. Work methodically through diagnosis rather than jumping to conclusions. Explain your thought process as you investigate. For each hypothesis, explain how you'd test it and what tools you'd use. Show systematic debugging approaches—understand logs, metrics, traces, customer impact. Don't fixate on one theory; generate and test multiple hypotheses. When you identify the root cause, discuss how to fix it quickly while minimizing risk. Think about rollback options and communication during incidents. For Staff level, discuss how you'd coordinate with other teams, communicate with leadership, and structure the response. Talk about post-incident review—what lessons would you capture, how would you prevent recurrence.

Focus Topics

Post-Incident Review & Learning Process

Designing effective post-incident reviews that extract lessons without blame. Identifying systemic issues versus one-time failures. Creating actionable follow-up work. Understanding how to use incidents as learning opportunities for the team.

Communication & Coordination During Incidents

Communicating clearly during crisis situations—keeping stakeholders informed, avoiding false updates, coordinating across teams, managing customer communication. For Staff level, this includes leading the response, delegating work, and escalating appropriately.

Decision Making & Trade-offs Under Pressure

Making sound technical decisions during incidents when under time pressure and with incomplete information. Balancing speed of recovery versus risk of making things worse. Knowing when to rollback versus rollforward, when to apply bandaids versus permanent fixes, and when to escalate.

Incident Triage & Root Cause Analysis

Systematically diagnosing incidents at scale. Understanding how to triage severity, identify which systems are affected, correlate symptoms, and work toward root cause. Creating effective hypotheses and testing them methodically. Using logs, metrics, traces, and other signals to build understanding.

Troubleshooting Methodology for Distributed Systems

Applying structured troubleshooting approaches to distributed systems where issues can be subtle and hard to reproduce. Understanding cascading failures, partial failures, and how to isolate problems. Knowledge of debugging tools and how to use them systematically.

Onsite Round 1: Systems Design Interview

60 min | 6 focus topics | system design

What to Expect

An in-depth systems design interview where you architect a large-scale system from scratch. You'll be given a system design problem, likely related to Meta's infrastructure or services, and asked to design a complete system considering reliability, scalability, performance, monitoring, and operational aspects. The interviewer will probe your design decisions, ask about tradeoffs, and explore how you'd handle edge cases, failures, and operational requirements. For a Staff-level candidate, expect a higher bar: deeper discussion of reliability patterns, failure scenarios, operational automation, and how the system would be monitored and maintained at scale.

Tips & Advice

Start by clarifying requirements and constraints—scale, latency requirements, consistency requirements, failure modes you need to handle. Think about both the happy path and failure scenarios. Discuss reliability—what happens when components fail, how do you detect failures, how do you recover. Design monitoring and alerting as part of the system, not an afterthought. Think about operational concerns—how would you deploy this, how would you scale it, how would you debug issues. Use diagrams liberally. For each component, discuss tradeoffs—consistency versus availability, strong versus eventual consistency, synchronous versus asynchronous, etc. Discuss data modeling and storage considerations. Show familiarity with patterns used in large-scale systems—load balancing, caching, replication, sharding. For Staff level, go deeper—discuss how the team would be organized, how you'd approach operational automation, how you'd handle geographic distribution, etc.

Focus Topics

SLO-Driven Design & Error Budgets

Designing systems with explicit SLOs and using error budgets to guide reliability investment. Understanding how to instrument systems to track SLIs, how to use error budget to prioritize work, and how design decisions impact achievable SLOs.
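One common way to connect SLI instrumentation to alerting is a burn-rate check: how fast the error budget is being consumed relative to the rate that would exactly exhaust it over the SLO window. The sketch below assumes a two-window rule and a default threshold of 14.4 (the rate that spends 2% of a 30-day budget in one hour); the function names and defaults are illustrative choices.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Burn rate 1.0 spends the budget exactly over the SLO window."""
    return error_ratio / (1.0 - slo_target)


def should_page(short_window_errors: float, long_window_errors: float,
                slo_target: float, threshold: float = 14.4) -> bool:
    """Page only when both a short and a long window burn too fast.

    Requiring both windows filters out brief spikes (long window stays low)
    and stale alerts after recovery (short window drops first).
    """
    return (burn_rate(short_window_errors, slo_target) >= threshold
            and burn_rate(long_window_errors, slo_target) >= threshold)


# Example: 2% errors over 5 minutes and 1.6% over 1 hour against a 99.9% SLO.
page = should_page(0.02, 0.016, 0.999)
```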

Observability & Monitoring Architecture

Designing comprehensive observability into the system—metrics that track business and technical health, logs that help with debugging, traces that help understand distributed flow. SLI instrumentation, alerting strategy.

Multi-region & Geographic Architecture

Designing systems across geographic regions—handling latency, consistency across regions, failover strategies, and traffic routing. Understanding tradeoffs between replication and network costs.

Operational Automation & Deployment Strategy

Designing systems to be operationally simple—automation of common operational tasks, safe deployment strategies such as canary, rolling, or blue-green, rollback capabilities, and monitoring for operational health.

Distributed System Trade-offs & Consistency Models

Understanding fundamental tradeoffs in distributed systems—strong versus eventual consistency, availability versus consistency (CAP theorem), synchronous versus asynchronous communication. When to use each pattern and implications for operations.

Design for Reliability & Failure Handling

Designing systems that continue operating under partial failures. Understanding circuit breakers, bulkheads, graceful degradation, and failure modes. Designing for geographic redundancy, data replication strategies, and consistency models that handle failures. How to design systems so that individual component failures don't cascade.
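To make the circuit-breaker pattern concrete, here is a minimal single-threaded sketch. The class names, thresholds, and injectable clock are assumptions for illustration; a production breaker would also need locking and per-dependency metrics.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast instead of piling load onto a failing dependency.
                raise CircuitOpenError("dependency is failing; call rejected")
            # Cool-down elapsed: fall through and allow one trial call (half-open).
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the breaker
            raise
        # Success closes the breaker and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Callers would typically catch `CircuitOpenError` and serve a degraded response (cached data, a default, or a partial page), which is what keeps one failing component from cascading.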

Onsite Round 2: Distributed Systems & Architecture

60 min | 5 focus topics | technical

What to Expect

A deep-dive technical interview focusing on distributed systems concepts, algorithms, and architecture patterns used in large-scale systems. The interviewer will discuss specific distributed systems challenges—consensus, replication, failure detection, data consistency, load balancing—and explore your understanding at depth. Expect questions about specific algorithms (Raft, Paxos concepts, etc.), when to use them, tradeoffs, and failure scenarios. This round assesses theoretical understanding combined with practical application to real systems.

Tips & Advice

Show deep understanding of distributed systems principles. Be familiar with consensus algorithms, replication strategies, and consistency models—understand the concepts even if you can't write code for them. Discuss specific systems you've worked with—how they handle certain challenges, what tradeoffs they make. When discussing algorithms, explain the intuition and tradeoffs, not just mechanical details. Relate concepts to real operational scenarios—what happens when you lose a majority, what happens during network partitions, how does this impact operations. Be precise with terminology—understand the difference between strong consistency, eventual consistency, causal consistency, etc. Show awareness of practical concerns—latency implications, failure impact, recovery time. For Staff level, discuss how you'd design systems using these concepts, mentor others on these principles, and make architecture decisions based on deep understanding.

Focus Topics

Load Balancing & Traffic Distribution Strategies

Understanding different load balancing approaches, consistent hashing, request routing, handling of uneven load, and stateful load balancing challenges. Tradeoffs between different strategies.
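Consistent hashing is easier to discuss with a small model in hand. The sketch below is one simple variant (a sorted ring with virtual nodes, using MD5 purely as a stable non-cryptographic hash); the class name and the default of 100 virtual nodes are assumptions for this example.

```python
import bisect
import hashlib


class HashRing:
    def __init__(self, nodes, vnodes=100):
        if not nodes:
            raise ValueError("at least one node is required")
        # Each node owns several points on the ring to smooth out load.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")

    def node_for(self, key: str) -> str:
        # First ring point clockwise from the key's hash, wrapping at the end.
        idx = bisect.bisect_right(self._points, self._hash(key)) % len(self._points)
        return self._ring[idx][1]


ring = HashRing(["cache-a", "cache-b", "cache-c"])
owner = ring.node_for("user:42")
```

The property worth stating in an interview is that removing a node only remaps the keys that node owned; keys owned by the surviving nodes stay where they were.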

Network Reliability & Timeout Handling

Understanding network failure modes—packet loss, latency, partitions. How systems should handle timeouts, retries, and idempotency. Understanding failure detection challenges caused by network complexity.
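The interaction between timeouts, retries, and idempotency can be sketched as a retry helper with capped exponential backoff and full jitter. This is an illustrative sketch: the function name, defaults, and the choice to retry only on `TimeoutError` are assumptions, and it should only wrap operations that are safe to repeat.

```python
import random
import time


def call_with_retries(op, attempts=4, base_delay=0.1, max_delay=2.0,
                      sleep=time.sleep):
    """Call op(), retrying on TimeoutError with capped, jittered backoff.

    Only use this for idempotent operations: a timed-out request may still
    have been applied by the server before the retry is sent.
    """
    for attempt in range(attempts):
        try:
            return op()
        except TimeoutError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            cap = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter spreads retries out so clients do not retry in lockstep.
            sleep(random.uniform(0, cap))
```

The jitter matters operationally: without it, many clients that timed out together retry together, which can turn a brief blip into a sustained overload.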

Failure Detection & Recovery Mechanisms

Understanding how systems detect failures using health checks, heartbeats, etc., challenges in detection such as false positives and slow detection, recovery mechanisms, and how to design systems robust to different failure scenarios.
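A timeout-based heartbeat monitor is the simplest failure detector and makes the false-positive versus detection-latency tradeoff visible. The sketch below is deliberately minimal; the class name is hypothetical, and timestamps are passed in explicitly so the behavior is easy to reason about.

```python
class HeartbeatMonitor:
    def __init__(self, timeout: float):
        self.timeout = timeout
        self.last_seen = {}  # node name -> time of last heartbeat

    def heartbeat(self, node: str, now: float) -> None:
        self.last_seen[node] = now

    def suspected(self, now: float) -> list:
        """Nodes whose last heartbeat is older than the timeout.

        A short timeout detects failures quickly but flags slow-yet-healthy
        nodes; a long timeout does the opposite.
        """
        return sorted(node for node, seen in self.last_seen.items()
                      if now - seen > self.timeout)


monitor = HeartbeatMonitor(timeout=5.0)
monitor.heartbeat("web-1", now=0.0)
monitor.heartbeat("web-2", now=4.0)
late = monitor.suspected(now=6.0)
```

Note that a suspected node is not necessarily dead: a network partition or a long pause looks identical from the monitor's side, which is why recovery actions should tolerate the node coming back.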

Data Consistency Models & Guarantees

Understanding different consistency models—strong consistency, eventual consistency, causal consistency, etc. Understanding what each guarantees and implies for application behavior and operational complexity. Understanding how to maintain consistency under failures.

Consensus & Replication Algorithms

Understanding consensus algorithms and their role in distributed systems. Knowledge of algorithm families such as Paxos and Raft, their properties, failure modes, and operational implications. Understanding quorum systems, split-brain scenarios, and recovery from failures.
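The arithmetic behind quorum systems is worth having at your fingertips. The helper names below are illustrative; the formulas themselves are the standard majority and read/write overlap conditions.

```python
def majority(n: int) -> int:
    """Smallest number of replicas that forms a majority of n."""
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """Replicas that can fail while a majority is still reachable."""
    return (n - 1) // 2


def quorums_overlap(n: int, read_quorum: int, write_quorum: int) -> bool:
    """True when every read quorum intersects every write quorum (R + W > N)."""
    return read_quorum + write_quorum > n
```

For example, a 5-node group needs 3 votes and survives 2 failures, while a 4-node group also survives only 1 failure, which is why even-sized groups rarely pay off. Two majorities always intersect, so two sides of a partition cannot both make progress, which is what prevents split-brain.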

Onsite Round 3: Coding Interview (Systems-Focused)

60 min | 4 focus topics | technical

What to Expect

A coding interview focused on systems programming and infrastructure-related coding challenges. You'll solve 1-2 coding problems, likely involving systems concepts like thread safety, resource management, or performance optimization. Problems may relate to building infrastructure components, optimization challenges, or distributed systems algorithms. For a Staff-level SRE, the bar is higher—clean, optimized code with good design. You should be able to discuss efficiency implications, tradeoffs, and how this code would behave in production.

Tips & Advice

Even at Staff level, coding interviews expect clean, working code. But the bar is higher for design and optimization. Think about edge cases and how your code would behave at scale. Consider memory usage, CPU efficiency, and how code would perform under load. Use appropriate data structures and algorithms. For systems-focused problems, think about concurrency, resource limits, and operational aspects. Be ready to optimize based on feedback. Discuss tradeoffs—is correctness more important than performance, or vice versa? How would you test this code? Write code that others on the team would want to maintain.

Focus Topics

Resource Management & Error Handling

Writing code that manages resources correctly—memory, files, connections, etc. Proper cleanup and error handling. Code that doesn't leak resources or crash ungracefully.
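In Python, the usual way to guarantee cleanup on every exit path is a context manager. The sketch below uses a hypothetical in-memory `Pool` class purely for illustration; the point is the `try`/`finally` inside `leased`.

```python
from contextlib import contextmanager


class Pool:
    """Hypothetical fixed-size connection pool, for illustration only."""

    def __init__(self, size: int):
        self._free = [f"conn-{i}" for i in range(size)]

    def acquire(self) -> str:
        if not self._free:
            raise RuntimeError("pool exhausted")
        return self._free.pop()

    def release(self, conn: str) -> None:
        self._free.append(conn)

    @property
    def available(self) -> int:
        return len(self._free)


@contextmanager
def leased(pool: Pool):
    conn = pool.acquire()
    try:
        yield conn
    finally:
        # Runs on normal exit and on exceptions, so connections never leak.
        pool.release(conn)
```

Usage is `with leased(pool) as conn: ...`; even if the body raises, the connection goes back to the pool.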

Infrastructure as Code & Automation Scripts

Writing code for infrastructure automation—scripts for deployment, monitoring, or operational tasks. Clean, maintainable code that others can understand and modify. Proper error handling and logging.

Performance-Critical Code & Optimization

Writing and optimizing code for performance—understanding algorithmic complexity, data structure choices, memory efficiency, cache efficiency. Profiling and identifying bottlenecks. Code that performs well at scale.

Systems Programming & Concurrency

Writing code that handles concurrency correctly—thread safety, synchronization primitives, avoiding deadlocks and race conditions. Understanding performance implications of different concurrency approaches. Coding problems may involve multi-threaded systems or concurrent processing.
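A small warm-up for this topic is a counter shared across threads. The sketch below protects the read-modify-write with a lock; the class and function names are illustrative.

```python
import threading


class SafeCounter:
    def __init__(self):
        self._lock = threading.Lock()
        self._value = 0

    def increment(self, amount: int = 1) -> None:
        # "value += amount" is a read-modify-write; the lock makes it atomic.
        with self._lock:
            self._value += amount

    @property
    def value(self) -> int:
        with self._lock:
            return self._value


def run_workers(counter: SafeCounter, workers: int = 8, per_worker: int = 1000) -> int:
    def work():
        for _ in range(per_worker):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value
```

A natural follow-up discussion is contention: under heavy write load, per-thread counters that are summed on read may scale better than a single lock.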

Onsite Round 4: Infrastructure Automation & Tooling

60 min | 5 focus topics | technical

What to Expect

A technical interview focused on infrastructure automation, tooling, and operational practices. The interviewer will discuss how you design and implement infrastructure automation—tools and frameworks you'd use, how you'd structure automation for maintainability and safety, how you'd approach rolling out infrastructure changes, and how you'd handle operational tooling needs. This may include discussion of container orchestration like Kubernetes, CI/CD systems, configuration management, and monitoring tools. You may be asked to architect an automation solution for a specific operational challenge.

Tips & Advice

Discuss automation in concrete terms—which tools would you use and why? How would you structure the automation to be maintainable? Show familiarity with modern infrastructure tools. For Kubernetes, understand the concepts beyond just deployment—how would you handle upgrades, what about monitoring, how would you handle failures. Discuss how you'd make infrastructure changes safely—testing, gradual rollout, rollback capability. Show that you think about operational concerns—visibility into what's running, debugging when things go wrong, capacity planning. For Staff level, discuss architecture decisions—how you'd organize automation, how you'd scale it, how you'd make it work across teams. Show that you understand tradeoffs—flexibility versus simplicity, automated versus manual, etc.

Focus Topics

Monitoring, Alerting & Observability Tooling

Designing and implementing comprehensive monitoring and alerting—metrics collection, log aggregation, distributed tracing. Understanding observability tools and platforms. How to design monitoring that's actually useful for operations.

Infrastructure as Code & Configuration Management

Treating infrastructure as code—versioning infrastructure, reviewing changes, testing infrastructure changes, documenting infrastructure. Using IaC tools effectively, handling secrets, managing drift.

CI/CD Pipeline Design & Implementation

Designing comprehensive CI/CD systems—build automation, testing, deployment automation, rollback capabilities. Understanding different deployment strategies such as canary, rolling, or blue-green, how to make deployments safe, and how to handle failure scenarios.
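A canary deployment needs an explicit promotion rule. The sketch below shows one possible gate that compares the canary's error rate with the baseline's; the thresholds, the minimum sample size, and the three-way verdict are assumptions chosen for illustration, not a standard.

```python
def canary_verdict(baseline_errors: int, baseline_total: int,
                   canary_errors: int, canary_total: int,
                   max_ratio: float = 2.0, min_requests: int = 500,
                   floor: float = 0.001) -> str:
    """Return "wait", "promote", or "rollback" for a canary stage."""
    if canary_total < min_requests:
        return "wait"  # too little traffic to judge either way
    baseline_rate = baseline_errors / max(baseline_total, 1)
    canary_rate = canary_errors / canary_total
    # The floor avoids rolling back on noise when the baseline is near zero.
    allowed = max(baseline_rate * max_ratio, floor)
    return "promote" if canary_rate <= allowed else "rollback"
```

In a pipeline, "rollback" would trigger the automated rollback path and "wait" would hold the stage until more traffic has been observed.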

Infrastructure Automation Framework & Tooling

Understanding and designing infrastructure automation systems using tools like Terraform, Ansible, or custom solutions. How to make automation maintainable, testable, and safe. How to handle state management, versioning, and rollback in infrastructure automation.

Container Orchestration & Kubernetes Architecture

Deep understanding of Kubernetes architecture, deployment strategies, resource management, networking, storage, and operational practices. Understanding how to design reliable Kubernetes deployments, handle node failures, perform upgrades safely. Understanding security and isolation in containerized environments.

Onsite Round 5: Behavioral & Leadership Interview

60 min | 5 focus topics | behavioral

What to Expect

A behavioral interview assessing cultural fit, values alignment, and leadership capability. The interviewer will ask about your experience with cross-functional collaboration, how you lead technical initiatives, how you handle ambiguity and conflict, and how you've grown as a leader. At Staff level, the focus is on your ability to influence across teams, mentor senior engineers, communicate with stakeholders at different levels, and drive complex technical decisions to completion. You'll be asked about specific situations you've handled and how they demonstrate Meta values such as moving fast, being intellectually honest, and building impact.

Tips & Advice

Use the STAR method for behavioral questions—Situation, Task, Action, Result. Focus on your role and impact specifically, not just team achievements. Show how you handle ambiguity—when you don't have all information, how do you proceed? Demonstrate learning and growth by sharing examples of challenges you've overcome. Show emotional intelligence—how you handle disagreement, support teammates, navigate organizational dynamics. For Staff level, emphasize leadership—how you've influenced decisions, driven changes across teams, developed junior engineers. Share examples of times you've made unpopular but necessary technical decisions. Show how you balance speed and quality, individual contribution and team development. Discuss how you've handled failure and what you learned. Align your stories with Meta values—moving fast safely, being direct and honest, building things that matter.

Focus Topics

Technical Decision Making & Ownership

How you approach complex technical decisions—gathering data, involving stakeholders, making decisions with incomplete information. Examples of decisions you've made and owned, including decisions that were later changed (showing humility). How you think about tradeoffs.

Meta Values Alignment (Move Fast, Honesty, Impact, etc.)

Demonstrating alignment with Meta values—examples of moving fast while maintaining quality, being intellectually honest even when it's unpopular, building things that create real impact. Showing how you'd fit into Meta culture.

Communication & Stakeholder Management

How you communicate complex technical concepts to different audiences—engineers, managers, executives. How you handle misalignment and different perspectives. Examples of times you've communicated through crisis or ambiguity.

Cross-Functional Leadership & Influence

Demonstrating ability to lead initiatives that span multiple teams without direct authority. How you influence technical decisions, build consensus, and drive changes across organizational boundaries. Examples of successfully navigating complex technical or organizational situations.

Mentoring & Technical Talent Development

Your approach to developing engineers—how you've mentored junior and senior staff, helped teammates grow, created learning opportunities. Examples of engineers you've developed and their growth trajectory. How you approach coaching on technical and professional development.

Frequently Asked Site Reliability Engineer (SRE) Interview Questions

Deployment and Release Strategies (Hard, Technical)

81 practiced

You need to schedule a rollout across interdependent microservices based on a dependency graph. Describe an algorithm to compute safe batches of services to deploy in parallel, handling cycles and optional parallelism while minimizing total rollout time and ensuring compatibility constraints.

Sample Answer

Approach (high level):
1. Convert the dependency graph to a directed graph G where edge A→B means B depends on A (A must be rolled out before B).
2. Collapse strongly connected components (SCCs) to handle cycles: treat each SCC as an atomic unit (deployed together, or flagged for manual coordination).
3. Build the condensed DAG of SCCs. For each node, compute its critical-path length (the longest cumulative deploy time from that node to a leaf) to prioritize critical services.
4. Perform a bounded-parallel topological scheduling: repeatedly pick all ready nodes (in-degree 0), but schedule them in priority order up to the concurrency/resource limits and compatibility constraints. Use a min-heap of ongoing deployments keyed by finish time to simulate time and free slots.
5. Respect compatibility: before scheduling a node, ensure its new version is compatible with already-deployed neighbors (using a compatibility matrix); if it is incompatible, delay it or include the dependent nodes in the same batch.
6. Produce batches as sets of nodes started at the same simulated time.

Algorithm sketch (Python-like pseudocode):

python

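import heapq
from itertools import count

# Each node of the condensed DAG has: id, deploy_time, indegree, out_neighbors.
# tarjan_scc, condense_graph, longest_path, is_compatible, and mark_deployed
# are assumed helpers; graph and max_concurrency are assumed inputs.
sccs = tarjan_scc(graph)
dag = condense_graph(sccs)
for n in dag:
    n.critical = longest_path(n)  # DFS with memoization over deploy_time

ready = [n for n in dag if n.indegree == 0]
now = 0
tie = count()   # tie-breaker so the heap never has to compare node objects
running = []    # heap of (finish_time, seq, node)
batches = []    # list of (start_time, [node ids])
deployed = set()

while ready or running:
    # Fill free slots with the most critical compatible nodes.
    slots = max_concurrency - len(running)
    to_start = []
    for c in sorted(ready, key=lambda x: -x.critical):
        if slots <= 0:
            break
        if is_compatible(c, deployed):
            to_start.append(c)
            slots -= 1

    if to_start:
        for n in to_start:
            ready.remove(n)
            heapq.heappush(running, (now + n.deploy_time, next(tie), n))
        batches.append((now, [n.id for n in to_start]))
        continue

    if not running:
        # Nothing is running and no ready node can start: compatibility deadlock.
        raise RuntimeError("rollout blocked: co-deploy or coordinate manually")

    # Advance simulated time to the next deployment that finishes.
    now, _, finished = heapq.heappop(running)
    mark_deployed(finished)
    deployed.add(finished.id)
    for nb in finished.out_neighbors:
        nb.indegree -= 1
        if nb.indegree == 0:
            ready.append(nb)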

Key reasoning and trade-offs:
- SCC condensation handles cycles by grouping mutually dependent services; you either deploy them as an atomic unit or require human coordination, so the scheduler never stalls waiting on a cycle.
- Prioritizing by critical path tends to reduce total rollout time (it is a list-scheduling heuristic for makespan, not a guaranteed optimum).
- Simulating time with a heap of running deployments models concurrency limits and deploy durations directly.
- Compatibility checks may force co-deployments (add edges) or delays; incorporate them into readiness.

Edge cases:
- Large SCCs: break them up with feature flags or compatibility shims if possible.
- Partial failures: include rollback hooks, canary gating, and retries with exponential backoff.
- Dynamic dependencies: re-evaluate readiness if the graph changes mid-rollout.

This gives a deterministic, resource-aware batching plan that keeps makespan low (heuristically, not provably minimal) while handling cycles and compatibility.
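The critical-path priority used in the approach can be computed with a memoized depth-first traversal of the condensed DAG. The sketch below assumes the DAG is given as two plain dictionaries (`deploy_time` and `successors`, both hypothetical names) rather than node objects.

```python
def critical_path_lengths(deploy_time: dict, successors: dict) -> dict:
    """Longest cumulative deploy time from each node down to a leaf.

    deploy_time: node -> duration; successors: node -> nodes that depend on it.
    Assumes the graph is acyclic (cycles were already collapsed into SCCs).
    """
    memo = {}

    def visit(node):
        if node not in memo:
            downstream = (visit(s) for s in successors.get(node, ()))
            memo[node] = deploy_time[node] + max(downstream, default=0)
        return memo[node]

    return {node: visit(node) for node in deploy_time}


lengths = critical_path_lengths(
    {"auth": 2, "api": 3, "cache": 1, "web": 4},
    {"auth": ["api", "cache"], "api": ["web"], "cache": ["web"]},
)
```

Here `auth` has the longest downstream chain (auth, api, web), so it is started first. For very deep graphs the recursion would need to be replaced with an iterative traversal in reverse topological order.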

Failure Detection and Automated Response (Easy, Technical)

89 practiced

Compare active (synthetic/blackbox) monitoring vs passive (instrumentation/whitebox) monitoring. Provide concrete examples of signals each provides, strengths and weaknesses, and when you'd choose one over the other for detecting failures across a global e-commerce stack (CDN, API gateway, backend services). Discuss cadence, cost, and blind spots.

Sample Answer

Active (synthetic/blackbox) vs Passive (instrumentation/whitebox) monitoring

Quick definition
- Active (blackbox): external probes simulate user traffic from many locations (HTTP checks, synthetic transactions) and treat the system as a black box.
- Passive (whitebox): collects internal telemetry (metrics, logs, traces, health endpoints, profilers) from the services themselves.
- Real-user monitoring (RUM) is a hybrid: the data is collected passively, but from the user's vantage point outside the system.

Concrete signals
- Active: global HTTP status codes, end-to-end latency, DNS resolution time, TLS handshake time, CDN cache hit/miss as seen by a probe, payment checkout flow success rates from probes.
- Passive: per-service CPU/memory, internal error counters, request latencies by span, DB connection pool saturation, application logs with stack traces, deployment/version tags.

Strengths & weaknesses
- Active strengths: detects customer-visible outages (global reach), validates routing/edge config, catches problems even when the telemetry pipeline is down.
- Active weaknesses: limited internal visibility, synthetic load may not hit all code paths, cost grows with cadence × locations.
- Passive strengths: rich root-cause data, fine-grained SLOs, resource and error context.
- Passive weaknesses: blind to CDN/DNS/ISP issues and to telemetry pipeline failures; requires instrumentation and access.

When to choose (global e‑commerce stack)
- CDN: active probes from multiple ISPs/regions to detect cache misconfiguration, TLS/DNS problems, or POP-level outages.
- API gateway: combine both. Use active health checks for availability and end-to-end latency, and passive metrics/traces for routing/middleware errors and per-backend error rates.
- Backend services: prioritize passive (metrics, traces, logs) to diagnose CPU spikes, DB latency, or code exceptions; add synthetic tests for critical user journeys (checkout) that exercise backend integration.

Cadence, cost, blind spots- Cadence: critical paths (checkout, login) run synthetic checks every 15–60s globally; less critical every 5–15m. Passive is continuous but watch cardinality of metrics/logs to control cost.- Cost: active cost = probes × locations × frequency (can be substantial at high cadence globally). Passive cost = ingest volume (high-cardinality traces/logs).- Blind spots: active misses internal state and low-volume edge-cases; passive misses network/DNS/CDN reachability and failures of the monitoring pipeline itself.
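To make the "probes × locations × frequency" cost driver concrete, here is a minimal sketch that counts probe executions per day. The function name and the tier sizes (3 critical journeys, 20 secondary checks, 12 locations) are illustrative assumptions, not figures from any real deployment.

```python
def synthetic_runs_per_day(num_checks: int, num_locations: int, interval_seconds: float) -> int:
    """Count probe executions per day for checks run from every location."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    runs_per_check_per_location = 86_400 / interval_seconds  # seconds in a day
    return round(num_checks * num_locations * runs_per_check_per_location)


# Hypothetical tiers: critical journeys every 30 s, secondary checks every 10 min.
critical = synthetic_runs_per_day(num_checks=3, num_locations=12, interval_seconds=30)
secondary = synthetic_runs_per_day(num_checks=20, num_locations=12, interval_seconds=600)
print(critical, secondary)
```

Under these assumptions the three critical checks generate roughly three times as many executions as the twenty secondary ones, which is why cadence tiers matter more than the raw number of checks.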

Recommended strategy
- Use both: deploy global synthetic probes for customer-facing availability and edge issues; rely on rich instrumentation for root-cause analysis and SLOs.
- Alerting: page the on-call for the affected page or checkout service when active probes detect an outage, and attach playbooks that automatically surface the relevant passive traces and metrics with the page.

Automation and Scripting (Easy, Technical)

Describe a practical testing strategy for automation scripts: how you would structure unit tests, integration tests, use mocks and fixtures, test idempotency and side effects, and run these tests in CI. Include considerations for flaky tests and running tests that require cloud resources.

Sample Answer

I’d design a layered, SRE-focused test strategy that keeps automation reliable, fast, and safe.

Unit tests:
- Small, fast, pure-function tests for logic (parsing, decision rules). Use pytest/Jest with fixtures for input cases.
- Heavy use of mocks for external APIs, secrets, and subprocesses so tests never touch infra.
- Validate idempotency at function level by calling handlers twice and asserting the same state and no duplicate side effects.

Integration tests:
- Test interactions between components (e.g., automation -> API gateway -> config store). Use a separate test environment or docker-compose to run real services.
- Use test doubles sparingly; prefer real infra for critical flows (auth, provisioning) in isolated namespaces/projects.

Fixtures & test data:
- Reusable fixtures to set up and tear down resources; use factory patterns and unique prefixes to avoid collisions.
- Clean-up hooks and exponential-backoff retries to handle eventual consistency.

Testing side effects & idempotency:
- For scripts that modify state, run in “dry-run” mode and assert the intended API calls; for real runs, snapshot pre/post state and assert that idempotent re-application yields no changes.
- Use transaction-like rollbacks or resource tagging for reliable cleanup.
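As a minimal sketch of the "call it twice and assert no duplicate side effects" idea, the test below uses unittest.mock from the standard library. The ensure_tag helper and the client's get_tags/set_tag methods are hypothetical names invented for this illustration; they do not refer to any real SDK.

```python
from unittest.mock import Mock


def ensure_tag(client, resource_id: str, key: str, value: str) -> bool:
    """Set a tag only if it is not already set; return True if a write happened."""
    current = client.get_tags(resource_id)
    if current.get(key) == value:
        return False
    client.set_tag(resource_id, key, value)
    return True


def test_ensure_tag_is_idempotent():
    state = {}  # in-memory stand-in for the remote resource's tags
    client = Mock()
    client.get_tags.side_effect = lambda rid: dict(state)
    client.set_tag.side_effect = lambda rid, k, v: state.__setitem__(k, v)

    assert ensure_tag(client, "vm-1", "owner", "sre") is True    # first run writes
    assert ensure_tag(client, "vm-1", "owner", "sre") is False   # second run is a no-op
    client.set_tag.assert_called_once_with("vm-1", "owner", "sre")
    assert state == {"owner": "sre"}


test_ensure_tag_is_idempotent()
```

Under pytest the final explicit call is unnecessary; it is included so the snippet also runs as a plain script.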

CI integration:
- Fast unit tests run on every PR. Integration and cloud-dependent tests run on merge / nightly pipelines with isolated IAM roles and budgets.
- Parallelize tests, cache dependencies, and gate merges on test suites and linting.

Flaky tests:
- Detect via retry-once with logging; mark as flaky and quarantine until fixed. Record flakiness metrics and require a root cause before unquarantining.
- Prefer deterministic setups (fixed clocks, seeded randomness, mocked time) to reduce flakiness.

Cloud resource considerations:
- Use ephemeral test accounts/projects with limited quotas and automated teardown; employ budgets and alerts to prevent runaway costs.
- Tag and sweep orphaned resources; run cost-sensitive tests less frequently and mock expensive services where possible.

This balances speed, reliability, and safety for SRE automation.

Capacity Planning and Resource Optimization (Medium, Technical)

Write (or describe) a Python function propose_instances(timeseries_cpu_percent, per_instance_cpu_capacity_percent, target_p95_util_percent) that, given CPU utilization samples for existing instances over time, proposes the number of identical instances needed to keep p95 utilization below the target. Assume adding instances divides utilization proportionally. Explain handling of missing values and rounding.

Sample Answer

Approach: compute the p95 of observed per-instance CPU% across time (ignoring missing samples), then find the smallest integer N (>= 1) such that the scaled p95, current_p95 * current_instances / N, is <= target_p95_util_percent. The input is a per-instance time series (list of lists); assuming identical instances and proportional division, changing the instance count from current_instances to N multiplies every sample by current_instances / N.

python

import math
from itertools import zip_longest
from typing import List, Optional

import numpy as np


def propose_instances(timeseries_cpu_percent: List[List[Optional[float]]],
                      per_instance_cpu_capacity_percent: float,
                      target_p95_util_percent: float) -> int:
    """
    timeseries_cpu_percent: one list per instance, each a list of CPU% samples
        aligned by index (None for a missing sample)
    per_instance_cpu_capacity_percent: usually 100.0
    target_p95_util_percent: desired p95 per-instance utilization (e.g., 70.0)
    Returns: proposed number of identical instances (>= 1)
    """
    num_instances_current = len(timeseries_cpu_percent)
    if num_instances_current == 0:
        return 1
    # Pad ragged series with None so every timestamp has one slot per instance.
    rows = list(zip_longest(*timeseries_cpu_percent, fillvalue=None))
    if not rows:
        return num_instances_current
    # Rows are timestamps, columns are instances; None becomes NaN.
    arr = np.array([[np.nan if x is None else x for x in row] for row in rows], dtype=float)
    # Drop timestamps at which no instance reported a sample.
    arr = arr[~np.isnan(arr).all(axis=1)]
    if arr.size == 0:
        return num_instances_current
    # Mean per-instance utilization at each timestamp, ignoring missing samples.
    # Multiplying by num_instances_current would give the total load at that time.
    per_instance_samples = np.nanmean(arr, axis=1)
    current_p95 = float(np.percentile(per_instance_samples, 95))
    # A target above the per-instance capacity is unreachable; aim for capacity instead.
    effective_target = min(target_p95_util_percent, per_instance_cpu_capacity_percent)
    # Find the smallest N with current_p95 * (num_instances_current / N) <= effective_target,
    # i.e. N >= num_instances_current * current_p95 / effective_target.
    required = (num_instances_current * current_p95) / max(1e-6, effective_target)
    return max(1, math.ceil(required))

Key points:
- Missing samples: treat as NaN and average each timestamp over the instances that reported; drop timestamps with no data.
- Rounding: take the ceiling of the required instance count; the result must be an integer, and rounding up is the conservative choice.
- Complexity: O(I*T) for I instances and T timestamps.
- Edge cases: all data missing -> return the current count (or 1); target 0 -> clamp to a tiny epsilon to avoid division by zero; series of unequal length -> pad the shorter ones with None so the extra slots count as missing.
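The sizing rule can be checked by hand with a small worked example. The sketch below isolates only the final arithmetic step (N >= current_instances * current_p95 / target); the helper name and the numbers (4 instances at a p95 of 85%, target 60%) are assumptions chosen for illustration.

```python
import math


def instances_for_target(current_instances: int, current_p95: float, target_p95: float) -> int:
    """Smallest N such that current_p95 * current_instances / N <= target_p95."""
    if target_p95 <= 0:
        raise ValueError("target_p95 must be positive")
    return max(1, math.ceil(current_instances * current_p95 / target_p95))


n = instances_for_target(current_instances=4, current_p95=85.0, target_p95=60.0)
projected_p95 = 85.0 * 4 / n  # utilization after scaling, under proportional division
print(n, round(projected_p95, 1))
```

Here 4 * 85 / 60 is about 5.67, so rounding up gives 6 instances and a projected p95 of roughly 56.7%, which sits below the 60% target; rounding down to 5 would leave it at 68%.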

Container Orchestration and Kubernetes Operations (Easy, Technical)

Describe taints and tolerations in Kubernetes. Provide a clear example of tainting a node to accept only spot-instance tolerant workloads and explain how you would ensure critical control plane or monitoring pods still run on that node when necessary.

Sample Answer

Taints and tolerations are Kubernetes mechanisms to influence scheduling by making nodes repel pods (taints) unless a pod explicitly tolerates them. A taint has key=value:effect (effects: NoSchedule, PreferNoSchedule, NoExecute). A toleration in a PodSpec declares it can be scheduled onto nodes with matching taints.

Example: mark a node-pool of spot instances so only spot-tolerant workloads run there.

Taint the node (or node pool):

kubectl taint nodes ip-spot-01 spot=spot-instance:NoSchedule

A spot-tolerant pod (deployment) adds a toleration so it can land on tainted nodes:

yaml

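apiVersion: apps/v1
kind: Deployment
metadata: {name: my-batch}
spec:
  # apps/v1 Deployments require a selector that matches the pod template labels
  selector:
    matchLabels: {app: my-batch}
  template:
    metadata:
      labels: {app: my-batch}
    spec:
      tolerations:
      - key: "spot"
        operator: "Equal"
        value: "spot-instance"
        effect: "NoSchedule"
      containers:
      - name: worker
        image: my-batch:latest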

Ensuring critical control-plane or monitoring pods still run:
- Add tolerations to the system and monitoring DaemonSets/Deployments that must run on those nodes (e.g., kube-proxy, Prometheus node-exporter). Use key: "spot", operator: "Exists", effect: "NoSchedule" if they should tolerate the taint regardless of its value.
- Alternatively, use a softer taint such as PreferNoSchedule to make spot nodes a lower-priority target, or keep dedicated non-spot node pools for critical workloads and pin them there with nodeAffinity.

Trade-offs: giving system pods tolerations risks running critical services on ephemeral spot nodes; prefer reserving some on-demand nodes for control/monitoring, or replicate critical collectors across both pools with appropriate tolerations and leader election.
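The matching rules between tolerations and taints can be sketched in a few lines of Python. This is a simplified model for reasoning about the example above, not the scheduler's actual implementation: the Taint and Toleration classes and the can_schedule helper are hypothetical, and PreferNoSchedule is treated as non-blocking because it is only a soft preference.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Taint:
    key: str
    value: str
    effect: str  # "NoSchedule", "PreferNoSchedule", or "NoExecute"


@dataclass(frozen=True)
class Toleration:
    key: str = ""            # empty key with operator "Exists" matches any taint key
    operator: str = "Equal"  # "Equal" or "Exists"
    value: str = ""
    effect: str = ""         # empty effect matches any effect


def tolerates(tol: Toleration, taint: Taint) -> bool:
    """True if this toleration matches the taint."""
    if tol.effect and tol.effect != taint.effect:
        return False
    if tol.operator == "Exists":
        return tol.key == "" or tol.key == taint.key
    return tol.key == taint.key and tol.value == taint.value


def can_schedule(tolerations: List[Toleration], taints: List[Taint]) -> bool:
    """A pod may land on the node only if every hard taint is tolerated."""
    hard = [t for t in taints if t.effect in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, t) for tol in tolerations) for t in hard)


spot_node = [Taint("spot", "spot-instance", "NoSchedule")]
batch_pod = [Toleration("spot", "Equal", "spot-instance", "NoSchedule")]
monitoring_pod = [Toleration("spot", "Exists", effect="NoSchedule")]
print(can_schedule(batch_pod, spot_node), can_schedule(monitoring_pod, spot_node), can_schedule([], spot_node))
```

The Exists form is what lets a monitoring DaemonSet keep running if the taint's value is later changed, whereas the Equal form stops matching as soon as the value differs.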

Blameless Postmortem and Organizational Learning (Medium, Technical)

Write a concise three-paragraph executive summary of this hypothetical outage: 'An authentication-service outage after a schema migration caused a two-hour downtime affecting 30% of API traffic and estimated $50k revenue impact that day.' Include prioritized corrective actions with estimated timelines suitable for C-suite consumption.

Deployment and Release Strategies (Medium, System Design)

Design a CI/CD pipeline for a multi-service monorepo that supports feature branches, automated tests, artifact promotion, gated deployments, and emergency rollback. Specify how you would store artifacts, ensure reproducible builds, and support both scheduled and on-demand canary rollouts.

Sample Answer

Requirements & constraints:
- Multi-service monorepo with support for feature branches, automated tests, artifact promotion, gated deployments, canaries (scheduled and on-demand), emergency rollback, reproducible builds, and a secure pipeline.

High-level architecture:
- Git monorepo (feature/* branches) → CI server (e.g., GitHub Actions/GitLab CI/Jenkins) → Artifact repository (e.g., Nexus/Artifactory/OCI registry) → CD controller (Argo CD/Spinnaker) → Kubernetes clusters (canary and prod) → Observability (Prometheus/Grafana, Loki, Jaeger) and policy engine (OPA/Gatekeeper).

Pipeline components & flow:
1. Branch CI:
   - On push to a feature branch, run deterministic builds in an immutable build image (e.g., Docker-in-Docker or Kaniko for container builds).
   - Run unit tests and linters, generate an SBOM, and pin build inputs for reproducibility (lockfiles, commit hash, build args).
   - Produce signed artifacts: container images referenced by content-addressable digest (sha256) with metadata (commit, branch, build id).
   - Push to the artifact repo under a staging namespace: images/{service}/{commit-sha}.

2. PR gating:
   - Run integration tests in an ephemeral environment (namespaced k8s), security scans, and performance smoke tests.
   - Block the merge via required status checks unless all of them pass.

3. Promotion & release:
   - A promotion job moves the artifact from staging to a candidate registry tag (e.g., vX.Y.Z-candidate) using artifact repo metadata and records provenance in a manifest store (GitOps repo or database).
   - Create a deployment manifest in the GitOps repo that references exact image digests.

4. Canary deployments:
   - The CD controller (Argo CD or Spinnaker) applies manifests to a canary namespace in the cluster or uses traffic splitting (Istio/Contour/Ingress) to route a percentage of traffic.
   - Scheduled canaries: cron-driven promotion pipelines trigger the canary rollout using the same manifests.
   - On-demand canaries: manual trigger in the CD UI or via API.

5. Gated full rollout:
   - Automated health checks and SLO-based analysis (latency, error rate) run for a configurable validation window.
   - If metrics pass, promote to full production automatically and gradually (ramp traffic up in steps).
   - If signals fail, the pipeline halts the rollout and triggers rollback.

6. Emergency rollback:
   - Maintain an immutable history of previous production digests; a rollback job in CD can restore prior manifests to 100% traffic within minutes.
   - Provide a one-click emergency rollback and an automated circuit breaker to cut traffic over to a safe endpoint.

Reproducible builds & artifact storage:
- Use content-addressable images (digest tags), lock dependencies, build from hermetic build images, and store build metadata (commit, build args, SBOM, provenance) in the artifact repo and a manifest DB (or GitOps repo).
- Sign images (cosign) and verify signatures in CD before deploy.
- Use immutable retention policies and GC for the artifact repo; quarantine unpromoted artifacts.
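Content addressing is what makes "the same bytes always get the same name" true. The sketch below illustrates the idea with Python's hashlib on raw bytes; it is an assumption-laden simplification, since a real registry computes an image digest over the image manifest rather than over an arbitrary file, and content_digest is a hypothetical helper.

```python
import hashlib


def content_digest(artifact_bytes: bytes) -> str:
    """Return a digest-style identifier derived only from the artifact's bytes."""
    return "sha256:" + hashlib.sha256(artifact_bytes).hexdigest()


build_a = content_digest(b"service binary built from commit abc123")
build_b = content_digest(b"service binary built from commit abc123")
build_c = content_digest(b"service binary built from commit def456")
print(build_a == build_b, build_a == build_c)
```

Because the identifier is a pure function of the content, a manifest that pins a digest cannot silently start pointing at different bytes, which a mutable tag such as latest can.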

Security & pipeline hardening:
- Least-privilege service accounts for CI/CD, secrets in a secrets manager (HashiCorp Vault or Sealed Secrets), image vulnerability scanning, and OPA policies to enforce approved registries and signature verification.
- Audit logs for promotions/rollbacks.

Observability & verification:
- Canary analysis using Prometheus alerts plus lightweight anomaly detection (e.g., Kayenta or custom): automatically compare baseline vs canary on errors, p50/p95 latency, and saturation.
- Alert SRE on gating failures and provide automated remediation playbooks.
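A baseline-versus-canary comparison can be as simple as a pair of threshold checks. The sketch below is one minimal way to express such a gate; the Metrics class, the canary_passes function, and the default thresholds (0.5 percentage points of extra errors, 10% extra p95 latency) are illustrative assumptions, not the behavior of Kayenta or any other tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    error_rate: float       # fraction of failed requests, 0.0 to 1.0
    p95_latency_ms: float


def canary_passes(baseline: Metrics, canary: Metrics,
                  max_error_rate_delta: float = 0.005,
                  max_latency_ratio: float = 1.10) -> bool:
    """Pass only if the canary is not meaningfully worse than the baseline."""
    error_ok = canary.error_rate - baseline.error_rate <= max_error_rate_delta
    latency_ok = canary.p95_latency_ms <= baseline.p95_latency_ms * max_latency_ratio
    return error_ok and latency_ok


baseline = Metrics(error_rate=0.002, p95_latency_ms=200.0)
print(canary_passes(baseline, Metrics(0.003, 210.0)))  # small regression in both signals
print(canary_passes(baseline, Metrics(0.020, 205.0)))  # error rate jumped
```

Comparing against a concurrently running baseline, rather than against fixed absolute limits, keeps the gate from failing merely because overall traffic conditions changed during the validation window.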

Trade-offs:
- GitOps with declarative manifests gives strong auditability but needs discipline around manifest updates.
- Argo CD suits Kubernetes-native setups; Spinnaker suits complex multi-cloud strategies.

This design ensures reproducible, auditable artifacts, safe gated rollouts, scheduled and manual canaries, and quick emergency rollback—aligned with SRE reliability goals.

Failure Detection and Automated Response (Easy, Technical)

Explain the difference between liveness and readiness probes in Kubernetes. As an SRE, how would you design and implement both for a stateless HTTP microservice that depends on a downstream cache and database? Include what each probe should check, failure modes to consider, appropriate HTTP status codes, and how Kubernetes reacts to probe failures. Also describe strategies for dependency degradation without restarting the pod.

Sample Answer

Liveness vs Readiness (short):
- Liveness: “Am I alive?” If it fails, K8s restarts the container (fixes stuck/deadlocked processes).
- Readiness: “Can I receive traffic?” If it fails, K8s removes the Pod from Service endpoints but does NOT restart it.

Design for a stateless HTTP microservice with cache + DB dependency:

What each probe should check
- Liveness (simple, fast):
  - Check process health: local app process responsive, event loop not blocked, memory/threads under thresholds.
  - Minimal check: HTTP GET /healthz that verifies the app's internal heartbeat and health flags (no remote IO).
  - Return 200 OK when alive; return a status of 400 or above (e.g., 500) otherwise, since the kubelet treats any code from 200 to 399 as success.
- Readiness (dependency-aware):
  - Endpoint: HTTP GET /ready that confirms the ability to serve requests:
    - Verify the cache (e.g., attempt a lightweight GET or ping to Redis with a short timeout).
    - Verify the DB is reachable (simple SELECT 1 with a short timeout).
    - Check migration/maintenance flags.
  - Return 200 OK when all essential dependencies are healthy; 503 Service Unavailable (or 429 for rate-limited degraded mode) when not ready.
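The decision logic behind the two endpoints can be sketched as plain functions, independent of any web framework. The names liveness_status and readiness_status, the 10-second heartbeat limit, and the shape of the checks mapping are assumptions made for this illustration; a real service would call these from its /healthz and /ready handlers and enforce timeouts inside each dependency check.

```python
from typing import Callable, Dict, Tuple


def liveness_status(last_heartbeat: float, now: float, max_age_seconds: float = 10.0) -> int:
    """200 if the app's internal heartbeat is fresh, 500 otherwise. No remote I/O."""
    return 200 if now - last_heartbeat <= max_age_seconds else 500


def readiness_status(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, Dict[str, bool]]:
    """Run each dependency check; 200 only if all pass, 503 otherwise."""
    results: Dict[str, bool] = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that raises (timeout, connection refused) counts as unhealthy.
            results[name] = False
    status = 200 if all(results.values()) else 503
    return status, results


def db_down() -> bool:
    raise TimeoutError("SELECT 1 timed out")  # simulated dependency failure


print(liveness_status(last_heartbeat=100.0, now=105.0))
print(readiness_status({"cache": lambda: True, "db": db_down}))
```

Returning the per-dependency results alongside the status code makes it easy to include them in the response body, so an engineer looking at a not-ready pod can see which dependency failed.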

Failure modes to consider
- Transient network blips: use short timeouts and retries in readiness logic; avoid flapping by tuning initialDelaySeconds, periodSeconds, and failureThreshold.
- Slow dependencies: fail readiness (remove the Pod from the LB) instead of triggering a liveness restart.
- Deadlocks/GC pauses: liveness should detect these and trigger a restart.
- Partial dependency loss (cache down but DB up): prefer graceful degradation, and fail readiness only for dependencies that have no fallback.

Kubernetes reactions
- Liveness probe failure -> the kubelet kills the container and restarts it (subject to restartPolicy).
- Readiness probe failure -> the Pod is removed from Endpoints; traffic stops but the Pod keeps running.
- Configure probe timings: give liveness a higher failureThreshold than readiness, so that a struggling Pod is taken out of rotation before it is restarted and brief stalls do not cause restart storms; keep the readiness probe responsive so traffic shifts away quickly.

Dependency degradation strategies without restarting
- Circuit breakers: open the circuit to the cache after repeated failures; switch to the DB or a degraded mode.
- Graceful degradation: serve stale data from a local in-memory cache or degrade features (read-only, disable noncritical endpoints).
- Bulkhead/fallbacks: route heavy operations to background jobs; return cached responses or an informative 503 with Retry-After.
- Feature flags: toggle expensive features off when dependencies are unhealthy.
- Observability & alerts: emit metrics and alert when readiness fails while liveness is OK, so degraded traffic does not go unnoticed.
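A circuit breaker in front of the cache is one way to degrade without restarting the pod. The sketch below is a minimal, single-threaded version under stated assumptions: the CircuitBreaker class, the get_with_fallback helper, and the defaults (3 failures, 30-second reset) are all hypothetical, and a production implementation would also need locking and metrics.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Opens after N consecutive failures; allows a trial call after a timeout."""

    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._threshold = failure_threshold
        self._reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow(self) -> bool:
        if self._opened_at is None:
            return True  # closed: calls flow normally
        # Open: permit a single trial once the reset timeout has elapsed (half-open).
        return self._clock() - self._opened_at >= self._reset_timeout

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self._threshold:
            self._opened_at = self._clock()


def get_with_fallback(breaker: CircuitBreaker, read_cache: Callable[[str], str],
                      read_db: Callable[[str], str], key: str) -> str:
    """Try the cache while the breaker allows it; otherwise go straight to the DB."""
    if breaker.allow():
        try:
            value = read_cache(key)
            breaker.record_success()
            return value
        except Exception:
            breaker.record_failure()
    return read_db(key)
```

Injecting the clock keeps the breaker testable with a fake time source, in line with the deterministic-setup advice for avoiding flaky tests. While the breaker is open, requests skip the failing cache entirely, so the pod stays ready and serves from the DB instead of timing out on every call.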

Practical config tips
- Keep probes lightweight and non-mutating.
- Use separate endpoints (/healthz, /ready) and internal checks with strict timeouts (shorter than the probe period).
- Tune initialDelaySeconds/timeoutSeconds/failureThreshold to match startup and transient conditions.

Automation and Scripting (Easy, Technical)

Explain GitOps and describe how operational automation and scripts should be integrated into a GitOps model. Cover repository layout for automation manifests, how automation triggers from repo changes, policy enforcement, and how to handle emergency manual changes safely.

Sample Answer

GitOps is the practice of using Git as the single source of truth for both desired system state (manifests, IaC) and the change workflow; an automated controller continuously reconciles cluster state to what’s in Git. For an SRE, this means operational automation must be declarative, auditable, and driven by Git change events.

Repository layout for automation manifests
- Use separate repos or well-scoped directories: infrastructure/ (Terraform, cloud resources), platform/ (cluster-level manifests), apps/<team>/<env>/ (app manifests per environment).
- Keep generated artifacts out of source; store reusable modules/Helm charts in charts/ or modules/.
- Use branch-per-environment or a combination: main for prod, release branches for staging, feature branches for changes.

How automation triggers from repo changes
- A GitOps operator (Argo CD/Flux) watches Git and applies diffs automatically when commits merge to the target branch.
- CI pipelines validate changes (lint, unit tests, security scans) on PRs; only merged/approved commits reach the operator.
- For operational scripts (runbooks, automation tasks), expose them as declarative K8s Jobs/CRs or as pipeline definitions; a commit triggers CI/CD to execute them or to publish job definitions for the operator.
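The reconcile step at the heart of a GitOps controller amounts to diffing desired state against live state. The sketch below models both as plain dictionaries; plan_reconcile and the resource names are hypothetical, and real controllers such as Argo CD or Flux work on full Kubernetes objects with much richer comparison rules.

```python
from typing import Dict, List, Tuple


def plan_reconcile(desired: Dict[str, dict], live: Dict[str, dict]) -> List[Tuple[str, str]]:
    """Return the (action, resource) pairs needed to make live match desired."""
    actions: List[Tuple[str, str]] = []
    for name, spec in desired.items():
        if name not in live:
            actions.append(("create", name))
        elif live[name] != spec:
            actions.append(("update", name))  # drift, e.g. from a manual change
    for name in live:
        if name not in desired:
            actions.append(("delete", name))  # pruning resources absent from Git
    return sorted(actions)


desired = {"api": {"replicas": 3}, "worker": {"replicas": 2}}          # what Git says
live = {"api": {"replicas": 5}, "debug-pod": {}}                        # what the cluster has
print(plan_reconcile(desired, live))
```

In this example the manually scaled api and the ad-hoc debug-pod both show up as drift, which is why an emergency change made directly on the cluster is undone on the next sync unless it is also committed to Git.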

Policy enforcement
- Enforce PR-based workflows with required reviewers and automated checks.
- Apply admission policies (OPA/Gatekeeper) in-cluster and pre-merge policy as code (conftest/OPA) in CI to block non-compliant changes.
- Use signed commits or require signed tags for production promotion.

Handling emergency manual changes safely
- Allow an emergency path but keep it auditable: use a protected branch with an emergency approver process (signed PRs or a dedicated emergency workflow).
- If you must change cluster state directly, pause automatic sync for the affected application so the controller does not undo the change, then have the engineer follow up with a Git commit that reflects the emergency state so Git remains the source of truth.
- Prefer an automated “out-of-band” toggle: the controller accepts a temporary override CR with a TTL, which must be reviewed and then reconciled back into Git.
- Post-incident: create a Git-backed remediation commit and run a postmortem to remove ad-hoc work and harden CI/policies.

Why this works
- Declarative manifests plus a Git review pipeline give auditability, reproducibility, and safer rollbacks.
- CI gates and in-cluster policies provide defense in depth.
- An auditable emergency path preserves reliability while keeping Git as the canonical state.

Capacity Planning and Resource Optimization (Hard, Technical)

Explain how buffer pool sizing in an OLTP database affects read latency and IO amplification when the working set is slightly larger than available RAM. Using cache-miss curves and cost modeling, propose a method to choose buffer size that minimizes total cost (memory cost + IO cost), and describe experiments to measure the 'knee' in the hit-rate curve.

Sample Answer

Situation: You manage an OLTP service whose working set slightly exceeds RAM; you must choose buffer-pool size to minimize combined memory cost and IO cost while keeping read latency SLOs.

Concepts / intuition:
- Hit rate vs buffer size is a concave increasing curve with a "knee" where additional memory yields diminishing returns.
- When working set > RAM, miss rate (and IO) rises non-linearly; IO amplification occurs because misses trigger more reads, prefetches, checkpoint-induced churn, and filesystem cache interactions.
- Read latency = hit_latency * hit_rate + miss_latency * miss_rate; miss_latency includes IO queueing and amplification.

Cost model:
- Let M = buffer size, C_mem = cost per GB (annualized), C_io = cost per IO (or per MB read).
- Measure or estimate hit_rate(M) from the cache-miss curve.
- TotalCost(M) = C_mem * M + C_io * IO_rate(M)
- IO_rate(M) = request_rate * (1 - hit_rate(M)) * avg_IOs_per_miss (captures amplification).

Method to choose M:
1. Empirically measure hit_rate(M) for a range around the expected knee (see experiments).
2. Fit a smooth function (e.g., logistic or piecewise power-law) to hit_rate(M).
3. Compute TotalCost(M) and find argmin M subject to latency/SLO constraints (e.g., enforce miss_rate ≤ threshold).
4. If there are multiple minima due to discrete pricing, choose the minimal M satisfying both cost and SLO.
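The cost model and the constrained argmin can be sketched directly in Python. Everything numeric here is an assumption for illustration: the measured hit-rate table, the unit costs, and the request rate are made up, and both cost terms are assumed to be normalized to the same time period so they can be added.

```python
from typing import Callable, Iterable, Optional


def total_cost(m_gb: float, hit_rate: Callable[[float], float], request_rate: float,
               ios_per_miss: float, cost_per_gb: float, cost_per_io: float) -> float:
    """TotalCost(M) = C_mem * M + C_io * request_rate * (1 - hit_rate(M)) * avg_IOs_per_miss."""
    io_rate = request_rate * (1.0 - hit_rate(m_gb)) * ios_per_miss
    return cost_per_gb * m_gb + cost_per_io * io_rate


def best_buffer_size(candidates: Iterable[float], hit_rate: Callable[[float], float],
                     max_miss_rate: float, **cost_params: float) -> Optional[float]:
    """Cheapest candidate size whose miss rate satisfies the SLO constraint."""
    feasible = [m for m in candidates if 1.0 - hit_rate(m) <= max_miss_rate]
    if not feasible:
        return None
    return min(feasible, key=lambda m: total_cost(m, hit_rate, **cost_params))


# Hypothetical measurements: buffer size in GB -> observed hit rate.
measured = {32: 0.80, 48: 0.92, 64: 0.97, 80: 0.985, 96: 0.99}
params = dict(request_rate=1000.0, ios_per_miss=2.0, cost_per_gb=1.0, cost_per_io=0.2)

choice = best_buffer_size(measured, measured.__getitem__, max_miss_rate=0.05, **params)
strict = best_buffer_size(measured, measured.__getitem__, max_miss_rate=0.012, **params)
print(choice, strict)
```

With these made-up numbers the unconstrained optimum and the 5% miss-rate SLO agree on 64 GB, while tightening the SLO to 1.2% forces the largest size even though it costs more, which is the kind of trade-off step 3 is meant to surface.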

Experiments to find the knee:
- Controlled workload replay (real traffic or a representative trace) at production request rate.
- Sweep M from low to slightly > working set in small steps (e.g., 1–5% increments).
- For each M, run long enough to reach steady state (several minutes to an hour depending on workload) and measure: hit_rate, miss latency, IO/sec, IO bytes, queue depth, and CPU.
- Plot hit_rate vs M and the derivative d(hit_rate)/dM; knee ≈ the M where the derivative drops below a threshold (e.g., 10% of max slope) or where the curvature is greatest (the second derivative is most negative).
- Validate amplification by measuring avg_IOs_per_miss (and bytes read per miss) and latency tail percentiles.
- Run sensitivity tests: higher concurrency, background compaction/checkpoint activity, and different access skew to ensure knee stability.
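The slope-threshold rule for locating the knee can be applied to sweep results with finite differences. This is a minimal sketch assuming strictly increasing sizes and noise-free measurements; find_knee and the sample data are hypothetical, and real sweeps would usually be smoothed or fitted first.

```python
from typing import Optional, Sequence


def find_knee(sizes: Sequence[float], hit_rates: Sequence[float],
              slope_fraction: float = 0.10) -> Optional[float]:
    """First size after which the marginal hit-rate gain per unit of memory
    falls below slope_fraction of the steepest observed slope."""
    if len(sizes) != len(hit_rates) or len(sizes) < 3:
        raise ValueError("need at least three (size, hit_rate) points of equal length")
    slopes = [
        (hit_rates[i + 1] - hit_rates[i]) / (sizes[i + 1] - sizes[i])
        for i in range(len(sizes) - 1)
    ]
    threshold = slope_fraction * max(slopes)
    for i, slope in enumerate(slopes):
        if slope < threshold:
            return sizes[i]
    return None  # the sweep never flattened; extend it to larger sizes


# Hypothetical sweep: buffer size in GB and measured hit rate at steady state.
sizes = [32, 48, 64, 80, 96]
hit_rates = [0.80, 0.92, 0.97, 0.985, 0.99]
print(find_knee(sizes, hit_rates))
```

A None result is itself informative: it means the sweep stopped before the curve flattened, so the experiment should be extended before choosing a size.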

Operationalize:
- Automate periodic sweeps in staging and trigger re-evaluation when workload or cost metrics change.
- Use the model to produce a recommended buffer setting and expected latency/cost trade-offs; route decisions through SLO governance.

