Spotify Site Reliability Engineer (Mid-Level) Interview Preparation Guide

Site Reliability Engineer (SRE)

Spotify

Mid Level

7 rounds

Updated 6/12/2026

Spotify's Site Reliability Engineer interview process is a rigorous, multi-stage evaluation designed to assess both technical depth and operational excellence. The process combines phone-based technical assessments with comprehensive on-site interviews covering infrastructure automation, system design, incident response, and cultural alignment. For mid-level candidates, the emphasis is on demonstrated experience building reliable systems, strong collaboration skills, and the ability to own projects end-to-end with some mentorship of junior team members.

Interview Rounds

Recruiter Screening

40 min · 4 focus topics · behavioral

What to Expect

Your initial contact with Spotify's recruitment team via video or phone call. This 30-45 minute conversation focuses on understanding your background, career motivations, and alignment with Spotify's culture and the SRE role. The recruiter will discuss the position, team dynamics, and expectations. This is your opportunity to demonstrate communication skills, enthusiasm for reliability engineering, and understanding of what makes Spotify's infrastructure unique at scale.

Tips & Advice

Research Spotify's engineering culture and recent infrastructure initiatives before the call. Prepare 2-3 concrete examples of projects where you've improved system reliability or reduced operational overhead. Be specific about your role and impact—use metrics when possible (e.g., 'reduced incident response time from 30 minutes to 5 minutes'). Clarify expectations around on-call responsibilities and incident response. Demonstrate curiosity about Spotify's tech stack and challenges at scale. The recruiter is gauging communication clarity and cultural fit, so be authentic and enthusiastic.

Focus Topics

Concrete Project Examples

Prepare 2-3 specific examples from your SRE work that demonstrate impact: infrastructure improvements, automation projects that reduced toil, incident response improvements, or capacity planning initiatives. Have metrics ready (downtime reduced, cost savings, response time improvements).

Communication & Collaboration Skills

Be clear and concise when explaining technical concepts. Demonstrate how you communicate with development teams, incident commanders during outages, and stakeholders about operational changes.

Spotify Culture & Values Alignment

Familiarity with Spotify's engineering culture, their approach to distributed systems, and how they balance speed with reliability. Understand their philosophy around blameless incident response and continuous improvement.

Career Motivation & SRE Path

Be clear about why you're interested in reliability engineering and in joining Spotify specifically. Be prepared to discuss your journey into SRE, what excites you about the role, and how your past experience aligns with Spotify's infrastructure needs.

Technical Phone Screen

60 min · 6 focus topics · technical

What to Expect

A focused 60-minute phone interview assessing your Linux systems knowledge, operational understanding, and ability to solve infrastructure problems. An experienced Spotify engineer will ask questions about system fundamentals, shell scripting, networking, and basic infrastructure operations. You may be asked to explain how you'd approach operational scenarios or discuss your hands-on experience with systems administration and automation.

Tips & Advice

Brush up on Linux command-line proficiency and system administration fundamentals. Be ready to discuss your experience with shell scripting (bash) for automation. Understand networking basics (TCP/IP, DNS, load balancing concepts) and how they apply to distributed systems. When answering questions, walk through your thinking process aloud and ask clarifying questions if scenarios are vague. For mid-level candidates, demonstrating practical hands-on experience and the ability to architect simple automation solutions matters more than theoretical depth. Have examples ready of scripts or tools you've built to reduce operational toil.

Focus Topics

Container & Orchestration Basics

Foundational knowledge of containerization (Docker concepts), container image management, and container orchestration platforms (particularly Kubernetes). Understand how containers simplify deployment and enable infrastructure automation.

Incident Response Basics

Experience responding to production incidents. Understand escalation procedures, communication during incidents, and the importance of detailed observation and hypothesis testing. Be ready to discuss a challenging incident you've handled.

System Performance & Troubleshooting

Ability to diagnose performance bottlenecks using tools like top, vmstat, iostat, and perf. Understand CPU, memory, disk I/O, and network metrics. Know how to identify and resolve common performance issues like high context switching, memory leaks, or disk saturation.

Networking Fundamentals

Understanding of TCP/IP stack, DNS resolution and caching, HTTP/HTTPS protocols, load balancing concepts, routing, and firewalls. Know how to diagnose network issues using tools like ping, traceroute, netstat, and tcpdump. Understand the differences between TCP and UDP and when each is appropriate.

Linux Systems & Administration

Deep knowledge of Linux operating systems including process management, file systems, permissions, user management, networking configuration, and system monitoring. Be comfortable with commands like ps, top, iostat, netstat, and systemctl, and with systemd as the service manager. Understand how processes, threads, and resource management work at the OS level.

Shell Scripting & Automation

Practical experience writing bash scripts for operational tasks. Be able to write simple scripts for file processing, log analysis, system monitoring, and deployment automation. Understand common scripting patterns and best practices. Be ready to discuss limitations of shell scripts and when to choose other languages.

System Design Phone Screen

60 min · 5 focus topics · system design

What to Expect

A 60-minute technical phone interview focused on your ability to design reliable, scalable systems. You'll be presented with infrastructure design problems (e.g., 'Design a monitoring and alerting system for Spotify', 'How would you architect a deployment pipeline for high-frequency releases?'). This round assesses your understanding of distributed systems principles, trade-offs between reliability and complexity, scalability patterns, and how to design systems that handle Spotify's scale. You're expected to think through design choices, ask clarifying questions, and discuss trade-offs clearly.

Tips & Advice

Start by asking clarifying questions to understand the problem scope, scale, and constraints. Work through the design systematically: begin with simple architecture, identify bottlenecks, and incrementally add components (monitoring agents, data collectors, alert processors, storage backends). Draw diagrams if possible (even ASCII art over phone). Discuss trade-offs explicitly: consistency vs. availability, latency vs. durability, cost vs. performance. For mid-level candidates, demonstrate solid understanding of reliability patterns, scaling techniques, and the ability to design systems that handle millions of events. Reference real-world examples (e.g., how you'd apply similar patterns from your experience). Be ready to justify design decisions and adapt your design when given new constraints.

Focus Topics

SLOs, SLIs, and Error Budgets

Understanding Service Level Objectives (SLOs) and how to design systems to meet them. Know how SLIs (Service Level Indicators) measure success and how error budgets guide operational decisions. Understand how to balance feature velocity with reliability.
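The error-budget arithmetic is worth being able to do on the spot. The following is a minimal sketch, assuming an availability-style SLO expressed as a fraction (for example 0.999); the function names are illustrative and not taken from any Spotify tooling.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of a request-based error budget still unspent.

    1.0 means untouched, 0.0 means fully spent, negative means overspent.
    """
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures
```

A 99.9% target over 30 days leaves roughly 43 minutes of downtime. A team that has spent most of its budget would typically slow feature rollouts until the budget recovers.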

Reliability Patterns & Fault Tolerance

Understanding patterns for building fault-tolerant systems: redundancy, failover, circuit breakers, bulkheads, graceful degradation, and retry strategies. Know how to design systems that degrade gracefully under load or partial failure instead of letting failures cascade.
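A circuit breaker is one of the patterns above that interviewers often ask candidates to sketch. The version below is a simplified illustration, assuming a consecutive-failure threshold and a fixed reset timeout; the class name, parameters, and the injectable `clock` are hypothetical choices for this example, and it is not thread-safe.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is rejected immediately."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, func: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, so let this one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Failing fast while the circuit is open protects the struggling dependency from extra load and keeps caller threads from piling up, which is how the pattern limits cascading failures.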

Trade-offs in System Design

Ability to articulate trade-offs in architectural decisions: consistency vs. availability, latency vs. durability, cost vs. performance, operational simplicity vs. feature richness. Demonstrate that you think pragmatically about these trade-offs.

Building Scalable Infrastructure

Principles for scaling systems to handle 10x or 100x traffic growth. Understand horizontal vs. vertical scaling, load distribution, database partitioning/sharding, caching strategies, and how to identify and address bottlenecks. Discuss real-world capacity planning.
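When sharding comes up, consistent hashing is a common follow-up because it limits how many keys move when a node joins or leaves. The ring below is a rough sketch with virtual nodes; the class name, the `vnodes` default, and the use of SHA-256 are assumptions made for illustration only.

```python
import bisect
import hashlib
from typing import Iterable, List, Tuple


class HashRing:
    def __init__(self, nodes: Iterable[str] = (), vnodes: int = 100) -> None:
        self._vnodes = vnodes
        self._ring: List[Tuple[int, str]] = []  # sorted (point, node) pairs
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(key: str) -> int:
        # First 8 bytes of SHA-256 give a stable 64-bit position on the ring.
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def add(self, node: str) -> None:
        for i in range(self._vnodes):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove(self, node: str) -> None:
        self._ring = [(point, n) for point, n in self._ring if n != node]

    def lookup(self, key: str) -> str:
        if not self._ring:
            raise LookupError("ring is empty")
        # First ring point at or after the key's hash, wrapping around at the end.
        idx = bisect.bisect_right(self._ring, (self._hash(key), ""))
        return self._ring[idx % len(self._ring)][1]
```

Removing a node only remaps the keys that node owned; keys on the other nodes stay where they were, which keeps cache and shard churn small during scaling events.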

Designing Monitoring & Alerting Systems

Architecture for collecting metrics from thousands of services, storing time-series data efficiently, processing alerts, and notifying on-call engineers. Consider data collection mechanisms (push vs. pull), storage backends, query performance, retention policies, and alert routing strategies. Understand common tools and their trade-offs.

On-Site Round 1: Infrastructure & Automation

60 min · 5 focus topics · technical

What to Expect

A 60-minute on-site interview focused on infrastructure automation, Infrastructure as Code (IaC), and your ability to build tools that reduce operational toil. You may be asked to write code (in your preferred language or Python/Go), discuss past automation projects, or solve infrastructure automation problems. The interviewer will assess your software engineering practices applied to infrastructure: code quality, testability, documentation, and ability to build maintainable systems.

Tips & Advice

Be prepared to write code in a language you're comfortable with. Focus on code clarity, error handling, and practical solutions over clever code. If asked to code infrastructure tools, demonstrate good practices: modularity, testability, logging, and handling edge cases. Discuss your approach to Infrastructure as Code—templates, configuration management, versioning, and testing. Walk through a real infrastructure automation project you've built: what problems it solved, how you designed it, challenges you faced, and lessons learned. For mid-level candidates, the expectation is that you can architect solutions and write clean, maintainable code, not necessarily perfectly optimized code. Be ready to discuss trade-offs in your design decisions.

Focus Topics

Programming for Operations

Writing code for operational tasks with proper error handling, logging, monitoring, and testing. Discuss code patterns that make operational code reliable and maintainable. Know when to use different languages (Python for rapid iteration, Go for performance-critical tools).

Deployment Automation & Orchestration

Experience with deployment pipelines, CI/CD systems, and orchestrating complex deployments. Understand canary deployments, blue-green deployments, and rollback strategies. Be ready to discuss how you handle deployments at scale and minimize downtime.
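A canary rollout needs an explicit promote-or-rollback rule. The function below is one possible sketch that compares the canary's error rate with the baseline's; every threshold (`max_ratio`, `min_requests`, `error_floor`) is a made-up default for illustration, and real pipelines usually look at latency and saturation as well.

```python
def canary_verdict(baseline_errors: int, baseline_total: int,
                   canary_errors: int, canary_total: int,
                   max_ratio: float = 2.0, min_requests: int = 500,
                   error_floor: float = 0.001) -> str:
    """Return "wait", "promote", or "rollback" for a canary deployment."""
    # Not enough traffic yet to judge either side.
    if canary_total < min_requests or baseline_total < min_requests:
        return "wait"
    baseline_rate = baseline_errors / baseline_total
    canary_rate = canary_errors / canary_total
    # Tolerate up to max_ratio times the baseline error rate, but never
    # less than a small absolute floor so a zero-error baseline is not
    # an impossible bar.
    allowed = max(baseline_rate * max_ratio, error_floor)
    return "promote" if canary_rate <= allowed else "rollback"
```

Returning "wait" on low traffic avoids promoting or rolling back on statistical noise, a trade-off worth calling out when discussing rollback strategy.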

Configuration Management

Strategies for managing configurations across many systems. Understand the difference between infrastructure configuration and application configuration. Discuss secrets management, environment-specific configurations, and configuration validation.

Automation Frameworks & Tools

Experience building automation using tools like Ansible, Chef, Puppet, or custom frameworks. Understand idempotency, error handling, and orchestrating multi-step deployments. Be comfortable writing automation code in languages like Python, Go, or bash.
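Idempotency is easiest to explain with a small example: running the same step twice should leave the system in the same state and report no change the second time. The helper below is a minimal sketch in the spirit of a configuration-management "line in file" task; the function name is hypothetical and it ignores concerns such as file locking and atomic writes.

```python
from pathlib import Path


def ensure_line(path: Path, line: str) -> bool:
    """Make sure `line` is present in the file; return True only if the file changed."""
    existing = path.read_text().splitlines() if path.exists() else []
    if line in existing:
        return False  # already in the desired state, so do nothing
    existing.append(line)
    path.write_text("\n".join(existing) + "\n")
    return True
```

Reporting whether anything changed is what lets an orchestrator decide if dependent steps, such as a service restart, need to run.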

Infrastructure as Code (IaC)

Experience with tools like Terraform, CloudFormation, or Ansible. Understand how to define infrastructure declaratively, version control it, test it, and apply changes safely. Know patterns for managing environments, secrets, and configurations. Discuss how IaC reduces manual errors and enables reproducible infrastructure.

On-Site Round 2: System Design & Reliability Architecture

60 min · 5 focus topics · system design

What to Expect

A 60-minute on-site interview diving deep into system design and architectural thinking. You'll discuss how to architect reliable systems at Spotify's scale. The interviewer may present a complex infrastructure design challenge: 'Design a distributed cache system for Spotify's music catalog', 'How would you architect a system to handle real-time streaming to millions of users?', or similar. This round evaluates your ability to think about systems holistically, understand trade-offs, discuss scalability and reliability patterns, and explain your design rationale clearly.

Tips & Advice

Take time to understand the requirements and constraints before jumping into design. Ask clarifying questions about scale, latency requirements, consistency requirements, and failure scenarios. Start with a simple design and incrementally add components as you identify bottlenecks. Use diagrams to communicate your design clearly. For each major component, discuss why it exists, what it does, and how it contributes to reliability and performance. Discuss trade-offs explicitly: Is this design consistent or available? Synchronous or asynchronous? Push or pull? Discuss real-world examples from your experience where similar patterns apply. Be prepared to handle 'what if' scenarios: 'What if this component fails?', 'How do you handle 10x traffic growth?'. For mid-level SREs, demonstrate solid understanding of distributed systems principles and ability to design systems that balance multiple concerns.

Focus Topics

Database Scaling Strategies

Approaches to scaling databases: replication, sharding, read replicas, and handling distributed transactions. Understand the trade-offs between different approaches and when each is appropriate. Discuss backup and disaster recovery strategies.

High Availability Patterns

Architectural patterns for achieving high availability: active-active replication, automated failover, circuit breakers, retry logic with exponential backoff. Understand how to design systems that minimize downtime and recover quickly from failures.
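Retry logic with exponential backoff is simple to get subtly wrong, so a short sketch helps. The version below uses capped exponential delays with full jitter; the function names and defaults are assumptions for this example, and `sleep` and `rng` are injectable only to make the behavior easy to test.

```python
import random
import time
from typing import Any, Callable, List, Tuple, Type


def backoff_delays(base: float = 0.1, cap: float = 5.0, attempts: int = 5) -> List[float]:
    """Upper bound for each retry delay: base * 2**n, capped at `cap`."""
    return [min(cap, base * 2 ** n) for n in range(attempts)]


def retry(func: Callable[[], Any], attempts: int = 5, base: float = 0.1,
          cap: float = 5.0, sleep: Callable[[float], None] = time.sleep,
          rng: Callable[[], float] = random.random,
          retry_on: Tuple[Type[BaseException], ...] = (Exception,)) -> Any:
    """Call func, retrying on failure with capped exponential backoff and full jitter."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for n in range(attempts):
        try:
            return func()
        except retry_on:
            if n == attempts - 1:
                raise  # out of attempts: surface the last error
            # Full jitter: sleep a random fraction of the capped exponential delay
            # so that many clients do not retry in lockstep.
            sleep(rng() * min(cap, base * 2 ** n))
```

Jitter matters at scale: without it, clients that failed together retry together and can re-create the overload that caused the failure.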

Caching & Content Delivery Strategy

Different caching layers (in-process, Redis, CDN), cache invalidation strategies, and patterns like write-through, write-back, and write-around. Understand when caching helps and when it adds complexity. Discuss CDN architecture for global content distribution.

Distributed Systems Design

Understanding of distributed system principles: CAP theorem, eventual consistency, fault tolerance, and communication patterns. Be comfortable designing systems with multiple independent components that communicate over the network. Understand challenges like network partitions and clock synchronization.

Scalability & Performance Architecture

Designing systems that scale horizontally to handle massive load. Understand partitioning strategies, load distribution, connection pooling, and resource management. Know how to identify and eliminate single points of failure and bottlenecks.

On-Site Round 3: Incident Response & Operations

60 min · 5 focus topics · behavioral

What to Expect

A 60-minute on-site interview focused on operational excellence, incident response, and troubleshooting. You may be asked behavioral questions about incidents you've handled, presented with complex troubleshooting scenarios, or asked how you'd approach operational challenges. The interviewer assesses your problem-solving approach, how you think under pressure, communication during incidents, and your ability to learn from failures. This round emphasizes practical operational skills and judgment.

Tips & Advice

Prepare 3-4 detailed incident examples using the STAR method: Situation (what was the incident?), Task (what were you responsible for?), Action (what did you do?), Result (what was the outcome?). Focus on complex incidents where your problem-solving and collaboration made a difference. Include metrics: How did you detect it? How quickly was it resolved? What did you learn? Be ready to discuss a past incident in detail: timeline, what you tried, dead ends you pursued, how you finally resolved it. Discuss your approach to post-incident reviews (blameless, focusing on systems improvements). Talk about how you balance incident response with preventing similar incidents. For mid-level candidates, show initiative in taking on complex troubleshooting and mentoring others during incidents.

Focus Topics

Capacity Planning & Resource Management

Understanding system resource usage patterns, forecasting growth, and ensuring sufficient capacity. Know how to balance cost with reliability. Discuss strategies for detecting resource constraints early.
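A basic forecast is often enough to start a capacity conversation. The sketch below fits a straight line to usage samples with ordinary least squares and estimates how long until a capacity limit is reached; the function name and the linear-growth assumption are illustrative, and real forecasts usually account for seasonality and launches.

```python
from typing import List, Optional, Tuple


def days_until_full(samples: List[Tuple[float, float]], capacity: float) -> Optional[float]:
    """Estimate days from the latest sample until usage reaches `capacity`.

    `samples` is a list of (day, used) pairs. Returns None when usage is
    flat or shrinking, because a linear trend never reaches the limit.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        raise ValueError("samples must span more than one day")
    # Least-squares slope (growth per day) and intercept.
    slope = sum((x - mean_x) * (y - mean_y) for x, y in samples) / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    full_day = (capacity - intercept) / slope
    last_day = max(x for x, _ in samples)
    return max(0.0, full_day - last_day)
```

With usage growing by 10 units a day from 100 and a limit of 200, the last sample at day 3 leaves about 7 days of headroom, which is the kind of number that turns into an early-warning alert.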

System Monitoring & Observability

Designing monitoring and observability systems that help detect and diagnose problems. Understand metrics, logs, traces, and how they work together. Know what to monitor and what alert thresholds make sense. Discuss alerting best practices (avoiding alert fatigue).

Post-Incident Reviews

Conducting blameless post-incident reviews focused on systems improvements rather than individual blame. How to document incidents, identify action items, and drive systemic improvements. Understand how post-incident reviews support organizational learning.

Incident Response & Troubleshooting

Systematic approach to diagnosing and resolving production incidents. Know how to identify the blast radius, isolate the problem, implement temporary mitigations, and work toward permanent solutions. Understand escalation procedures and communication protocols. Be ready to discuss complex incidents you've handled.

Root Cause Analysis

Systematic approach to understanding why incidents occurred. Techniques for drilling down from symptoms to underlying causes. Understanding the difference between immediate causes and systemic issues. Learn to ask 'why' multiple times to find root causes and systemic improvements.

On-Site Round 4: Behavioral & Spotify Values

60 min · 5 focus topics · behavioral

What to Expect

A 60-minute on-site interview focused on behavioral assessment and cultural fit. You'll be asked about how you work in teams, handle conflicts, approach learning and growth, and align with Spotify's values. This is your opportunity to demonstrate that you're a strong collaborator, can mentor others, and embrace Spotify's culture of autonomy, experimentation, and learning. The interviewer is assessing whether you'll thrive in Spotify's environment and contribute positively to team dynamics.

Tips & Advice

Research Spotify's values and culture thoroughly before the interview. Common themes include autonomy, experimentation, learning from failure, collaboration, and ownership. Prepare 3-4 examples using the STAR method that demonstrate these values: times you took ownership, learned from failures, collaborated effectively, mentored someone, or made decisions with incomplete information. Use Spotify's language and values when describing your examples. Be specific about metrics and outcomes. Show curiosity about Spotify's approach to these values. For mid-level candidates, demonstrate mentorship and cross-functional collaboration. Discuss how you balance operational excellence with enabling others to learn. Be ready to discuss a time you failed and what you learned. Be genuine and authentic—interviewers can spot if you're just saying what you think they want to hear.

Focus Topics

Communication & Influence

Clear communication of technical concepts to both technical and non-technical audiences. Ability to influence without direct authority. Examples of explaining complex problems to stakeholders or driving changes through persuasion.

Growth Mindset & Learning

Demonstrating curiosity and commitment to continuous learning. Examples of learning new technologies, adapting to changing requirements, or growing in your role. Discuss how you mentor junior team members or help others grow.

Problem-Solving & Decision-Making

Your approach to complex problems with incomplete information. Examples of decisions you've made under uncertainty, how you weigh trade-offs, and how you involve others in decision-making. Show comfort with pragmatism over perfection.

Teamwork & Collaboration

Demonstrating strong collaboration skills: working effectively with development teams, product managers, and other SREs. Examples of breaking down silos, improving communication, or bridging different teams. Discuss how you approach disagreements and build consensus.

Spotify Values & Culture Fit

Understanding Spotify's engineering culture and values including autonomy (empowering teams to make decisions), experimentation (testing ideas and learning from failures), collaboration (breaking down silos), and ownership (taking responsibility for outcomes). Be ready to discuss how your approach aligns with these values.

Frequently Asked Site Reliability Engineer (SRE) Interview Questions

Capacity Planning and Resource Optimization · Hard · Technical

As a senior SRE, propose a capacity governance model across several engineering teams that controls resource quotas, budgets, and SLO-driven capacity decisions. Include policy for reserved capacity requests, approval flows for large provisioning, automated enforcement using IaC, exception handling, and metrics to track adherence and effectiveness over time.

Sample Answer

Requirements & principles:
- Enforce predictable cost, avoid noisy-neighbor outages, and align capacity to SLOs and business priorities.
- Principles: SLO-driven allocation, tiered approval, automation-first (IaC), auditability, minimal exceptions.

High-level model:

1. Capacity units & quotas
- Define standard capacity units (CU): CPU/RAM/IO/network/cluster nodes per service.
- Each team gets a baseline quota per environment (dev/stage/prod) tied to historical usage plus a safety buffer.

2. Budget & SLO alignment
- Map each service to an SLO class (Critical/Important/BestEffort). The SLO class determines the buffer percentage and the priority for scaling.
- Budgets are expressed as monthly dollar amounts and CU-hour caps. Chargeback/showback ties consumption to team budgets.

3. Reserved capacity & approval flow
- Reserved requests are submitted via a catalog: specify CU, duration, reason (traffic, launch), linked SLO, and risk assessment.
- Small requests (< X CU or cost Y) are auto-approved if they are within budget and the SLO justification matches.
- Large requests require a review board (SRE + Finance + Product) with a 48-hour SLA for a decision, and must include a rollback plan and load tests.

4. Automated enforcement using IaC
- All quotas, limits, and provisioning are defined in versioned IaC (Terraform/Helmfiles).
- CI gating: PRs that change quotas run policy checks (OPA/Rego) and simulated cost checks; merging triggers an automated apply.
- Runtime admission control (e.g., OPA Gatekeeper on Kubernetes) enforces namespace ResourceQuota and limits; the cluster autoscaler respects reserved node pools.

5. Exceptions & incident handling
- Timeboxed exceptions go through ticketing with a TTL and auto-expiry, and require a retrospective review within 7 days.
- Emergency fast path: the on-call SRE can grant temporary capacity, followed by an immediate postmortem and formal approval within 72 hours.

6. Metrics & governance tracking
- Tracked daily/weekly: quota utilization (%), budget burn rate, SLO error budget burn, number of exceptions, approval latency, autoscaler effectiveness, cost per CU.
- Dashboards plus a monthly governance review: triage hotspots, reallocate quotas, adjust SLO buffers.

Why this works:
- SLO-driven prioritization ensures capacity follows reliability needs, and budgets control cost.
- IaC and admission controls ensure consistent enforcement.
- Clear approval and exception flows preserve agility while maintaining governance.
- Continuous metrics close the loop for iterative policy tuning.

Blameless Postmortem and Organizational Learning · Medium · Technical

You were on call when a new deploy caused the primary database to exceed connection limits and the service degraded for three hours. Describe step-by-step how you would run the postmortem: how you'd collect evidence, structure the timeline, identify root cause versus contributing factors, list mitigations, and assign action items across teams.

Sample Answer

Situation: A new deploy caused the primary DB to exceed connection limits and the service degraded for three hours while I was on call.

Postmortem process (step-by-step):

1. Collect evidence (first 24 hours)
- Gather timestamps of the deploy, alerts, and on-call actions from PagerDuty/Slack.
- Export DB metrics (connections, waiting/active queries), host metrics (CPU/memory), and application metrics (request rates, error rates, latency) from Prometheus/Datadog for ±2 hours around the incident.
- Pull slow query logs, DB error logs, connection pool logs, and recent schema/migration/change diffs from the deploy.
- Retrieve traces from APM (e.g., Jaeger/New Relic) and relevant app logs (structured logs with request IDs).
- Record versions, config files, and feature flags changed.

2. Build a precise timeline
- Create a minute-level timeline: deploy start/finish, first alert, on-call responses, mitigation steps (rollback, scale up), DB symptoms, and recovery confirmation.
- Annotate each event with source evidence (log lines, metric graphs, screenshots).

3. Analyze causes
- Root cause: the deployed change increased per-request DB connections (e.g., removed pooling or increased concurrency), causing simultaneous connection spikes that hit the DB's max_connections.
- Contributing factors: a low DB connection limit, no circuit breaker or backpressure in the app, no deploy smoke tests for connection behavior, monitoring alerts that fired only after the threshold was breached, and no automatic failover or connection queueing.

4. Distinguish root vs. contributing
- The root cause must directly explain the trigger (the code/config introduced in the deploy).
- Contributing factors are systemic weaknesses that permitted escalation and prolonged recovery.

5. Mitigations (immediate & long-term)
- Immediate (during the incident): roll back the offending deploy, temporarily raise the DB connection limit, and enable request queuing for heavy endpoints.
- Short-term (1–2 weeks): add pre-deploy smoke tests validating connection usage, tighten the deploy checklist to include a DB impact review, and tune DB max_connections against a capacity plan.
- Long-term (1–3 months): implement or enforce connection pooling libraries across services, add circuit breakers/backpressure, add automatic scaling for read replicas and connection proxies (PgBouncer), and add alerting on the rate of change of connections.

6. Action items with owners & deadlines
- Dev team that pushed the change: revert the code and open a PR to fix pooling; owner: lead backend engineer; due: 3 days.
- SRE: add a Prometheus alert for connection spikes and rate of change; add a dashboard with runbooks; owner: SRE on-call pair; due: 48 hours.
- DBA/Infra: evaluate and safely raise DB max_connections and deploy PgBouncer on replicas; owner: DBA lead; due: 2 weeks.
- QA: add a smoke test to CI that simulates high concurrency and measures DB connections; owner: QA lead; due: 1 week.
- Product/Release: update the deploy checklist to require DB-impact sign-off for changes touching DB code; owner: release manager; due: next sprint.

7. Impact and follow-up
- Quantify SLO impact (errors, latency, downtime minutes) and error budget burn.
- Schedule a blameless review meeting to walk through the timeline, confirm the root cause, and commit owners to action items.
- Track actions in the ticketing system; reopen the postmortem if metrics show recurrence.

Lessons learned: enforce pooling and backpressure as first-line defenses, instrument deploys for resource-impact, and ensure alerts detect rising trends not just absolute thresholds.
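The point about detecting rising trends can be made concrete. The sketch below alerts either when connections approach the limit or when they are growing quickly; in practice this would likely be a rule in the monitoring system rather than Python, and the function names and thresholds here are hypothetical.

```python
from typing import List, Tuple


def connection_growth_per_minute(samples: List[Tuple[float, float]]) -> float:
    """Average growth in connections per minute across the sample window.

    `samples` is a list of (unix_seconds, connection_count), oldest first.
    """
    (t0, c0), (t1, c1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must cover a positive time span")
    return (c1 - c0) / ((t1 - t0) / 60.0)


def should_alert(samples: List[Tuple[float, float]], max_connections: int,
                 growth_limit_per_min: float, usage_limit: float = 0.9) -> bool:
    """Fire on an absolute threshold OR on a fast upward trend."""
    latest = samples[-1][1]
    near_limit = latest >= usage_limit * max_connections
    rising_fast = connection_growth_per_minute(samples) >= growth_limit_per_min
    return near_limit or rising_fast
```

The trend condition can fire while usage is still well under the limit, which buys responders time that a threshold-only alert would not.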

Caching Strategies and Patterns · Hard · System Design

Architect a global multi-region caching solution for user profile reads serving 200 million users with 95th percentile latency under 50 ms globally. Discuss active-active versus active-passive replication, read-local strategies, invalidation across regions, and how to meet consistency and availability SLOs.

Sample Answer

Requirements:
- 200M users, 95th percentile read latency under 50 ms globally, and high availability.
- Read-mostly user profile data with occasional updates.
- SLOs: near-strong consistency for profile updates (e.g., read-after-write within a region, or bounded staleness) and 99.99% availability.

High-level architecture:
- Multi-region edge caching tier plus a global origin DB. Each region runs:
  - A local read cache (in-memory, e.g., Redis Cluster, or a regional caching layer such as Envoy+LRU)
  - Cache population via write-through/refresh-on-miss and async invalidation
- A global control plane for invalidation and metadata (Kafka/CDC + pub/sub).
- The origin DB is either a multi-region primary or a globally replicated DB such as Spanner or CockroachDB, depending on consistency needs.

Active-active vs. active-passive:
- Active-active (writes accepted in any region) gives the lowest write latency and high availability, but requires conflict resolution and a strongly consistent global DB or careful use of CRDTs. Use it only if the origin DB supports strong global consistency.
- Active-passive (a single write master or region affinity) simplifies consistency: route writes to the master region and replicate asynchronously. Choose it when updates are rare and global strong consistency is not required.

Read-local strategies:
- Serve reads from a local cache (edge CDN or regional Redis) to meet the 50 ms target.
- On a cache miss, either fetch synchronously from a local DB replica or fall back to the nearest region's origin, protected by circuit breakers and rate limits.
- Keep hot profiles warm, use LRU eviction, and pre-warm based on access patterns.

Invalidation across regions:
- On a profile update, the origin publishes a change event to a global pub/sub system (Kafka, Cloud Pub/Sub). Each region subscribes and invalidates or updates its local cache.
- For write-through, update the cache in the writing region immediately and publish the event. Use version numbers (monotonic increments or vector clocks) to ignore out-of-order updates.
- Use deduplication and batching for high QPS.
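Ignoring out-of-order updates with version numbers can be sketched in a few lines. The class below is an illustrative in-memory stand-in for a regional cache, assuming each change event carries a monotonically increasing per-key version; the class and method names are hypothetical.

```python
from typing import Any, Dict, Optional, Tuple


class RegionalCache:
    """Applies change events idempotently, keeping only the newest version per key."""

    def __init__(self) -> None:
        self._entries: Dict[str, Tuple[int, Any]] = {}  # key -> (version, value)

    def apply_update(self, key: str, version: int, value: Any) -> bool:
        """Apply an event; return False if it is stale or a duplicate."""
        current = self._entries.get(key)
        if current is not None and current[0] >= version:
            return False  # an equal or newer version is already cached
        self._entries[key] = (version, value)
        return True

    def get(self, key: str) -> Optional[Any]:
        entry = self._entries.get(key)
        return None if entry is None else entry[1]
```

Because stale and duplicate events are dropped, the pub/sub layer only needs at-least-once delivery and does not have to guarantee ordering across regions.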

Consistency & availability trade-offs:- Define SLO tiers: read-after-write within same region (strong), global stale reads allowed with bounded staleness (e.g., 1s) for most users.- To meet availability, prefer eventual consistency with conflict detection for rare concurrent writes.- Provide optional strong reads by routing to origin or using read-repair paths.

Operational concerns:- Monitoring: latency, miss rate, invalidation lag, replication lag, error budgets; alert thresholds for 95th > 50ms and pub/sub lag.- Capacity: provision regional cache clusters sized by regional QPS and hot set; autoscale + circuit-breakers.- Testing: chaos/injection for partition scenarios, failover drills for master region.- Security & throttling: auth, rate limits, and graceful degradation (serve slightly stale profile if origin unreachable).

Trade-offs summary:
- If update rates are low and staleness is tolerated → active-passive + read-local caches + async invalidation (simpler, highly available).
- If strict global consistency is required → active-active with a globally consistent DB (higher complexity/cost).

Bash and Shell Scripting | Hard | Technical

43 practiced

Propose a testing strategy for a set of critical SRE Bash scripts: include unit testing with bats/shunit2, integration tests in ephemeral containers, static analysis with shellcheck, and CI pipeline steps. Provide an example unit test for a small function and describe how you'd mock external commands (e.g., systemctl, ssh) for unit tests in CI.

Sample Answer

Requirements & goals:
- Fast, reliable unit tests for logic in scripts
- Integration tests that exercise real system calls in isolated ephemeral containers
- Static analysis to catch common shell pitfalls
- CI pipeline that runs linters, unit tests, and integration tests, and gates merges

Strategy overview:
1. Static analysis: run shellcheck in strict mode (with explicit handling of SC1090/SC1091 for sourced files) and shfmt for style.
2. Unit testing: use bats (or shunit2) to test pure functions by sourcing the script and mocking external commands. Keep functions small and pure.
3. Integration testing: run tests in ephemeral containers (Docker/Podman) or ephemeral VMs that mirror prod (systemd vs OpenRC); mount only required sockets, and run real systemctl/ssh where safe.
4. CI pipeline: stages are lint -> unit -> build image -> integration -> publish; fail fast; cache dependencies.

Example unit test (bats) for a function that checks service status:

bash

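# tests/service_status.bats
load 'test_helper/bats-support/load'
setup() {
  # Put mocks first on PATH so they shadow the real commands.
  # BATS_TEST_DIRNAME is the directory containing this test file, so the
  # paths resolve no matter where bats is invoked from.
  export PATH="$BATS_TEST_DIRNAME/test_mocks:$PATH"
  source "$BATS_TEST_DIRNAME/../scripts/utils.sh"
}
@test "is_service_active returns 0 when systemctl shows active" {
  # Assumes is_service_active prints the output of `systemctl is-active`.
  run is_service_active "nginx"
  [ "$status" -eq 0 ]
  [ "$output" = "active" ]
}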

Mock example (test_mocks/systemctl):

bash

#!/bin/bash
if [[ "$1" == "is-active" && "$2" == "nginx" ]]; then
  echo "active"
  exit 0
fi
echo "inactive"
exit 3

Mocking approaches:
- PATH shadowing: put mock binaries in test_mocks/ earlier in PATH so scripts call them instead of the real commands.
- Function overrides: source the script in the test and override commands as shell functions (e.g., systemctl() { echo active; }).
- Use bats-mock or stub utilities to assert call args.
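The PATH-shadowing idea is independent of the test framework. As a hedged sketch, the Python harness below writes a throwaway `systemctl` mock into a temporary directory, puts that directory first on PATH, and checks that the mock is what gets resolved and executed. The helper name `run_with_mock` and the mock's behaviour are illustrative assumptions; it also assumes a POSIX system with `/bin/sh` and an executable temp directory.

```python
import os
import shutil
import stat
import subprocess
import tempfile

MOCK_SYSTEMCTL = """#!/bin/sh
# Hypothetical stand-in for systemctl: only 'is-active nginx' reports active.
if [ "$1" = "is-active" ] && [ "$2" = "nginx" ]; then
  echo "active"; exit 0
fi
echo "inactive"; exit 3
"""


def run_with_mock(args):
    """Run `systemctl <args>` against a PATH whose first entry holds the mock."""
    with tempfile.TemporaryDirectory() as mock_dir:
        mock_path = os.path.join(mock_dir, "systemctl")
        with open(mock_path, "w") as fh:
            fh.write(MOCK_SYSTEMCTL)
        os.chmod(mock_path, os.stat(mock_path).st_mode | stat.S_IXUSR)
        # Prepend the mock directory so it shadows any real systemctl.
        env = dict(os.environ, PATH=mock_dir + os.pathsep + os.environ.get("PATH", ""))
        resolved = shutil.which("systemctl", path=env["PATH"])
        proc = subprocess.run([resolved, *args], env=env, capture_output=True, text=True)
        return resolved.startswith(mock_dir), proc.returncode, proc.stdout.strip()


print(run_with_mock(["is-active", "nginx"]))
print(run_with_mock(["is-active", "redis"]))
```

The same pattern works for `ssh` or any other external command: the script under test is unchanged, and only the environment it runs in differs.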

Integration tests:
- Build a test image with the scripts installed.
- Run the container unprivileged (--privileged=false); for systemctl tests use a systemd-enabled container or systemd-nspawn; for ssh tests spin up an ephemeral SSH server container and test connectivity/behavior.
- Tear down networks and containers after each test to ensure isolation.

CI pipeline (example GitHub Actions):
- Lint: run shellcheck and an shfmt check.
- Unit tests: set up a minimal runner and run bats; use a matrix for bash versions.
- Build: build the container image.
- Integration tests: run docker-compose or a matrix of images; store required credentials as repository secrets; run tests in isolated runners.
- Publish: on success, tag artifacts.

Observability & safety:
- Add verbose/log flags to scripts to capture outputs in CI artifacts.
- Use timeouts for long-running integration tests.
- Keep unit tests hermetic; run integration tests only on protected branches.

Trade-offs:
- PATH-mocking is simple and robust for unit tests but doesn't catch integration edge cases; ephemeral container tests add confidence but cost more time, so run them on merge to main only.

Decision Making Under Uncertainty | Easy | Technical

41 practiced

A frequently noisy alert stems from a metric with very high cardinality (many tag combinations). Describe practical short-term and medium-term changes you would implement to reduce false positives and alert fatigue while preserving meaningful signal.

Collaboration and Communication Skills | Easy | Technical

75 practiced

You need to prepare a 10-minute incident briefing for a mixed audience: engineers, customer success, and executives. Describe the content and structure of your briefing so that each audience gets the information they need without being overwhelmed. Include suggested visuals and handoff points for deep dives.

Capacity Planning and Resource Optimization | Hard | Technical

24 practiced

Implement (or outline) a simplified capacity simulator in Python that, given a time-series of arrival rates, a service time distribution, an initial number of servers, and a simple scaling policy (add/remove servers based on average utilization), simulates queue lengths and latencies over time. Describe design choices (discrete-time step vs event-driven), data structures, and limitations of your simulator.

Sample Answer

Approach summary:
- Choose a discrete-time simulator (fixed tick dt). It's simpler to implement and good enough for capacity-planning purposes; the trade-offs with event-driven simulation are explained below.
- Model: arrivals per tick ~ Poisson(rate * dt). Service times ~ Exponential(mean_service). Servers process jobs FIFO; each server stores its remaining service time or is idle.
- Autoscaling: sample average utilization over a sliding window; once the cooldown has passed, add a server if utilization is above the high threshold (up to max), or remove an idle server if it is below the low threshold.

Python implementation (simplified, readable):

python

import random, math, collections

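def poisson(k_lambda):
    """Draw a Poisson(k_lambda) sample using Knuth's algorithm.

    Suitable for the small per-tick means used here; exp(-k_lambda)
    underflows for very large means.
    """
    if k_lambda <= 0:
        return 0  # without this guard a zero rate would return -1
    L = math.exp(-k_lambda); p=1.0; k=0
    while p > L:
        p *= random.random(); k += 1
    return k-1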

def simulate(arrival_rates, mean_service, init_servers,
             dt=1.0, window=30, high=0.75, low=0.25,
             cooldown=60, max_servers=100, total_time=None):
    if total_time is None: total_time = len(arrival_rates)*dt
    t=0.0
    servers = [0.0]*init_servers   # remaining service time per server; 0 = idle
    queue = collections.deque()
    util_history = collections.deque(maxlen=max(1, int(window/dt)))  # at least one sample, avoids division by zero
    stats = {"time":[], "queue_len":[], "avg_latency":[]}
    latencies=[]
    last_scale = -1e9
    i_rate = 0
    while t < total_time:
        rate = arrival_rates[min(i_rate, len(arrival_rates)-1)]
        arrivals = poisson(rate*dt)
        for _ in range(arrivals):
            queue.append({"arrival":t, "service": random.expovariate(1.0/mean_service)})
        # assign queued jobs to idle servers (FIFO)
        for s in range(len(servers)):
            if servers[s] <= 0 and queue:
                job = queue.popleft()
                servers[s] = job["service"]
                # record (completion_time, arrival_time) so latency can be
                # computed once the job has finished
                latencies.append((t + job["service"], job["arrival"]))
        # advance time: deduct dt from running servers; collect completed jobs
        for s in range(len(servers)):
            if servers[s] > 0:
                servers[s] -= dt
        # compute utilization (fraction of busy servers)
        busy = sum(1 for x in servers if x>0)
        util = busy / max(1, len(servers))
        util_history.append(util)
        avg_util = sum(util_history)/len(util_history)
        # scaling decision
        if (t - last_scale) >= cooldown:
            if avg_util > high and len(servers) < max_servers:
                servers.append(0.0); last_scale=t
            elif avg_util < low and len(servers) > 1:
                # only remove idle server
                for idx in range(len(servers)-1, -1, -1):
                    if servers[idx] <= 0:
                        servers.pop(idx); last_scale=t; break
        # record stats
        # average latency of jobs that completed since the previous tick
        completed = [ (comp,arr) for comp,arr in latencies if comp <= t ]
        if completed:
            avg_lat = sum(comp-arr for comp,arr in completed)/len(completed)
            # drop completed from list
            latencies = [x for x in latencies if x[0] > t]
        else:
            avg_lat = 0.0
        stats["time"].append(t)
        stats["queue_len"].append(len(queue))
        stats["avg_latency"].append(avg_lat)
        t += dt
        i_rate += 1
    return stats

Key concepts and reasoning:
- Discrete-time is simpler to code and aligns with sampling-based autoscaling; event-driven is more accurate and efficient when load is sparse or service times have long tails, because it advances directly to the next event.
- Servers store remaining service time (O(num_servers)); the queue is a deque for O(1) enqueue/dequeue.
- Poisson arrivals + exponential service times approximate M/M/c queues; the results are useful for capacity planning.
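Because the model approximates an M/M/c queue, its steady-state output can be sanity-checked against the closed-form Erlang C result for a fixed server count. The sketch below is an added cross-check, not part of the simulator; the function names are illustrative, and it assumes constant arrival rate and no autoscaling.

```python
import math


def erlang_c(arrival_rate, mean_service, servers):
    """Probability that an arriving job has to wait in an M/M/c queue."""
    offered = arrival_rate * mean_service        # offered load in Erlangs
    if offered >= servers:
        return 1.0                               # unstable: the queue grows without bound
    rho = offered / servers                      # per-server utilization
    head = sum(offered**k / math.factorial(k) for k in range(servers))
    tail = offered**servers / (math.factorial(servers) * (1 - rho))
    return tail / (head + tail)


def mean_wait(arrival_rate, mean_service, servers):
    """Mean time spent queueing (excluding service): C / (c * mu - lambda)."""
    if arrival_rate * mean_service >= servers:
        return math.inf
    service_rate = 1.0 / mean_service
    return erlang_c(arrival_rate, mean_service, servers) / (servers * service_rate - arrival_rate)


print(erlang_c(2.0, 1.0, 3), mean_wait(2.0, 1.0, 3))
```

If the simulated queueing delay at a fixed server count is far from `mean_wait` for the same parameters, the tick size dt is probably too coarse.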

Complexity:
- Each tick is O(num_servers + arrivals). Memory is O(num_servers + queue_length).

Edge cases & limitations:
- The time step dt must be small relative to service times to reduce discretization error (at the cost of runtime).
- Exponential service time is simplistic; real workloads may have heavy tails.
- No multi-class priorities, no warm-up/startup cost for servers, and no provisioning delay (these could be added).
- Scaling hysteresis/cooldown is implemented but not predictive; an event-driven simulator would be better for latency percentiles and rarer events.
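For comparison, an event-driven variant can be sketched with a heap of server-free times. This is a minimal illustration with a fixed server count and no autoscaling; `simulate_event_driven` and its parameters are hypothetical names, and it keeps the same Poisson-arrival and exponential-service assumptions as the discrete-time version.

```python
import heapq
import random


def simulate_event_driven(arrival_rate, mean_service, servers, horizon, seed=0):
    """Event-driven FIFO multi-server simulation with a fixed server count.

    Returns per-job latencies (waiting + service time) for every job
    that arrives before `horizon`.
    """
    rng = random.Random(seed)
    free_at = [0.0] * servers      # min-heap: time at which each server becomes free
    heapq.heapify(free_at)
    latencies = []
    t = rng.expovariate(arrival_rate)              # time of the first arrival
    while t < horizon:
        earliest_free = heapq.heappop(free_at)     # FIFO: take the first server to free up
        start = max(t, earliest_free)              # wait only if all servers are busy
        finish = start + rng.expovariate(1.0 / mean_service)
        heapq.heappush(free_at, finish)
        latencies.append(finish - t)
        t += rng.expovariate(arrival_rate)         # jump straight to the next arrival
    return latencies


latencies = simulate_event_driven(arrival_rate=2.0, mean_service=0.4, servers=2, horizon=1000.0, seed=1)
ordered = sorted(latencies)
print(len(ordered), ordered[int(0.99 * len(ordered))])  # job count and approximate p99 latency
```

Because exact per-job latencies are kept, percentiles come directly from the sorted list with no discretization error from a tick size.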

Blameless Postmortem and Organizational Learning | Hard | Technical

56 practiced

You must perform forensic investigation for an incident where critical logs were rotated and deleted before review. List technical sources and process strategies to reconstruct the timeline and root cause when evidence is partially missing, and explain how to document uncertainty in the postmortem.

Sample Answer

Start by treating this as a missing-evidence forensic investigation: preserve what remains, then reconstruct using multiple independent sources and explicit uncertainty notation.

Technical sources to collect and correlate:
- On-host: systemd-journal (journalctl), /var/log, rotated archives, filesystem metadata (mtime/ctime/atime), inode change logs, ext4/xfs journals, LVM snapshots, deleted-file carving (bulk_extractor, photorec), process accounting (acct), kernel logs (dmesg), auditd/audit.log.
- Network & perimeter: packet captures (pcap), load balancer logs, firewall/WAF logs, IDS/IPS alerts, DNS logs, CDN logs, SMTP/relay logs.
- Cloud/Platform: CloudTrail / GCP Audit Logs / Azure Activity Logs, S3/object storage access logs, KMS/audit, IAM logs, control-plane events, autoscaler activity.
- Observability: Prometheus metrics, Alertmanager history, Grafana snapshots, tracing spans (Jaeger/Zipkin), APM logs.
- CI/CD & automation: deployment logs (Jenkins/GitHub Actions), configuration management state (Ansible/Chef), rotation/retention scripts, cron jobs.
- Endpoints & third parties: endpoint EDR, developer machines, partner logs, log-shipper logs (Fluentd/Vector), and broker (Kafka) offsets/commits.

Process strategies to reconstruct timeline and root cause:
1. Preserve and image: take forensic images/snapshots (disk, memory) and immutable copies of all logs and configs encountered; record hashes and chain-of-custody.
2. Build a layered timeline: ingest all timestamps into a timeline tool (Plaso/Timesketch); normalize timezones and correct clock skew using NTP/chrony logs and secure time sources.
3. Cross-validate events: correlate independent sources (e.g., CloudTrail event + pcap + host process start) to raise confidence; mark events seen in only a single source as low-confidence.
4. Recover deleted data: attempt file carving on filesystem images, check rotated archives on object storage or backups, and examine log-forwarder persistence (local buffers, Kafka offsets).
5. Inspect rotation mechanics: review rotation config (logrotate, journald settings), retention policies, scripts, and recent changes (git commits, recent deploys) to detect the root cause (buggy rotation, misconfigured retention, malicious action).
6. Hypothesis-driven analysis: formulate hypotheses (accidental rotation vs malicious deletion vs failed shipper), enumerate the expected observable evidence for each, test against collected data, and iterate.
7. Reproduce safely: in an isolated lab, recreate the rotation/deletion using the same versions/configs to confirm behavior without altering evidence.
8. Root-cause synthesis: combine the causal chain (trigger → system action → outcome) and identify contributing factors (automation, missing alerts, permissions).
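Steps 2 and 3 (normalize timestamps, correct skew, cross-validate) can be prototyped before reaching for a full timeline tool. The sketch below is illustrative only: the `Event` type, the source names, the skew table, and the rule that matching descriptions within a time window count as corroboration are all simplifying assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Event:
    source: str          # illustrative source name, e.g. "host-journal"
    timestamp: datetime  # timezone-aware, as recorded by the source
    description: str


def build_timeline(events, skew_by_source, window=timedelta(seconds=5)):
    """Normalize to UTC, subtract known per-source clock skew, tag confidence.

    An event is tagged 'high' if a different source reports the same
    description within `window`; otherwise it is tagged 'low'.
    """
    normalized = []
    for event in events:
        skew = skew_by_source.get(event.source, timedelta(0))
        normalized.append((event.timestamp.astimezone(timezone.utc) - skew, event))
    normalized.sort(key=lambda pair: pair[0])
    timeline = []
    for ts, event in normalized:
        corroborated = any(
            other.source != event.source
            and other.description == event.description
            and abs(other_ts - ts) <= window
            for other_ts, other in normalized
        )
        timeline.append((ts, event.source, event.description, "high" if corroborated else "low"))
    return timeline


cet = timezone(timedelta(hours=1))
events = [
    Event("host-journal", datetime(2025, 11, 9, 15, 3, 24, tzinfo=cet), "log file rotated"),
    Event("log-shipper", datetime(2025, 11, 9, 14, 3, 19, tzinfo=timezone.utc), "log file rotated"),
    Event("load-balancer", datetime(2025, 11, 9, 14, 5, 0, tzinfo=timezone.utc), "5xx spike"),
]
skew = {"host-journal": timedelta(seconds=3)}  # assumed: host clock runs 3 s fast
timeline = build_timeline(events, skew)
for row in timeline:
    print(row)
```

In a real investigation the matching rule would be richer than string equality, but the output shape (timestamp, source, claim, confidence) carries over directly into the postmortem.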

Documenting uncertainty in the postmortem:
- Use explicit confidence levels per event (high/medium/low) and provenance tags (primary/secondary/inferred).
- Show the evidence tree: for each asserted fact include sources, timestamps, hashes, and whether it was directly observed, recovered, or inferred.
- List gaps and assumptions: state what was missing, what you assumed to fill gaps, and how each assumption affects conclusions.
- Present alternative hypotheses: explain which were rejected and why, and which remain plausible.
- Actionable remediation with risk/priority: recommend concrete fixes and note which depend on unverified assumptions.
- Append reproducible commands, scripts, and forensic artifacts (or secure references) so reviewers can verify findings.

Example excerpt style: "2025-11-09T14:03:21Z — hostA: logrotate executed (source: /var/log/logrotate.status, hash=...). Confidence: high. Corroborating: CloudTrail StartInstances at 14:03:19Z (secondary). Missing: rotated archive file not present on host; attempted carving returned fragments (low confidence). Assumption: rotation script uploaded to S3 retention policy set to 7d — unverified."

This approach ensures a defensible timeline, clear traceability, and transparent handling of uncertainty for engineers, auditors, and leadership.

Caching Strategies and Patterns | Hard | System Design

70 practiced

Plan a migration from memcached to a redis-cluster for a production cache with zero downtime. Describe data migration strategy, synchronization approaches, dual-write or dual-read strategies, cutover verification checks, rollback procedures, and how to handle inconsistent data during the migration window.

Sample Answer

Requirements & constraints:
- Zero downtime, preserve cache correctness where possible, bounded performance impact, safe rollback, support for a large keyspace and TTLs, and maintained SLOs.

High-level plan:
1. Provision the Redis Cluster in parallel (capacity, sharding, persistence/replica settings, monitoring, ACLs).
2. Migrate traffic with a blue/green-like approach using dual-write + dual-read verification, then cut over.

Data migration strategy:
- Cold-sync: snapshot memcached keys by scanning an application key registry or by instrumenting the memcached client to emit keys; if there is no registry, use a periodic key dump from the memcached instances (LRU crawls or wrappers).
- Bulk-load: write the dumped keys into the Redis cluster, preserving TTLs, using batched, pipelined writers with rate limiting.
- Continuous-sync: capture mutations during the bulk load via dual-write at the application layer or a write-proxy.

Synchronization approaches:
- Dual-write (primary): modify application clients to write to both memcached and Redis. The two writes cannot be atomic, so treat the second as best-effort with retry. Implement idempotent writes and make writes to the secondary non-blocking (async, buffered).
- Change-capture fallback: tap a memcached mutation log (if instrumented) or wrap the client library to emit events to a CDC queue (Kafka) and replay them to Redis.
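A dual-write wrapper can be sketched as follows. This is a simplified, synchronous illustration using plain dicts as stand-ins for the memcached and Redis clients; `DualWriteCache`, `FlakyStore`, and the retry count are hypothetical, and a production version would buffer secondary writes asynchronously as described above.

```python
class FlakyStore(dict):
    """Dict that fails its first `failures` writes, to mimic a transient outage."""

    def __init__(self, failures):
        super().__init__()
        self.failures = failures

    def __setitem__(self, key, value):
        if self.failures > 0:
            self.failures -= 1
            raise ConnectionError("simulated outage")
        super().__setitem__(key, value)


class DualWriteCache:
    """Writes go to the primary first, then best-effort to the secondary."""

    def __init__(self, primary, secondary, max_retries=2):
        self.primary = primary        # authoritative store (memcached during migration)
        self.secondary = secondary    # store being warmed (Redis)
        self.max_retries = max_retries
        self.failed_keys = []         # keys to backfill later via reconciliation

    def set(self, key, value):
        self.primary[key] = value     # must succeed; errors propagate to the caller
        for _ in range(self.max_retries + 1):
            try:
                self.secondary[key] = value   # idempotent: same key, same value
                return True
            except ConnectionError:
                continue
        self.failed_keys.append(key)  # never fail the request because of the secondary
        return False


recovering = DualWriteCache({}, FlakyStore(failures=1))
down = DualWriteCache({}, FlakyStore(failures=5))
print(recovering.set("k", "v"), down.set("k", "v"), down.failed_keys)
```

Recording `failed_keys` instead of raising keeps the request path unaffected by the new store, while giving the reconciliation job an explicit backfill list.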

Dual-read strategies:
- Read-through verification phase (canary): on read, check Redis first; on a miss, fall back to memcached. When the key exists in both, compare checksums for a sampled percentage of reads (e.g., 1% -> 10% -> 100%) and log mismatches. Use probabilistic sampling to avoid performance impact.
- Progressive ramp: start with a small percentage of traffic doing dual-read checks; increase it as confidence grows.
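The sampled comparison can be sketched like this. The stores are plain dicts standing in for real clients, and `DualReader`, `checksum`, and the use of `repr` for serialization are illustrative assumptions rather than a recommended wire format.

```python
import hashlib
import random


def checksum(value):
    """Stable digest of a cached value (repr-based, for illustration only)."""
    return hashlib.sha256(repr(value).encode()).hexdigest()


class DualReader:
    """Reads from the new store first and verifies a sample against the old one."""

    def __init__(self, new_store, old_store, sample_rate, rng=None):
        self.new_store = new_store
        self.old_store = old_store
        self.sample_rate = sample_rate      # fraction of reads to verify, 0.0-1.0
        self.rng = rng or random.Random()
        self.compared = 0
        self.mismatches = []

    def get(self, key):
        value = self.new_store.get(key)
        if value is None:
            return self.old_store.get(key)  # miss in the new store: fall back
        if self.rng.random() < self.sample_rate:
            old = self.old_store.get(key)
            if old is not None:             # only compare keys present in both
                self.compared += 1
                if checksum(old) != checksum(value):
                    self.mismatches.append(key)
        return value

    def mismatch_rate(self):
        return len(self.mismatches) / self.compared if self.compared else 0.0


redis_like = {"a": 1, "b": 2}
memcached_like = {"a": 1, "b": 99, "c": 3}
reader = DualReader(redis_like, memcached_like, sample_rate=1.0)
results = [reader.get(k) for k in ("a", "b", "c")]
print(results, reader.mismatches, reader.mismatch_rate())
```

Ramping the migration then amounts to raising `sample_rate` behind a feature flag while watching `mismatch_rate` against the cutover threshold.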

Cutover verification checks:
- Consistency metrics: sampled key checksum mismatch rate below threshold (e.g., 0.1%).
- Hit rates and latencies: Redis p99 latency within SLO and cache hit rate comparable to memcached.
- Error rates: application and Redis errors within baseline.
- Capacity: memory usage and eviction rates stable.
- End-to-end smoke tests for critical flows.

Cutover steps:
1. Start dual-write + bulk-load + continuous-sync.
2. Enable dual-read sampling; monitor mismatch metrics and system health.
3. If stable, switch the read-primary to Redis for all traffic (fast config flip or feature flag).
4. Monitor closely for 24-72 hrs; then decommission memcached.

Rollback procedures:
- If thresholds are breached, flip reads back to memcached (feature flag) and continue dual-write until the issues are resolved.
- For writes: because dual-write goes to both stores, memcached remains authoritative until full cutover. If some writes were missed in Redis, replay them from the CDC queue or request logs to backfill.
- Keep memcached running for a holdback period (e.g., 7 days) before teardown.

Handling inconsistent data:
- Tolerate transient inconsistency during the migration window; aim for convergence.
- For mismatches: log the keys and run reconciliation jobs that compare and repair (e.g., prefer the Redis value if that is the chosen policy).
- Design idempotent key semantics so last-writer-wins is safe.
- For critical keys (sessions/locks): route exclusively to memcached or to a transactional store until verified.
- Use TTL alignment and avoid negative caching during migration.

Operational considerations:
- Feature flags to control dual-write/dual-read and instant rollbacks.
- Rate-limited bulk loaders and circuit breakers for Redis.
- Observability: dashboards for mismatch rate, latency, errors, evictions, memory, and throughput, with alerts on anomalies.
- Runbook documenting decision thresholds and rollback commands.
- Post-migration: tombstone period, final reconciliation, decommission steps, and a postmortem.

This plan minimizes downtime by running systems in parallel, using progressive verification, and keeping an easy rollback path while ensuring observability and automated reconciliation.

Bash and Shell Scripting | Easy | Technical

35 practiced

Explain how exit codes work in Unix shell scripts. How do you check the exit status of the last command and how do you capture exit statuses of commands in a pipeline? Provide examples using '$?' and Bash's 'PIPESTATUS' (or alternatives in POSIX shells). Also describe common conventions for exit codes and their meanings.

Sample Answer

Exit codes are integers from 0 to 255 that a process returns to indicate success or failure. By convention 0 means success and any non-zero value means failure. In shell scripts you check the last command's exit status with $?. For pipelines, Bash exposes the array PIPESTATUS, which holds each stage's exit code. PIPESTATUS is a Bash extension and is not available in POSIX sh. set -o pipefail, which makes the pipeline's exit code reflect a failing stage, is available in Bash, ksh, and zsh but was only standardized in POSIX.1-2024, so older sh implementations may lack it; in those shells, capture individual statuses with temporary files or extra file descriptors.

Examples:

bash

# simple check using $?
cp src dest
if [ $? -ne 0 ]; then
  echo "copy failed"
fi

# better style: use immediate test
if ! cp src dest; then
  echo "copy failed"
fi

Pipeline examples (Bash):

bash

# Without pipefail, pipeline exit is last command
false | true
echo $?        # prints 0

# Use pipefail so pipeline fails if any stage fails
set -o pipefail
false | true
echo $?        # prints 1 (exit of failed stage)

# Inspect all stages with PIPESTATUS (read it immediately; the next command overwrites it)
false | true
echo "${PIPESTATUS[@]}"   # prints: 1 0

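POSIX alternative to get each stage's status: have each stage write its exit code to a temporary file or an extra file descriptor and read it back afterwards, or run the commands sequentially with an intermediate file if the data flow allows. Process substitution is not an option here because it is not part of POSIX sh.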

Common conventions:
- 0: success
- 1: general errors
- 2: misuse of shell builtins (shell reserved)
- 126: command invoked cannot execute
- 127: command not found
- 128+n: terminated by signal n (e.g., 130 = 128+2 for Ctrl-C)
- Application-specific codes: document them in the README or --help
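When automation wraps shell commands, these conventions can be decoded programmatically so alerts carry a readable reason. The helper below is a small sketch; `describe_exit_code` is a hypothetical name, and the mapping covers only the conventions listed above.

```python
import signal
import subprocess
import sys


def describe_exit_code(code):
    """Map a shell-style exit status (0-255) to a human-readable meaning."""
    if code == 0:
        return "success"
    if code == 1:
        return "general error"
    if code == 2:
        return "misuse of shell builtin (by convention)"
    if code == 126:
        return "command found but not executable"
    if code == 127:
        return "command not found"
    if code > 128:
        try:
            return f"terminated by {signal.Signals(code - 128).name}"
        except ValueError:
            return f"terminated by unknown signal {code - 128}"
    return f"failure (application-specific code {code})"


# Run a child process that exits with status 3 and decode the result.
proc = subprocess.run([sys.executable, "-c", "import sys; sys.exit(3)"])
print(proc.returncode, describe_exit_code(proc.returncode))
print(describe_exit_code(130))
```

Note that Python's `subprocess` reports a child killed by signal n as a negative `returncode` (-n) rather than 128+n; the 128+n form is what a shell reports in `$?`.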

Best practices for SRE:
- Prefer checking commands immediately (avoid reading $? later).
- Use set -e and set -o pipefail carefully in scripts (explicit error handling is clearer).
- Log errors and map specific failure codes to alerts/runbooks so operators can act quickly.

