Apple Site Reliability Engineer (Senior Level) - Comprehensive Interview Preparation Guide

Site Reliability Engineer (SRE)

Apple

Senior

6 rounds

Updated 6/17/2026

Apple's Site Reliability Engineer interview process for Senior-level candidates is comprehensive and spans approximately 6 months from initial application to offer. The process includes a recruiter screening phase followed by a virtual on-site with multiple technical rounds focused on systems internals, networking fundamentals, coding/algorithms, system design, and behavioral assessment. Each round includes behavioral evaluation components. The interview emphasizes depth of knowledge in distributed systems, Linux fundamentals, observability, and system design with particular focus on load balancing and reliability at scale.

Interview Rounds

Recruiter Screening

30 min, 2 focus topics, culture fit

Systems Internals Deep Dive

60 min, 5 focus topics, technical

What to Expect

Technical round (60 minutes) focused on deep Linux knowledge and system troubleshooting. Expect a realistic Linux troubleshooting scenario (e.g., SSH is failing but you still have console access to the machine). The interviewer will guide you through diagnosis using Linux tools and will probe your understanding of the /proc filesystem, memory management, process management, and how shell commands are interpreted. This round assesses your foundational expertise in systems administration, your ability to think through problems systematically, and the depth of Linux knowledge required for production reliability work.

Tips & Advice

Before the interview, review the Linux boot process, process management, memory management (heap vs. stack, page tables), the /proc filesystem structure, and common Linux troubleshooting tools. Practice debugging a real scenario where you can't SSH into a machine - think through what you'd check first, how you'd gather information from /proc, how you'd interpret system calls with strace. Be comfortable discussing how the shell interprets commands, environment variables, and file descriptors. For Senior level, explain not just how to fix the problem but how you'd prevent it and monitor for it in production. Ask clarifying questions about the environment when given a scenario.

Focus Topics

Shell Interpretation and Command Execution

Understanding of how shells parse and execute commands, including quoting, expansions (glob, variable, command substitution), piping, redirection, and background processes. Know how environment variables are inherited, how file descriptors work, and how subshells behave.
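
A few of these behaviors are easy to get wrong under interview pressure, so a small sketch may help. The function names below (`count_args`, `demo_quoting`, and so on) and the variable names are made up for illustration; each function isolates one behavior.

```bash
#!/usr/bin/env bash
# Illustrative only: each function isolates one shell behavior.

# Word splitting: an unquoted expansion is split on whitespace, a quoted one is not.
count_args() { echo "$#"; }
demo_quoting() {
  local value="two words"
  # The unquoted $value below is deliberate, to show the split.
  echo "unquoted=$(count_args $value) quoted=$(count_args "$value")"
}

# Subshell scope: assignments made inside ( ... ) do not reach the parent shell.
demo_subshell() {
  local counter=1
  ( counter=99 )
  echo "$counter"
}

# Environment inheritance: only exported variables are visible to child processes.
demo_export() {
  GR_DEMO_LOCAL="parent"            # plain shell variable, not exported
  export GR_DEMO_SHARED="exported"  # placed in the environment of children
  bash -c 'echo "local=${GR_DEMO_LOCAL:-unset} shared=${GR_DEMO_SHARED:-unset}"'
}

# Redirection order: redirections are applied left to right.
emit() { echo out; echo err >&2; }
demo_redirect_order() {
  # fd 2 is pointed at the current stdout first, then stdout goes to /dev/null,
  # so only "err" survives. Swapping the two redirections silences both streams.
  emit 2>&1 >/dev/null
}

demo_quoting
demo_subshell
demo_export
demo_redirect_order
```

Being able to predict the four output lines before running the script is a reasonable self-check for this topic.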

System Call Interface and Kernel-User Space Interaction

Understanding of what system calls are, how applications interact with the kernel, and how to trace system calls with strace. Know common system calls related to process management, file I/O, and networking. Understand the difference between user space and kernel space.

Linux Process Management and /proc Filesystem

Deep understanding of how processes work in Linux, including process states, memory layouts, file descriptors, and how to inspect processes via /proc. Know how to read /proc/[pid]/status, /proc/[pid]/maps, /proc/meminfo, and interpret this information to diagnose issues. Understand process scheduling, context switching, and CPU affinity.
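
As a rough sketch of reading process state from /proc, the helper below pulls a single field out of a `/proc/<pid>/status`-style file. The name `status_field` is hypothetical, and the function takes a file path so it can also be tried against a saved copy of the file.

```bash
#!/usr/bin/env bash
# Print the value of one field from a /proc/<pid>/status-style file.
# Usage: status_field <file> <field>    e.g. status_field /proc/1234/status VmRSS
status_field() {
  local file="$1" field="$2"
  # Lines look like "VmRSS:<tab>   5120 kB": split on the colon plus any padding.
  awk -F':[ \t]*' -v f="$field" '$1 == f { print $2; exit }' "$file"
}

# On a Linux host, inspect the current shell. $$ is used rather than "self",
# because /proc/self would resolve to the awk process doing the read.
if [[ -r "/proc/$$/status" ]]; then
  echo "state=$(status_field "/proc/$$/status" State)"
  echo "rss=$(status_field "/proc/$$/status" VmRSS)"
  echo "open_fds=$(ls "/proc/$$/fd" | wc -l)"
fi
```

The same pattern extends to fields such as `Threads`, `VmSize`, or `voluntary_ctxt_switches` when narrowing down what a process is doing.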

Linux Troubleshooting Methodology and Tools

Systematic approach to Linux troubleshooting using strace, lsof, /proc inspection, dmesg, and other tools. Ability to narrow down where a problem exists (kernel, application, network, permissions, etc.) and use appropriate tools to investigate. Understanding of file descriptor management, socket states, and connection issues.

Linux Memory Management and Virtual Memory

Understanding of physical vs. virtual memory, paging, swapping, memory mapping, and the page cache. Know how to interpret memory usage from /proc, understand OOM killer behavior, and diagnose memory-related performance issues. Understand memory isolation and how memory is allocated at the kernel level.
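
One way to make "interpret memory usage from /proc" concrete is to look at `MemAvailable` rather than `MemFree`, since a low `MemFree` alongside a large page cache is usually not memory pressure. The sketch below assumes a kernel that reports `MemAvailable` (3.14 or later); `mem_available_pct` is an illustrative name.

```bash
#!/usr/bin/env bash
# Report memory available without swapping, as a percentage of MemTotal.
# Reads a /proc/meminfo-style file (defaults to /proc/meminfo).
mem_available_pct() {
  local file="${1:-/proc/meminfo}"
  awk '
    $1 == "MemTotal:"     { total = $2 }
    $1 == "MemAvailable:" { avail = $2 }
    END {
      if (total <= 0) exit 1          # field missing or file unreadable
      printf "%.1f\n", 100 * avail / total
    }' "$file"
}

if [[ -r /proc/meminfo ]]; then
  echo "available: $(mem_available_pct)%"
fi
```

A value that trends toward zero while swap usage climbs is the kind of signal that tends to precede OOM killer activity, which shows up in `dmesg`.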

SRE/Networking Deep Dive

60 min, 6 focus topics, technical

What to Expect

Technical round (60 minutes) focused on networking protocols and distributed systems. Expect deep questions about TCP, TLS, HTTP, and DNS. You may be asked to walk through the complete request flow to a service like icloud.com, explaining each layer. The interviewer will probe your understanding of networking concepts, protocol interactions, and how these impact reliability and observability. For Senior level, expect questions about how networking issues manifest in production, how to monitor networking health, and how to design for network reliability.

Tips & Advice

Study the OSI model with deep focus on layers 3-7. Understand TCP in detail: the three-way handshake (SYN, SYN-ACK, ACK), connection states such as ESTABLISHED and TIME_WAIT, window size, retransmission, and congestion control. Understand DNS: query flow, caching, TTL implications, A/AAAA records. Understand TLS: handshake, certificate validation, cipher suites. Understand HTTP: status codes, headers, connection management, keep-alive. Practice walking through a complete request: DNS lookup (with caching), TCP connection establishment, TLS handshake, HTTP request/response. For Senior level, discuss how each layer can fail, what metrics to monitor, and how to design systems resilient to networking issues. Be able to explain network troubleshooting tools like tcpdump, netstat, dig, and curl, and how you'd use them to diagnose issues. Think about the load balancing implications of your networking knowledge.

Focus Topics

HTTP Protocol and Web Communication

Deep understanding of HTTP methods, status codes, headers, and connection management (HTTP/1.0 vs HTTP/1.1 keep-alive vs HTTP/2 vs HTTP/3). Understand caching headers, compression, and how these impact performance and reliability.

Network Troubleshooting and Observability

Practical use of networking tools: tcpdump, netstat, ss, dig, nslookup, curl, wget. Understanding of metrics to monitor: packet loss, latency, connection establishment time, DNS resolution time, TLS handshake time. Knowing how to set up alerts and dashboards for network health.

Network Request Flow and Distributed System Communication

Ability to trace a request through all layers: DNS resolution (with caching), TCP connection establishment, TLS handshake, HTTP request, processing, and response. Understanding of how failures at each layer manifest and what signals indicate problems. For Apple services, understanding iCloud request flow or similar.
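
To connect the layers to numbers you can measure, curl's `--write-out` timers can be turned into per-phase durations. This is a sketch assuming an HTTPS URL (for plain HTTP, `time_appconnect` is 0 and the TLS figure is meaningless); `phase_breakdown` is an illustrative name, and the "wait" phase lumps together sending the request and server processing.

```bash
#!/usr/bin/env bash
# Convert curl's cumulative timers into per-phase durations in milliseconds.
# Input on stdin: "namelookup connect appconnect starttransfer total" in seconds,
# as printed by, for example:
#   curl -so /dev/null -w '%{time_namelookup} %{time_connect} %{time_appconnect} %{time_starttransfer} %{time_total}\n' https://example.com/
phase_breakdown() {
  awk '{
    printf "dns=%.0fms tcp=%.0fms tls=%.0fms wait=%.0fms transfer=%.0fms\n",
      $1 * 1000,            # DNS resolution
      ($2 - $1) * 1000,     # TCP three-way handshake
      ($3 - $2) * 1000,     # TLS handshake
      ($4 - $3) * 1000,     # request sent until first response byte
      ($5 - $4) * 1000      # response body transfer
  }'
}

# Example with made-up timings:
echo "0.010 0.030 0.090 0.150 0.200" | phase_breakdown
```

Comparing these phases across regions or over time is one way to tell a slow resolver from a slow handshake or a slow backend.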

TLS/SSL Protocol and HTTPS

Understanding of TLS handshake, certificate validation, cipher suites, and how TLS impacts latency and connection setup time. Understand certificate pinning, certificate revocation, and common TLS-related issues in production. Know how TLS 1.2 and 1.3 differ.

TCP Protocol and Connection Management

Deep understanding of TCP including the three-way handshake, connection states (LISTEN, SYN_SENT, SYN_RECEIVED, ESTABLISHED, FIN_WAIT_1, FIN_WAIT_2, CLOSE_WAIT, CLOSING, LAST_ACK, TIME_WAIT, CLOSED), window size management, retransmission logic, and congestion control (slow start, congestion avoidance). Understand TIME_WAIT implications for connection reuse and ephemeral port exhaustion.
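
A quick way to see these states on a live host is to tally the first column of `ss -tan`. The sketch below assumes iproute2's `ss` output format, where the first line is a header and states are abbreviated (for example `ESTAB`, `TIME-WAIT`); `tcp_state_counts` is an illustrative name.

```bash
#!/usr/bin/env bash
# Count sockets per TCP state from `ss -tan` output on stdin.
# The first line of `ss -tan` is a column header, so it is skipped.
tcp_state_counts() {
  awk 'NR > 1 { count[$1]++ } END { for (s in count) print s, count[s] }' | sort
}

# Typical use on a live host:
#   ss -tan | tcp_state_counts
# Sample input, so the sketch runs anywhere:
tcp_state_counts <<'EOF'
State     Recv-Q Send-Q Local Address:Port Peer Address:Port
LISTEN    0      128    0.0.0.0:22         0.0.0.0:*
ESTAB     0      0      10.0.0.5:22        10.0.0.9:51234
TIME-WAIT 0      0      10.0.0.5:443       10.0.0.7:40000
TIME-WAIT 0      0      10.0.0.5:443       10.0.0.8:40001
EOF
```

A large and growing TIME-WAIT count on a client that opens many short-lived connections is the usual hint of looming ephemeral port exhaustion; a pile of CLOSE-WAIT sockets points at an application that is not closing its side.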

DNS Resolution and Caching

Understanding of DNS query flow, record types (A, AAAA, CNAME, MX, etc.), caching at multiple levels (resolver cache, OS cache, application-level caching), TTL implications, and DNS-related failure modes. Understand how DNS problems can cascade into application failures.
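
To see TTLs and record types in practice, the answer section of `dig` can be summarized per record type. This is a sketch; `answer_summary` is an illustrative name, and the sample records use documentation-only addresses rather than real ones.

```bash
#!/usr/bin/env bash
# Summarize the answer section of `dig +noall +answer <name>` read from stdin.
# Each answer line looks like: "<name> <ttl> <class> <type> <data>".
# Prints one line per record type with its count and the smallest TTL seen.
answer_summary() {
  awk 'NF >= 5 && $1 !~ /^;/ {
         n[$4]++
         if (!($4 in ttl) || $2 + 0 < ttl[$4]) ttl[$4] = $2 + 0
       }
       END { for (t in n) print t, "count=" n[t], "min_ttl=" ttl[t] }' | sort
}

# Typical use:
#   dig +noall +answer example.com A example.com AAAA | answer_summary
answer_summary <<'EOF'
example.com. 300 IN A 192.0.2.1
example.com. 120 IN A 192.0.2.2
example.com. 600 IN AAAA 2001:db8::1
EOF
```

When querying a caching resolver, the TTL shown counts down between queries, which is a simple way to confirm that an answer came from cache.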

Coding/Algorithms Assessment

60 min, 4 focus topics, technical

What to Expect

Coding round (45-60 minutes) where you'll solve 1-2 LeetCode-style problems at Easy to Medium difficulty, typically involving data structures like graphs (BFS/DFS traversal). You'll write code in your language of choice and explain your approach. The interviewer is assessing algorithmic thinking, code quality, ability to handle edge cases, and communication while coding. For Senior level, interviewers expect clean, well-structured code and thoughtful discussion of trade-offs.

Tips & Advice

Practice LeetCode Medium problems, particularly those involving graphs and tree traversal (BFS, DFS). Be comfortable coding in your preferred language - don't attempt to code in a language you're not fluent in. Write clean, readable code with meaningful variable names. Walk through your approach before coding - ask clarifying questions about constraints (input size, etc.). Discuss time and space complexity. Handle edge cases explicitly. For Senior level, think about optimization opportunities and discuss trade-offs. Test your code mentally with sample inputs. If stuck, communicate your thinking clearly and consider simpler approaches first.

Focus Topics

Code Quality and Communication

Writing clean, readable, well-structured code with meaningful variable names and comments where necessary. Walking through your approach clearly before coding. Explaining your logic and decisions as you code. Discussing edge cases and handling them explicitly.

Algorithm Complexity Analysis

Ability to analyze and articulate the time and space complexity of algorithms using Big O notation. Understand trade-offs between time and space. Be able to optimize algorithms and explain the improvements.

Data Structures Fundamentals

Solid understanding of fundamental data structures: arrays, linked lists, stacks, queues, hash tables, trees, and heaps. Know the time/space complexity of operations and when to use each. Be comfortable implementing basic versions of these.

Graph Algorithms (BFS and DFS)

Deep understanding of breadth-first search and depth-first search algorithms. Know how to implement both iteratively and recursively. Understand use cases for each approach and be able to solve problems involving graph traversal, connected components, shortest path, and tree traversal.
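
As a minimal sketch of iterative BFS, the function below finds the hop count of the shortest path in a small directed graph. It is written in Bash to match the script later in this guide (in the interview you would use your strongest language), needs bash 4+ for associative arrays, and the service-dependency graph is invented for illustration.

```bash
#!/usr/bin/env bash
# Breadth-first search over an adjacency list held in an associative array.
# Prints the hop count of the shortest path from <start> to <goal>, or -1.
declare -A GRAPH=(
  [web]="api cache"
  [api]="db queue"
  [cache]="db"
  [queue]="worker"
  [db]=""
  [worker]=""
  [orphan]=""
)

bfs_distance() {
  local start="$1" goal="$2"
  local -A dist=()
  local -a queue=()
  local head=0 node next
  dist["$start"]=0
  queue+=("$start")
  while (( head < ${#queue[@]} )); do
    node="${queue[head]}"            # dequeue by advancing an index
    head=$(( head + 1 ))
    if [[ "$node" == "$goal" ]]; then
      echo "${dist[$node]}"
      return 0
    fi
    for next in ${GRAPH[$node]:-}; do
      # Visit each node once: the first time BFS reaches it is the shortest path.
      if [[ -z "${dist[$next]+seen}" ]]; then
        dist["$next"]=$(( dist[$node] + 1 ))
        queue+=("$next")
      fi
    done
  done
  echo "-1"
  return 1
}

bfs_distance web worker   # web -> api -> queue -> worker
```

Time is O(V + E) and space is O(V) for the visited map and queue, which is the complexity discussion interviewers typically expect alongside the code.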

System Design Round

75 min, 6 focus topics, system design

What to Expect

System design round (60-75 minutes) where you'll design a large-scale distributed system. You may be asked to design something like a GitHub clone or similar service with focus on specific aspects like load balancing, observability, and reliability. You'll discuss architecture, components, data flows, and trade-offs. The interviewer will probe your thinking and likely ask follow-up questions about handling specific challenges. For Senior level, demonstrate deep understanding of distributed systems, ability to think through failure modes, and design for observability from the ground up.

Tips & Advice

Start by clarifying requirements and constraints - ask about scale (users, QPS, data volume), geography, consistency requirements, and what matters most (availability vs. consistency). Propose a high-level architecture with main components. For each component, discuss how it scales and where failures can occur. Design for observability from the start - what metrics, logs, and traces will you collect? Discuss load balancing strategy across components. Think about database choice and trade-offs. Discuss caching strategies. Address reliability: how do you handle component failures, how do you do deployments without downtime, what's your SLO? For Apple's focus on observability, emphasize how you'd monitor this system to understand its health and behavior. Be prepared to dive deep into one area based on interviewer's questions. Show your thinking process, don't just present a solution.

Focus Topics

Database Design and Trade-offs

Understanding relational vs. NoSQL databases and when to use each. Thinking through consistency models (strong, eventual), replication strategies, sharding, and backup/recovery. Discussing performance implications and trade-offs.

Deployment, Rollback, and Change Management

How you'd deploy changes safely: blue-green deployments, canary deployments, staged rollouts. How you'd roll back if something goes wrong. Minimizing blast radius of changes. Coordinating changes across multiple services.

Handling Failure Modes and Resilience

Thinking through what can fail (server crashes, network partitions, storage failures, etc.) and how you'd handle each. Designing for graceful degradation, failover, redundancy. Understanding CAP theorem and consistency implications. Designing recovery procedures.

Observability and Monitoring Design

Designing systems to be observable from the start: what metrics would you collect (latency, error rate, throughput, resource utilization)? What logs would you generate? How would you instrument requests to trace them across services? Designing alerts that indicate real problems. Understanding of SLIs, SLOs, and error budgets.

Load Balancing Strategies and Techniques

Understanding of load balancing approaches (round-robin, least connections, consistent hashing, etc.) and when to use each. Understanding of load balancing at different layers (L4 vs L7). Designing systems that distribute load effectively and handle load balancer failures. Understanding sticky sessions and their implications.
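
Consistent hashing is the approach here that benefits most from a worked example. The sketch below builds a small hash ring with virtual nodes; `cksum` (CRC-32) stands in for a real hash function, and `hash32`, `owner_for`, `VNODES`, and the backend names are all illustrative assumptions. It favors clarity over speed.

```bash
#!/usr/bin/env bash
# Minimal consistent-hash ring. cksum (CRC-32) stands in for a real hash function.
VNODES=20   # virtual nodes per backend; more vnodes gives a smoother distribution

hash32() { printf '%s' "$1" | cksum | cut -d' ' -f1; }

# owner_for <key> <backend>... : print the backend that owns <key>.
owner_for() {
  local key="$1"; shift
  local key_hash backend i
  key_hash="$(hash32 "$key")"
  # Emit "<hash> <backend>" for every virtual node, sort into ring order,
  # then pick the first virtual node at or after the key's position.
  for backend in "$@"; do
    for (( i = 0; i < VNODES; i++ )); do
      printf '%s %s\n' "$(hash32 "${backend}#${i}")" "$backend"
    done
  done | sort -n | awk -v k="$key_hash" '
    NR == 1 { first = $2 }                  # wrap-around target
    $1 >= k { print $2; found = 1; exit }
    END     { if (!found) print first }'
}

owner_for "user-42" lb-a lb-b lb-c
```

The property worth stating in an interview is that removing a backend only moves the keys that backend owned; keys owned by the remaining backends stay put, unlike with `hash(key) % N`.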

Distributed System Architecture Design

Ability to design scalable architectures with multiple components: load balancers, API servers, databases, caches, message queues, etc. Understanding of service-oriented architecture, microservices, and when to split systems. Thinking through communication patterns between services and consistency implications.

Behavioral and Leadership Interview

45 min, 6 focus topics, behavioral

What to Expect

Interview round (45-60 minutes) focused on behavioral assessment, leadership, and cultural fit. Expect questions about past experiences handling incidents, making trade-offs, collaborating with teams, and influencing decisions. The interviewer (often a manager or senior engineer) will probe your approach to problem-solving, how you handle pressure, your communication style, and how you work with others. For Senior level, expect deeper questions about mentoring, project leadership, and how you balance competing priorities. This round also includes your opportunity to ask questions about the team, role, and Apple's SRE culture.

Tips & Advice

Prepare specific stories using the STAR method (Situation, Task, Action, Result) for: a major incident you handled, a reliability problem you solved, a time you collaborated effectively across teams, a time you had to make a trade-off between speed and reliability, a time you mentored someone, and a time you learned from a mistake. For Senior level, emphasize your leadership approach, how you influence teams, and how you think about technical strategy. Have concrete metrics or outcomes for your stories. Prepare thoughtful questions about Apple's SRE practices, the team's current reliability challenges, and how the role contributes to the organization. Research Apple's focus on reliability and user experience, and connect your approach to those values.

Focus Topics

Reliability Engineering Philosophy and Strategy

Your perspective on what makes systems reliable, how to approach reliability holistically, and your vision for SRE practices. For Senior level, discuss how you've influenced reliability culture in previous roles and your strategic thinking about reliability.

Problem-Solving Approach and Learning from Failures

Describing your systematic approach to solving complex problems: how you break down unknowns, how you gather information, how you test hypotheses. Show examples of difficult problems you've solved. Discuss times you've failed and what you learned.

Reliability Trade-offs and Decision-Making

Ability to discuss situations where you balanced competing priorities: speed to market vs. reliability, cost vs. redundancy, automation effort vs. manual work, etc. Show systematic thinking about trade-offs and willingness to make pragmatic decisions based on context.

Cross-functional Collaboration and Communication

Examples of working effectively with development teams, product managers, and other disciplines. Ability to communicate complex technical issues to non-technical audiences. Demonstrating that you can influence decisions and drive change across teams.

Technical Mentoring and Leadership

For Senior level, describe your approach to mentoring junior engineers: how you help them grow, how you delegate, how you ensure they have learning opportunities. Show examples of engineers you've mentored and their growth. Discuss how you approach leading projects and influencing team decisions.

Incident Response and Post-Incident Learning

Ability to describe your approach to incident response: how you identify the problem, coordinate resolution, communicate with stakeholders, and conduct blameless post-mortems. For Senior level, discuss how you've led incident response, mentored junior engineers through incidents, and used incidents as learning opportunities. Show understanding that incidents are learning opportunities and shouldn't result in blame.

Frequently Asked Site Reliability Engineer (SRE) Interview Questions

Cross Functional Collaboration and Coordination (Medium, Technical)

You must negotiate an error budget policy with multiple product teams that have differing risk tolerances: some want continuous deployments while others prefer stability. Create a negotiation approach that proposes metrics to measure burn, governance for spending the error budget, rollback conditions, exemptions, and how you will track adherence over time.

Sample Answer

Situation: Different product teams need a single, practical error-budget policy but have different risk appetites—some want continuous deploys, others prioritize stability.

Approach (negotiation + implementation):

1. Align goals first
- Workshop with stakeholders to agree on SLO goals per service class (business-critical, revenue-facing, non-critical).
- Map risk tolerance to SLO targets (e.g., 99.95% for critical, 99.9% for less critical).

2. Metrics to measure burn
- Error budget remaining (%) = 1 - (bad events observed in the SLO window / bad events the SLO allows in that window), where allowed bad events = (1 - SLO target) × total events.
- Burn rate = observed error rate over the last X hours / (1 - SLO target). A burn rate of 1 spends the budget exactly over the SLO window; 2 spends it in half the window.
- User-impacted minutes/hours, incident count by severity, MTTR.
- Canary failure rate, rollout failure percentage.

3. Governance for spending the budget
- Automated gates: if burn_rate > threshold (e.g., 2x expected) for N minutes, suspend risky activities (e.g., feature flags, mass deploys).
- Approval tiers: minor spend (<5% of budget) is auto-approved; medium (5–20%) requires TPM + SRE sign-off; >20% requires product lead sign-off and a war room.
- Spend framework: experiments and timeboxed rollouts are permitted up to a cap; all spends are logged with a justification and rollback plan.

4. Rollback conditions
- Automatic rollback triggers: canary failure rate > X% OR user-impact minutes exceed Y in T minutes.
- Progressive rollback policy: step back to the last healthy canary and freeze further canary promotions until root cause analysis is complete.
- Manual override path for emergency releases with documented risk acceptance.

5. Exemptions
- Planned maintenance and scheduled migrations are allowed but must be pre-declared, timeboxed, and either excluded from SLO windows or compensated with temporary SLO adjustments.
- Pilot experiments: allow narrow-scope exemptions (a small percentage of users) with strict caps and SRE monitoring.

6. Tracking adherence over time
- Single dashboard showing SLOs, budget remaining, burn rate per team, active exemptions, and recent spends.
- Weekly reliability review for teams exceeding thresholds; monthly executive report summarizing trends and root-cause actions.
- Quarterly policy retrospective to tune thresholds, based on incident postmortems and business impact.

Why this works:
- Balances autonomy (continuous deployment for tolerant teams via canaries/feature flags) with protection for conservative teams (stricter SLOs and automated gates).
- Combines automated enforcement with human governance for significant risks.
- Transparent metrics and regular reviews create feedback loops to negotiate adjustments objectively.
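
The burn metrics from step 2 can be computed directly from event counts. The sketch below uses the common request-based definitions (budget = (1 - SLO target) × total events; burn rate = observed error rate / (1 - SLO target)); the function names are illustrative and it assumes an SLO target strictly below 1.

```bash
#!/usr/bin/env bash
# error_budget_remaining <slo_target> <total_events> <bad_events>
# Fraction of the window's error budget still unspent (negative once overspent).
error_budget_remaining() {
  awk -v slo="$1" -v total="$2" -v bad="$3" 'BEGIN {
    allowed = (1 - slo) * total        # bad events the SLO tolerates in the window
    printf "%.2f\n", 1 - bad / allowed
  }'
}

# burn_rate <slo_target> <total_events> <bad_events>, over a short lookback.
# 1.00 means the budget would last exactly one SLO window at this pace.
burn_rate() {
  awk -v slo="$1" -v total="$2" -v bad="$3" 'BEGIN {
    printf "%.2f\n", (bad / total) / (1 - slo)
  }'
}

# 99.9% SLO, 1,000,000 requests so far this window, 250 failed:
error_budget_remaining 0.999 1000000 250
# Same SLO, last hour saw 10,000 requests with 20 failures:
burn_rate 0.999 10000 20
```

Putting numbers like these on the shared dashboard gives teams with different risk appetites the same unit to negotiate over.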

Fault Tolerance and System Resilience (Medium, Technical)

Design the configuration and alerting strategy for a circuit breaker guarding calls to a flaky downstream service. Specify metrics to track (error-rate, latency, volume), thresholds for tripping, and auto-recovery behavior. Explain how you would avoid oscillation and ensure human-readable alerts.

Sample Answer

Situation & goal: Protect callers from a flaky downstream service while keeping it available when healthy. Use a circuit-breaker with metric-driven thresholds, smoothing to avoid oscillation, progressive auto-recovery, and clear, actionable alerts.

Metrics to track (per host/endpoint and aggregated):
- Error rate (% of failed calls: 5xx, timeouts, connection errors) over sliding windows (1m, 5m).
- Latency P50/P95/P99 and % of calls above the SLO (e.g., >500ms).
- Volume (requests/sec) and concurrent in-flight requests.
- Success rate of half-open probe requests.

Example thresholds (tunable per service):
- Trip to OPEN when: (error_rate_1m ≥ 20% AND total_volume_1m ≥ 50 requests) OR P95 latency_1m ≥ 1s sustained for 1m.
- Minimum open time: 30s (prevents chattering).
- Backoff and auto-recovery: after the minimum open time, enter HALF-OPEN and allow a small probe ratio (e.g., 5% of traffic or 5 probe requests). If the probe success rate is ≥ 80% over 1m, transition to CLOSED; otherwise double the open duration (exponential backoff up to a cap, e.g., 10m) and retry.
- Circuit state should be persisted briefly to survive restarts (in-memory plus a shared cache for cluster-wide behavior).

Avoiding oscillation:
- Use dual-window checks (short and medium windows) and require both to indicate failure before tripping (e.g., a 1m spike must coincide with 5m degradation).
- Hysteresis: different thresholds for opening vs. closing (open at 20% error, close only when <5% for 5m).
- Minimum open time + exponential backoff + randomized jitter on retry timings.
- Rate-limit probe traffic and use a quorum across instances in multi-instance deployments.

Human-readable alerts:
- Alert title: "[P1] Circuit OPEN: payments-service -> billing (region-us-east)". Include state, scope, affected endpoints, and the last 5m error rate and P95.
- Severity mapping: P1 for a full open affecting the SLO, P2 for degraded (error_rate 10–20%), P3 for elevated latency.
- Body: concise impact, evidence (metrics and thresholds hit), runbook link with immediate actions (e.g., rollback, scale, investigate downstream), service owner, and suggested commands/log queries.
- Avoid alert noise: only fire alerts when the circuit opens (and when it remains open beyond escalation windows), and send recovery/clear alerts once it has been stably closed for X minutes.
- Include automated paging for P1, and aggregate related circuits into a single incident when the same downstream is failing.

Why this works: it combines short/medium windows and hysteresis to avoid reacting to spikes, uses a safe probe strategy to recover progressively, and provides actionable, human-friendly alerts so responders know the impact and next steps.
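
The tripping and recovery rules can be sketched as a small state function. This is a simplification under stated assumptions: error rates are integer percentages, the 5m confirmation threshold of 10% is an assumed value (the answer above does not give one), and HALF_OPEN closes on the 5m error rate rather than on probe success. All names are illustrative.

```bash
#!/usr/bin/env bash
TRIP_PCT_1M=20     # short-window error rate that can trip the breaker
TRIP_PCT_5M=10     # ASSUMED medium-window confirmation level
CLOSE_PCT_5M=5     # hysteresis: must drop below this before closing again
MIN_VOLUME_1M=50   # ignore error rates computed from too few requests

# next_state <state> <error_pct_1m> <error_pct_5m> <requests_1m>
next_state() {
  local state="$1" err1="$2" err5="$3" volume="$4"
  case "$state" in
    CLOSED)
      # Both windows must agree, and there must be enough traffic to trust them.
      if (( volume >= MIN_VOLUME_1M && err1 >= TRIP_PCT_1M && err5 >= TRIP_PCT_5M )); then
        echo OPEN
      else
        echo CLOSED
      fi ;;
    HALF_OPEN)
      if (( err5 < CLOSE_PCT_5M )); then echo CLOSED; else echo OPEN; fi ;;
    OPEN)
      # Leaving OPEN is time-driven (minimum open time plus backoff).
      echo OPEN ;;
    *)
      echo "unknown state: $state" >&2
      return 2 ;;
  esac
}

# open_duration <consecutive_failed_probes>: 30s doubled per failure, capped at 600s.
open_duration() {
  local failures="$1" seconds=30
  while (( failures > 0 && seconds < 600 )); do
    seconds=$(( seconds * 2 ))
    failures=$(( failures - 1 ))
  done
  if (( seconds > 600 )); then seconds=600; fi
  echo "$seconds"
}

next_state CLOSED 25 12 200   # spike confirmed by the 5m window
next_state CLOSED 25 3 200    # 1m spike alone does not trip
open_duration 3
```

The gap between the 20% trip level and the 5% close level is what keeps a service hovering around 10–15% errors from flapping between states.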

Database Selection and Trade Offs (Medium, Technical)

Compare managed relational offerings (AWS RDS Postgres, Google Cloud Spanner, Azure Cosmos DB SQL API) for a globally-distributed metadata service requiring consistent reads/writes across regions with 99.99% availability. Discuss trade-offs in latency, consistency model, operational overhead, operational tooling, and cost under expected scale.

Sample Answer

Situation/goal: We need a globally-distributed metadata service with strongly consistent reads/writes across regions and 99.99% availability. Below I compare AWS RDS (Postgres), Google Cloud Spanner, and Azure Cosmos DB (SQL API) across latency, consistency, operational overhead, tooling, and cost from an SRE perspective.

1) Latency
- Spanner: Optimized for geo-distribution with TrueTime; single-digit to low-double-digit ms for reads/writes in a regional configuration. Writes in a multi-region configuration incur Paxos/TrueTime coordination across regions, so latency is higher but predictable.
- Cosmos DB: Multi-region writes with configurable consistency; low read latency (roughly single-digit ms) via local replicas; multi-region writes give low write latency, but conflicts are possible and must be resolved.
- RDS Postgres: Best for single-region low latency. Cross-region replication (RDS cross-region read replicas, or Aurora Global Database if you move to Aurora) is asynchronous, so all writes go to one primary region and remote writers pay the cross-region round trip.

2) Consistency model
- Spanner: Strong external consistency globally, which is ideal for metadata requiring global serializability.
- Cosmos DB: Tunable (Strong, Bounded Staleness, Session, Consistent Prefix, Eventual). Strong consistency is available only for accounts with a single write region; accounts with multi-region writes cannot use Strong and need conflict handling.
- RDS: Strong consistency within the primary region. Managed cross-region replication is asynchronous, so reads from other regions are eventually consistent.

3) Operational overhead & tooling
- Spanner: Minimal operational burden for replication and partitioning; online schema changes; monitoring via Cloud Console and Cloud Monitoring (formerly Stackdriver); fewer operational tasks.
- Cosmos DB: Managed scale, with multi-region configuration easy in the portal/ARM; you need to manage RU provisioning, conflict resolution, and indexing policies; good metrics and diagnostics in Azure Monitor.
- RDS: Routine work (backups, in-region failover, read replicas) is handled for you, but you manage replica topology, failover testing, schema migrations, and cross-region failover playbooks; higher runbook complexity.

4) Cost under expected scale
- Spanner: Higher baseline cost (provisioned compute) but predictable at scale; cost-effective for strong global consistency and heavy write workloads.
- Cosmos DB: Pay-per-throughput (RUs) can be expensive if metadata traffic is spiky; multi-region writes increase cost; may be cheaper for read-heavy patterns.
- RDS: Lower per-instance cost for a single region; cross-region replicas and high availability add cost; at global scale, operational and replication costs can exceed those of managed global databases.

Recommendation (SRE view)
- If strict global strong consistency and simplified ops are primary: choose Spanner.
- If you need flexible consistency, very low local read latency, and multi-region writes with careful conflict handling: consider Cosmos DB, accepting that it will not give strong consistency across write regions.
- If the workload is primarily single-region or you can tolerate eventual cross-region consistency and want the SQL/Postgres ecosystem: RDS Postgres (or Aurora Global Database) is appropriate but requires more operational runbooks to reach 99.99% globally.

Operational notes: validate failover scenarios, set SLOs for cross-region write latencies, automate chaos/failover tests, monitor tail latencies and conflict rates (Cosmos DB), and model costs with realistic RU/node usage and multi-region replicas.

Bash and Shell Scripting (Medium, Technical)

Write a Bash script that atomically updates a systemd unit file with provided content: write to a temporary file, validate the unit syntax using 'systemd-analyze verify', back up the existing unit (with timestamp), move the new file into place, run 'systemctl daemon-reload', restart the service, and rollback to backup if verification or restart fails. Include exit codes and logging to /var/log/deploy.log.

Sample Answer

Approach: write new unit to a temp file, verify it, backup current unit with timestamp, atomically move new file into place (mv), reload systemd, restart service; if verify or restart fails, restore backup and reload/restart. Log every step to /var/log/deploy.log and return meaningful exit codes.

bash

#!/usr/bin/env bash
set -euo pipefail

LOG="/var/log/deploy.log"
EXIT_OK=0
EXIT_INVALID=2
EXIT_RESTART_FAIL=3
EXIT_USAGE=4

log(){ echo "$(date -u +"%Y-%m-%dT%H:%M:%SZ") $*" | tee -a "$LOG"; }

# check arguments before reading $1/$2, otherwise 'set -u' aborts before the usage message
if [[ $# -ne 2 ]]; then
  log "ERROR: usage: $0 <unit> <new_unit_file>"
  exit $EXIT_USAGE
fi

UNIT_NAME="$1"        # e.g. myservice.service
NEW_CONTENT_FILE="$2" # path to file containing new unit content
TIMESTAMP="$(date +%Y%m%d%H%M%S)"
SYSTEM_UNIT_DIR="/etc/systemd/system"
UNIT_PATH="$SYSTEM_UNIT_DIR/$UNIT_NAME"
BACKUP_PATH="${UNIT_PATH}.bak.${TIMESTAMP}"
NEW_PATH="${UNIT_PATH}.new.${TIMESTAMP}"

# stage the file under its real unit name in a private directory:
# systemd-analyze verify needs a valid unit suffix (.service etc.), not .tmp
STAGE_DIR="$(mktemp -d)"
TMP_PATH="${STAGE_DIR}/${UNIT_NAME}"
trap 'rm -rf "$STAGE_DIR"' EXIT

# restore the previous unit, or remove the new one if there was no previous unit
rollback(){
  if [[ -f "$BACKUP_PATH" ]]; then
    cp -f -- "$BACKUP_PATH" "$UNIT_PATH"
  else
    rm -f -- "$UNIT_PATH"
  fi
  systemctl daemon-reload >>"$LOG" 2>&1 || true
}

# write provided content to temp path (ensure readable)
cp -- "$NEW_CONTENT_FILE" "$TMP_PATH"
chmod 0644 "$TMP_PATH"
log "Wrote new unit to $TMP_PATH"

# verify syntax
if ! systemd-analyze verify "$TMP_PATH" 2>>"$LOG"; then
  log "ERROR: systemd-analyze verify failed for $TMP_PATH"
  exit $EXIT_INVALID
fi
log "Verified unit syntax OK"

# backup existing if present
if [[ -f "$UNIT_PATH" ]]; then
  cp -- "$UNIT_PATH" "$BACKUP_PATH"
  log "Backed up existing unit to $BACKUP_PATH"
fi

# atomic replace: copy next to the target first, so the final mv is a
# same-filesystem rename rather than a cross-filesystem copy
cp -- "$TMP_PATH" "$NEW_PATH"
mv -f -- "$NEW_PATH" "$UNIT_PATH"
log "Moved new unit into place: $UNIT_PATH"

# reload daemon
if ! systemctl daemon-reload >>"$LOG" 2>&1; then
  log "ERROR: daemon-reload failed; attempting rollback"
  rollback
  exit $EXIT_INVALID
fi
log "systemd daemon-reload OK"

# restart service
if ! systemctl restart "$UNIT_NAME" >>"$LOG" 2>&1; then
  log "ERROR: restart failed; rolling back"
  rollback
  if [[ -f "$BACKUP_PATH" ]]; then
    systemctl restart "$UNIT_NAME" >>"$LOG" 2>&1 || log "WARN: rollback restart may have failed"
  fi
  exit $EXIT_RESTART_FAIL
fi

log "Restart succeeded for $UNIT_NAME"
exit $EXIT_OK

Key points:
- Replaces the unit file with mv, which is atomic only when the source and destination are on the same filesystem.
- Verifies before replacing to avoid deploying an invalid unit.
- Keeps a timestamped backup to restore on failure.
- Logs all actions to /var/log/deploy.log and returns explicit exit codes for an invalid unit, restart failure, or usage error.

Incident Leadership and Postmortems (Easy, Behavioral)

Tell me about a time when you served as Incident Commander or supported an IC during a major outage. Describe the situation using the STAR format: the context, the specific actions you took to stabilize systems, how you communicated with engineers and nontechnical stakeholders, and what the measurable outcome was.

Data Structures and Complexity (Hard, Technical)

You need to compute 99th percentile latency per service in real-time with bounded memory and mergeable summaries across shards. Compare reservoir sampling, t-digest, Greenwald-Khanna (GK) algorithm, and fixed histograms: describe update complexity, memory vs accuracy tradeoffs, and which you'd pick for SRE telemetry focusing on high quantiles.

Sample Answer

Situation: We need an online, bounded-memory, mergeable per-service 99th-percentile telemetry across shards. Below I compare four summaries on update cost, memory vs accuracy, mergeability, and practical fit for SRE high-quantile monitoring, then give a recommendation.

Reservoir sampling- Update complexity: O(1) per sample (constant-time reservoir replace with random).- Memory vs accuracy: Fixed memory but accuracy for tails is poor — uniform sample approximates distribution overall, so rare high-latency events (99th) are likely under-sampled unless reservoir very large.- Mergeability: Simple (concatenate or reservoir-merge strategies), but merged reservoir still loses tail fidelity.- Verdict: Easy and lightweight, but not suitable when you care about high quantiles.

t-digest
- Update complexity: amortized O(log k) per update (k = number of centroids, set by the compression factor); can be near O(1) when updates are buffered and then compressed in batches.
- Memory vs accuracy: Very good tail accuracy with small memory; the compression parameter trades memory for fidelity. Designed to keep small, high-resolution centroids in the tails.
- Mergeability: Excellent. Digests merge naturally with controlled error.
- Verdict: Strong candidate for 99th percentile in distributed telemetry.

Greenwald-Khanna (GK)
- Update complexity: roughly O(log(1/ε)) amortized per insert (a search over the summary), plus occasional compress steps.
- Memory vs accuracy: Deterministic error guarantee: returns the φ-quantile within ±εN rank. Worst-case memory is O((1/ε) log(εN)). For the 99th percentile the rank error must be well below 0.01 to be meaningful (for example ε = 0.001), so tight tail accuracy means small ε and larger memory.
- Mergeability: Non-trivial. Merging two GK summaries requires careful recomputation or accepts extra error; some implementations support merge at the cost of added complexity.
- Verdict: Good when deterministic bounds are required; less practical if you need tiny error at extreme tails without large memory.

Fixed histograms (fixed buckets)
- Update complexity: O(1) per sample (increment a bucket).
- Memory vs accuracy: Memory equals the number of buckets. Good 99th-percentile accuracy may need many buckets or adaptive bucketing; linear and logarithmic bucket layouts trade range for resolution. Accuracy is poor when the distribution shifts or when you need fine tail resolution.
- Mergeability: Trivial to merge (sum counts per bucket).
- Verdict: Very simple and fast; choose it if latency ranges are known and limited, or for coarse SLOs.

Recommendation for SRE telemetry focusing on high quantiles
- Primary pick: t-digest. It gives compact, mergeable summaries with strong empirical accuracy in the tails, small memory, and easy per-shard merge. Use a compression setting tuned for tail fidelity (higher compression or a tail-biased variant). Buffer small batches and compress to reduce update overhead if necessary.
- If you need strict deterministic rank error or regulatory guarantees: use GK with an appropriately small ε (accept higher memory).
- Use fixed histograms as a complementary low-cost aggregate (fast, exact bucket counts) or as a fallback for very high-cardinality, low-overhead metrics.
- Avoid reservoir sampling for 99th-percentile alerting unless you can afford very large reservoirs.

Practical notes
- Tune t-digest compression to meet SLO false-alarm tolerances; validate with synthetic tails and production traces.
- Always test merges across realistic shard skews.
- Add percentile alerting with burn-in thresholds and anomaly smoothing to avoid noisy alerts from rare outliers.
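The fixed-histogram option is the easiest of the four to show end to end. The following is a minimal sketch, not a production implementation: the function names (to_histogram, merge_histograms, quantile) and the power-of-two bucket bounds in milliseconds are assumptions made for illustration. It shows why merging is exact (counts are summed per bucket) and why the answer is only as fine as the bucket that contains the requested rank.

```bash
#!/usr/bin/env bash
# Sketch: mergeable fixed-bucket latency histogram using awk and sort.
# ASSUMPTION: bucket upper bounds are powers of two, in milliseconds.

# to_histogram: read integer latency samples (ms, one per line) on stdin,
# print "upper_bound count" lines sorted by bound.
to_histogram() {
  awk 'NF { b = 1; while (b < $1) b *= 2; c[b]++ }
       END { for (b in c) print b, c[b] }' | sort -n
}

# merge_histograms: read one or more concatenated histograms on stdin and
# sum the counts per bucket. Exact because every shard uses the same buckets.
merge_histograms() {
  awk 'NF { c[$1] += $2 } END { for (b in c) print b, c[b] }' | sort -n
}

# quantile PCT: read a histogram on stdin and print the upper bound of the
# bucket holding the PCT-th percentile (nearest-rank, integer PCT).
quantile() {
  sort -n | awk -v pct="$1" '
    NF { n++; bound[n] = $1; count[n] = $2; total += $2 }
    END {
      if (total == 0) exit 1
      rank = int((total * pct + 99) / 100)   # ceil(total * pct / 100)
      for (i = 1; i <= n; i++) {
        seen += count[i]
        if (seen >= rank) { print bound[i]; exit 0 }
      }
    }'
}

# Two shards: 98 fast requests on one, 2 slow requests on the other.
shard_a=$(awk 'BEGIN { for (i = 0; i < 98; i++) print 10 }' | to_histogram)
shard_b=$(printf '%s\n' 900 900 | to_histogram)

merged=$(printf '%s\n%s\n' "$shard_a" "$shard_b" | merge_histograms)
printf '%s\n' "$merged" | quantile 50   # prints 16
printf '%s\n' "$merged" | quantile 99   # prints 1024
```

The p99 result is reported as 1024 ms although the real samples were 900 ms. That is the resolution cost of coarse buckets, and it is the gap that t-digest is designed to close in the tails.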

Cross Functional Collaboration and Coordination (Hard, Technical)

38 practiced

During a multi-region outage affecting EMEA and APAC where data residency and local regulators must be notified, outline the incident coordination plan. Include roles, cross-region escalation triggers, regulatory notification timelines and owners, region-specific customer communications, and how to prepare the post-incident compliance report.

Sample Answer

Framework: treat this as a cross-functional, compliance-first incident. I’d run a two-track response: Operational (restore service) and Regulatory/Customer (notifications, record-keeping).

Immediate actions (0–30m)
- Incident Commander (IC, SRE on-call) declares a multi-region outage and activates the Incident Response War Room.
- Roles: IC (coordinates), Recovery Lead (engineering/ops), Communications Lead (PR/Customer Ops), Compliance/Legal Owner, Regional Ops Leads (EMEA, APAC, NA), Escalation Exec (VP Ops).
- Triage: impact scope (services, tenants, data residency boundaries), data classification (PII/regulated), affected regions/customers.

Escalation triggers
- Service unavailability >15% of regional traffic or >5 minutes for critical services → escalate to Recovery Lead.
- Evidence of data access/exfiltration or PII exposure → immediate Compliance/Legal escalation.
- Outage >1 hour or impacting SLAs/SLOs → Exec escalation and formal customer comms.

Regulatory notification timelines & owners
- GDPR (EU): notify the supervisory authority (DPA) without undue delay and, where feasible, within 72 hours of becoming aware of a personal data breach. Owner: Compliance/Legal + Regional Ops Lead (EMEA).
- APAC: country-specific. For example, Australia (OAIC) requires notification of eligible data breaches as soon as practicable, with up to 30 days to assess a suspected breach; Singapore (PDPC) requires notification within 3 calendar days of assessing a breach as notifiable. Owner: Compliance + APAC Lead.
- If unsure, default: notify Compliance within 1 hour to assess; if the incident is classified as notifiable, prepare the regulator notice within the strictest applicable timeline (often 24–72h). Compliance owns filings; IC/Recovery provides technical facts.

Region-specific customer communications
- Within 1 hour: a brief "We are aware, investigating" notice sent by the Communications Lead to affected customers in their local language and timezone; include region, scope, and ETA for the next update.
- Update cadence: every 30–60 minutes while the incident is active, then hourly as the service stabilizes; final root cause and remediation summary within 72 hours.
- For regulated customers (banking/health): personalized outreach via account teams using the required regulatory templates (include data types affected, mitigations, contact for follow-up).

Evidence & logging (during incident)
- Preserve logs, packet captures, config changes, and deployment history; freeze relevant systems for forensic integrity (at Compliance direction).
- Timestamp all actions in the incident timeline; assign a note-taker.

Post-incident compliance report (deliver within 7–14 days; initial draft within 72h for regulators)

Contents:
- Executive summary (impact, regions, duration)
- Timeline of events with timestamps and owners' actions
- Root cause analysis and technical details (what failed, why)
- Data residency and data exposure assessment (what data, number of users)
- Mitigations applied and remediation plan with owners and ETA
- Regulatory notifications sent (date, recipient, content), customer communications log
- Evidence appendices (logs, config diffs, snapshots)
- Preventive controls and SLO/alerting changes

Owners & process:
- Draft by Recovery Lead + SRE, validated by Compliance/Legal, finalized by IC. Share with Execs and regulators as required.

Lessons & follow-up:
- Create 30/60/90-day remediation tickets and hold a mandatory post-incident review meeting with executive sign-off on closure.

This plan balances rapid engineering response with compliance timelines and clear ownership to satisfy regulators and maintain customer trust.

Fault Tolerance and System Resilience (Easy, Technical)

59 practiced

Compare backpressure and rate limiting. For an asynchronous ingest pipeline composed of API gateway -> ingress service -> queue -> worker pool, indicate where backpressure should be applied versus where rate limits should be enforced, and explain why.

Database Selection and Trade Offs (Easy, Technical)

40 practiced

Explain why time-series databases (InfluxDB, Prometheus, TimescaleDB) are optimized for metrics and events. For a monitoring workload with 100k metrics at 10s resolution, describe how compression, retention policies, downsampling/rollups, cardinality and ingestion rate influence your choice, and how you'd design retention tiers and queries to meet both performance and cost goals.

Sample Answer

Time-series DBs are optimized for metrics/events because they store append-only, time-keyed data, enabling block/columnar storage, efficient compression, and fast range queries. They provide indexed time partitioning, native aggregation, and lifecycle policies (retention, downsampling) tailored to monitoring workloads.

Scenario: 100k metrics @ 10s → 10k samples/second, i.e. 600k samples/minute, 36M/hour, and about 864M/day. Key factors and design:

- Compression: Choose a TSDB with delta/bit-packing and Gorilla-style compression (Prometheus TSDB, Influx/Timescale all support variants). Good compression reduces disk I/O and storage cost; expect 5–20× depending on metric sparsity.
- Ingestion rate: Ensure write throughput (parallel ingestion, batching). For 100k series at 6 writes/minute each (10s), provision CPU and write-ahead log throughput; use sharding/HA instances (Prometheus remote-write to Cortex/Thanos or Timescale multi-node).
- Cardinality: 100k series is moderate but can explode with high-cardinality labels. Limit them (use service, job, host but avoid per-request IDs). Use relabeling/metric aggregation at scrape time.
- Retention & downsampling: Design tiers:
  - Hot tier (raw 10s): keep 7–14 days for detailed troubleshooting and alert fidelity.
  - Warm tier (1m or 5m rollups): keep 60–90 days for trend analysis.
  - Cold tier (1h/1d aggregates): keep 1–3 years for capacity planning.
  Implement automated rollups: aggregate raw to 1m/5m/1h using continuous queries or compaction jobs, then drop raw data older than the hot retention.
- Queries & performance: Route high-frequency dashboards and alerts to the hot tier (fast storage, cache). Use pre-aggregated series for long-range graphs. For expensive ad-hoc queries, query downsampled data or use background materialized views.
- Cost trade-offs: Balance disk vs. compute; more aggressive downsampling and shorter raw retention greatly reduce storage. Limit cardinality to reduce index size and CPU.

In practice: scrape relabel to reduce labels, remote-write raw to a horizontally scalable store (Cortex/Thanos/Influx Enterprise/Timescale), configure continuous downsampling jobs, set retention per tier, and provision nodes for peak ingestion. Monitor ingestion lag, compaction times, and query latencies and iterate.
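The tier sizing can be checked with integer arithmetic. The sketch below is illustrative only: the helper names (samples_for, report_tier) are hypothetical, the retention values are the upper ends of the ranges above, and bytes_per_sample=2 is an assumed average compressed size that should be replaced with a figure measured on real data.

```bash
#!/usr/bin/env bash
# Sketch: sample counts and rough storage per retention tier.
series=100000          # active series, from the scenario
bytes_per_sample=2     # ASSUMPTION: average compressed bytes per sample

# samples_for INTERVAL_SECONDS RETENTION_DAYS
# Total samples retained across all series at the given resolution.
samples_for() {
  echo $(( series * (86400 / $1) * $2 ))
}

# report_tier NAME INTERVAL_SECONDS RETENTION_DAYS
report_tier() {
  local n
  n=$(samples_for "$2" "$3")
  printf '%-5s %14d samples  ~%d GB\n' \
    "$1" "$n" $(( n * bytes_per_sample / 1000000000 ))
}

echo "ingest: $(( series / 10 )) samples/s"
report_tier hot  10   14     # raw 10s data, 14 days
report_tier warm 60   90     # 1m rollups, 90 days
report_tier cold 3600 1095   # 1h rollups, 3 years
```

Under these assumptions the warm tier (1m rollups for 90 days) holds slightly more samples than the raw hot tier (10s for 14 days), so retention length matters as much as resolution when estimating cost.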

Bash and Shell Scripting (Medium, Technical)

34 practiced

Implement a Bash function 'retry_with_backoff' that accepts a command and retries it on failure using exponential backoff with full jitter. Parameters: max_attempts (default 5), base_delay_seconds (default 1). The function should print attempt number and delay, and return the last non-zero exit code if all attempts fail. Use only bash builtins and coreutils and make it safe for use in automation.

Sample Answer

Approach: use exponential backoff with full jitter (cap = base * 2^(attempt-1); delay = uniform random in [0, cap]). Scale bash's $RANDOM to milliseconds so the function can sleep fractional seconds using integer arithmetic only (fractional sleep is supported by GNU coreutils sleep). The function prints the attempt number and chosen delay, runs the command directly from an argument array rather than through eval so arguments are never re-parsed by the shell, and returns the last non-zero exit code if all attempts fail.

bash

retry_with_backoff() {
  # Usage: retry_with_backoff [--max-attempts N] [--base-delay-seconds S] -- cmd [args...]
  local max_attempts=5
  local base_delay_seconds=1
  local argv=()
  while [[ $# -gt 0 ]]; do
    case "$1" in
      --max-attempts) max_attempts="$2"; shift 2;;
      --base-delay-seconds) base_delay_seconds="$2"; shift 2;;
      --) shift; argv=("$@"); break;;
      *) argv+=("$1"); shift;;
    esac
  done

  if [[ ${#argv[@]} -eq 0 ]]; then
    printf '%s\n' "No command provided" >&2
    return 2
  fi

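  # Reject non-integer or non-positive parameters up front; bash arithmetic
  # would otherwise error out or misread values such as "0.5" or "08"
  if ! [[ $max_attempts =~ ^[1-9][0-9]*$ && $base_delay_seconds =~ ^[1-9][0-9]*$ ]]; then
    printf '%s\n' "max-attempts and base-delay-seconds must be positive integers" >&2
    return 2
  fi

  # Work in milliseconds to avoid floating arithmetic tools
  local base_ms=$(( base_delay_seconds * 1000 ))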

  local attempt=1
  local last_exit=0

  while (( attempt <= max_attempts )); do
    printf 'Attempt %d/%d: running: %s\n' "$attempt" "$max_attempts" "${argv[*]}"
    "${argv[@]}"
    last_exit=$?
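    if (( last_exit == 0 )); then
      return 0
    fi

    # Do not sleep after the final attempt; fall through to the failure report
    if (( attempt == max_attempts )); then
      break
    fi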

    # compute cap_ms = base_ms * 2^(attempt-1)
    local cap_ms=$(( base_ms * (1 << (attempt - 1)) ))

    # scale RANDOM (0..32767) into 0..cap_ms
    local rand=$RANDOM
    local delay_ms=$(( rand * cap_ms / 32767 ))

    # format as seconds.milliseconds for sleep and display
    local sec=$(( delay_ms / 1000 ))
    local msec=$(( delay_ms % 1000 ))
    local delay_str
    printf -v delay_str '%d.%03d' "$sec" "$msec"

    printf 'Attempt %d failed (exit %d). Sleeping %s seconds before retry.\n' "$attempt" "$last_exit" "$delay_str"
    sleep "$delay_str"

    attempt=$(( attempt + 1 ))
  done

  printf 'All %d attempts failed. Last exit code: %d\n' "$max_attempts" "$last_exit" >&2
  return $last_exit
}

Key points:
- Use array form ("${argv[@]}") to avoid word-splitting and shell injection; prefer calling as: retry_with_backoff -- mycmd arg1 arg2
- Full jitter: uniform random in [0, cap]
- Millisecond arithmetic uses only bash integers and $RANDOM; no external float tools
- Returns 0 on success or the last non-zero exit code on final failure

Edge cases:
- base_delay_seconds must be an integer; convert fractional values externally or adapt the function.
- Very large caps may overflow 64-bit integer arithmetic; clamp cap_ms to a maximum (e.g., 5 minutes) for automation safety.
- The command is executed directly from the argument array, never through eval, so arguments are passed verbatim. Callers should still avoid wrapping untrusted input in something like bash -c.
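The clamp suggested for the overflow edge case can be sketched as a small helper. The name backoff_cap_ms and its three-argument form are hypothetical and not part of the function above. It doubles the cap one step at a time and stops once the maximum is reached, so the value never grows large enough to overflow.

```bash
#!/usr/bin/env bash
# backoff_cap_ms ATTEMPT BASE_MS MAX_MS
# Prints min(BASE_MS * 2^(ATTEMPT-1), MAX_MS) without ever computing a value
# much larger than MAX_MS, so large attempt numbers cannot overflow.
backoff_cap_ms() {
  local attempt=$1 cap=$2 max=$3 i=1
  while [ "$i" -lt "$attempt" ] && [ "$cap" -lt "$max" ]; do
    cap=$(( cap * 2 ))
    i=$(( i + 1 ))
  done
  if [ "$cap" -gt "$max" ]; then
    cap=$max
  fi
  echo "$cap"
}

backoff_cap_ms 1 1000 300000    # prints 1000
backoff_cap_ms 3 1000 300000    # prints 4000
backoff_cap_ms 40 1000 300000   # prints 300000 (clamped to 5 minutes)
```

Inside retry_with_backoff, the cap_ms computation could call this helper in place of the shift expression; the jitter step that follows stays unchanged.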

