Apple Site Reliability Engineer (Mid-Level) Interview Preparation Guide 2026

Site Reliability Engineer (SRE)

Apple

Mid Level

7 rounds

Updated 6/15/2026

Apple's SRE interview process for mid-level candidates consists of a structured seven-round evaluation combining technical depth, system design capabilities, and cultural alignment. The process includes initial recruiter screening, two technical phone screens covering Linux systems and networking, and a full-day virtual onsite with four rounds assessing systems internals, SRE practices and observability, coding and automation, and system design. Behavioral and Apple-values assessments are integrated throughout the interview process. Based on recent interview data, the total timeline typically spans 4-8 weeks from application to offer.

Interview Rounds

Recruiter Screening

45 min · 4 focus topics · behavioral

What to Expect

This combined round includes the recruiter's initial contact and follow-up screening. The recruiter verifies your background, confirms interest in the SRE role, and assesses basic alignment with position requirements. Discussions cover your experience with system reliability, operations, relevant technical skills, and why you're interested in Apple specifically. This round also serves as a logistics coordination point: confirming timeline, discussing team structure, clarifying role expectations, and scheduling subsequent phone screens. Upon successful completion, the recruiter provides interview guidelines and technical phone screen logistics.

Tips & Advice

Be enthusiastic and specific about why this SRE role at Apple interests you. Prepare a clear narrative about your background: specific systems you've worked on, operational challenges you've solved, and concrete impact (e.g., 'I reduced incident response time by 40% through automation'). Have 2-3 detailed project examples ready. Ask informed questions showing you've researched Apple: mention specific products, reliability standards, or publicly known infrastructure challenges. Research what's publicly known about Apple's infrastructure and reliability requirements. Be professional but conversational. Confirm scheduling details and clarify timezone requirements. Show genuine enthusiasm for reliability engineering as a discipline.

Focus Topics

Apple's Reliability Standards & Products

Demonstrate understanding of why reliability is paramount at Apple: device ecosystem across hardware and software, user expectations, brand reputation. Show you've thought about how you'd contribute to Apple's high reliability standards. Mention any personal experience with Apple products or services.

Specific Projects & Measurable Impact

Prepare 2-3 detailed stories of projects you owned or significantly contributed to. For each: What was the initial state? What was the problem? What did you do? What was the measurable outcome? How did you mentor others? Why are you proud of this work?

Career Narrative & SRE Background

Clearly articulate your progression through SRE or operations roles with concrete examples: types of systems managed, scale handled (users, requests/second, data volume), and measurable impact. Connect your experience to Apple's reliability requirements. Explain what drew you to SRE and why you want to work at Apple specifically.

Technical Skills & Tech Stack Proficiency

Highlight core SRE competencies: Linux/systems administration depth, monitoring and observability expertise, incident response experience, automation capabilities, and relevant programming languages. Mention specific tools you've used: Prometheus, Kubernetes, Python, Go, Terraform. Be prepared to discuss why you chose certain tools or approaches.

Technical Phone Screen 1: Linux Systems & Troubleshooting

60 min · 5 focus topics · technical

What to Expect

This round tests systematic debugging methodology and deep Linux systems knowledge. The interviewer presents a complex system problem (such as SSH failing while console access still works, or services failing to start) and asks you to diagnose the root cause. You'll navigate the /proc filesystem, interpret system state, use diagnostic tools, and explain your reasoning at each step. The focus is on methodology and logical progression rather than immediately knowing the answer. Expect questions about process management, memory behavior, system calls, and performance analysis. Interviewers assess both technical depth and your approach to problem-solving under uncertainty.

Tips & Advice

Before the interview, ensure comfortable proficiency navigating a Linux system via SSH. Practice real troubleshooting on your own systems—set up problems deliberately and solve them. During the interview, ask clarifying questions about symptoms before diving into diagnosis. Walk through your systematic process: gather information, form hypotheses, test them iteratively, verify the fix. Use tools confidently: strace (system calls), lsof (open files/sockets), tcpdump (network packets), netstat/ss (connection state), vmstat (memory/CPU), iostat (disk I/O). Know /proc filesystem structure and what information each file contains. Think out loud so the interviewer understands your reasoning. If stuck, pivot and try a different angle—demonstrate flexibility. For mid-level candidates, interviewers expect methodical narrowing of problem space, not random command trials.

Focus Topics

/proc Filesystem Navigation & System State Inspection

Master the /proc filesystem: /proc/[pid]/ for process details (maps, fd, status), /proc/net/ for networking state, /proc/meminfo for memory status, /proc/stat for CPU metrics, /proc/loadavg for system load, /proc/interrupts for interrupt activity. Know what each file contains and how to interpret data for diagnosis.
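As a rough illustration of interpreting /proc/meminfo, the sketch below parses its "Key: value kB" lines into a dictionary and derives a used-memory percentage from MemTotal and MemAvailable. The helper names and the sample text are hypothetical; on a real host you would read the file itself.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: integer value}."""
    info = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, rest = line.split(":", 1)
        parts = rest.split()
        if parts:
            # Most fields are reported in kB; a few (e.g. HugePages_Total) are counts.
            info[key.strip()] = int(parts[0])
    return info


def used_percent(info):
    """Share of RAM not available for new allocations, based on MemAvailable."""
    total = info["MemTotal"]
    return 100.0 * (total - info["MemAvailable"]) / total


# Hypothetical sample; on a live system use open("/proc/meminfo").read().
SAMPLE = """\
MemTotal:       16000000 kB
MemFree:         1000000 kB
MemAvailable:    4000000 kB
SwapTotal:       2000000 kB
SwapFree:        2000000 kB
"""

info = parse_meminfo(SAMPLE)
print(f"used: {used_percent(info):.1f}%")
```

MemAvailable is used rather than MemFree because page cache that the kernel can reclaim is not really "used" from an application's point of view.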

System Performance Analysis & Bottleneck Identification

Analyze system performance using tools such as top/htop (real-time resource usage), vmstat (memory, CPU, and context switches), iostat (disk I/O patterns), load average interpretation, and perf (performance profiling). Identify bottlenecks: CPU-bound vs I/O-bound, memory pressure, disk saturation. Understand implications for reliability.
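One way to tell CPU-bound from I/O-bound load is to compare two samples of the aggregate "cpu" line in /proc/stat. The sketch below is a minimal version of that idea; the function names and the two sample lines are made up for illustration, and iowait should be treated as an approximate signal rather than a precise measurement.

```python
# Order of the first eight counters on the "cpu" line of /proc/stat.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Turn the aggregate 'cpu ...' line into a dict of jiffy counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line from /proc/stat")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    return dict(zip(FIELDS, values))


def cpu_breakdown(before, after):
    """Percentage of elapsed CPU time spent in each state between two samples."""
    delta = {k: after[k] - before[k] for k in FIELDS}
    total = sum(delta.values())
    if total == 0:
        return {k: 0.0 for k in FIELDS}
    return {k: 100.0 * v / total for k, v in delta.items()}


# Hypothetical samples taken a few seconds apart.
BEFORE = "cpu  1000 0 500 8000 500 0 0 0 0 0"
AFTER = "cpu  1100 0 550 8050 1300 0 0 0 0 0"

pct = cpu_breakdown(parse_cpu_line(BEFORE), parse_cpu_line(AFTER))
print({k: round(v, 1) for k, v in pct.items() if v})
```

In this made-up sample most of the interval is iowait with little user or system time, which would point toward storage rather than compute as the bottleneck.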

Process Management & Process Lifecycle

Understand process creation (fork, exec), process states (running, sleeping, zombie), process hierarchy, and signals. Know how to inspect process state via /proc/[pid]/, interpret ps output, understand memory and CPU usage per process, diagnose zombie processes. Know PID 1 (init/systemd) role and process supervision.
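To make zombie diagnosis concrete, here is a small sketch that reads the state and parent PID out of /proc/[pid]/stat lines. The command name sits in parentheses and may itself contain spaces or parentheses, so the parser splits on the last ")". The function names and sample lines are illustrative only.

```python
def parse_stat(stat_line):
    """Return (pid, comm, state, ppid) from a /proc/[pid]/stat line."""
    left = stat_line.index("(")
    right = stat_line.rindex(")")  # comm may contain ')' itself, so use the last one
    pid = int(stat_line[:left])
    comm = stat_line[left + 1:right]
    rest = stat_line[right + 1:].split()
    return pid, comm, rest[0], int(rest[1])


def find_zombies(stat_lines):
    """List (pid, comm, ppid) for every process in state Z (zombie)."""
    zombies = []
    for line in stat_lines:
        pid, comm, state, ppid = parse_stat(line)
        if state == "Z":
            zombies.append((pid, comm, ppid))
    return zombies


# Hypothetical, truncated stat lines; real lines carry many more fields.
SAMPLE = [
    "1 (systemd) S 0 1 1 0",
    "4000 (parent (v2)) S 1 4000 4000 0",
    "4242 (my worker) Z 4000 4000 4000 0",
]

print(find_zombies(SAMPLE))
```

The parent PID matters because a zombie only disappears once its parent reaps it with wait(), so the fix usually targets the parent rather than the zombie.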

Memory Management & Virtual Memory

Understand virtual address space, physical memory allocation, page tables, virtual-to-physical address translation, memory protection. Know Linux memory zones (DMA, DMA32, and Normal on 64-bit systems; HighMem only on 32-bit systems), memory caching, swapping/paging mechanics. Interpret /proc/meminfo, understand memory pressure and OOM (Out of Memory) killer behavior. Know memory-related issues: memory leaks, excessive swapping, OOM scenarios.

Systematic Linux Troubleshooting Methodology

Master a structured approach to diagnosing system issues: (1) clearly define what's wrong, (2) gather system state (logs, processes, network, disk, memory), (3) form hypotheses about root cause, (4) test hypotheses iteratively, (5) validate the fix doesn't break anything else. Know key diagnostic tools: strace (trace system calls), lsof (open files/network), tcpdump/Wireshark (packet inspection), ss/netstat (connections), vmstat/iostat (performance), top/htop (resource usage).

Technical Phone Screen 2: Networking & Protocols

60 min · 5 focus topics · technical

What to Expect

This round evaluates networking knowledge essential for distributed systems reliability. The interviewer conducts a deep dive into TCP/IP, DNS, HTTP/HTTPS, TLS, and load balancing. Expect questions like 'walk me through what happens when you access icloud.com' or 'explain TLS handshake and failure points.' You'll discuss protocol layers, network failure scenarios, debugging network issues, and how networking choices affect reliability. Unlike network engineers, SREs focus on reliability implications: how do network problems manifest in applications, how to detect them, how to mitigate them.

Tips & Advice

Review networking fundamentals with emphasis on practical implications for reliability. Understand the complete request path from client to server: DNS resolution, TCP connection establishment, TLS handshake, HTTP request/response. Know common networking failure modes and how they manifest: connection timeouts, DNS failures, packet loss, port exhaustion. Be comfortable with diagnostic tools: tcpdump/Wireshark (packet inspection), dig/nslookup (DNS), curl with verbose output, netstat/ss (connection state), mtr (route tracing). Understand load balancing strategies and their reliability tradeoffs. Discuss connection pooling, keep-alives, and retry strategies. For mid-level SREs, be able to think about how networking affects system reliability and give examples of network issues you've debugged. Practice explaining protocol behavior clearly.

Focus Topics

Load Balancing Strategies & Traffic Distribution

Understand load balancing algorithms: round-robin (fair distribution but ignores load), least connections (considers current connections), hash-based (consistent hashing for state affinity). Know Layer 4 (TCP) vs Layer 7 (application) load balancing tradeoffs. Understand health checking, failover mechanisms, sticky sessions. Know how load balancing choices affect reliability and performance.
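The hash-based option can be sketched as a consistent hash ring. This is a minimal illustration, not a production balancer: the class name, the use of SHA-256, and the choice of 100 virtual nodes per backend are all assumptions made for the example.

```python
import bisect
import hashlib


class HashRing:
    """Consistent hash ring with virtual nodes (illustrative sketch)."""

    def __init__(self, nodes, vnodes=100):
        ring = []
        for node in nodes:
            for i in range(vnodes):
                ring.append((self._hash(f"{node}#{i}"), node))
        ring.sort()
        self._ring = ring
        self._keys = [h for h, _ in ring]

    @staticmethod
    def _hash(value):
        # First 8 bytes of SHA-256 as an integer position on the ring.
        return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")

    def node_for(self, key):
        """Pick the first virtual node clockwise from the key's position."""
        idx = bisect.bisect(self._keys, self._hash(key)) % len(self._keys)
        return self._ring[idx][1]


full = HashRing(["a", "b", "c"])
reduced = HashRing(["a", "b"])  # backend "c" removed
keys = [f"user-{i}" for i in range(200)]
moved = sum(1 for k in keys if full.node_for(k) != reduced.node_for(k))
print(f"{moved} of {len(keys)} keys moved after removing one backend")
```

The reliability-relevant property is that removing a backend only remaps the keys that backend owned; keys on the surviving backends stay put, which limits cache churn and session loss during failover.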

Network Troubleshooting & Diagnostic Tools

Master networking diagnostic tools: tcpdump/Wireshark for packet capture and analysis, dig/nslookup for DNS queries, curl with verbose output for HTTP debugging, netstat/ss for connection state inspection, traceroute/mtr for routing analysis, iperf for throughput testing. Know how to capture and interpret network traces.

DNS Resolution & Service Discovery Reliability

Understand DNS protocol (recursive vs authoritative queries), query types (A, AAAA, CNAME, MX, SRV), caching and TTL implications, DNS propagation timing. Know how DNS failures impact service availability and how they cascade. Understand common DNS issues: resolution timeouts, NXDOMAIN responses, cache inconsistencies, split-brain scenarios.

TCP/IP Fundamentals & Connection Reliability

Understand TCP three-way handshake, connection establishment, connection states (SYN-SENT, ESTABLISHED, TIME-WAIT), sequence numbers and acknowledgments, retransmission logic, congestion control (window sizing), and timeouts. Know UDP characteristics and when each is appropriate. Understand connection failure modes and diagnosis. Know about socket backlog and listen queue effects on reliability.
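Connection states can be inspected without extra tooling by reading /proc/net/tcp, where the fourth column is a hex state code. The sketch below counts connections per state from text in that format; the sample rows are shortened and hypothetical, and the state names follow the kernel's underscore spelling (SYN_SENT, TIME_WAIT).

```python
from collections import Counter

# Hex state codes as used in /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_states(text):
    """Count sockets per TCP state from /proc/net/tcp-style text."""
    counts = Counter()
    for line in text.splitlines()[1:]:  # first line is the column header
        fields = line.split()
        if len(fields) < 4:
            continue
        counts[TCP_STATES.get(fields[3], fields[3])] += 1
    return counts


# Hypothetical, shortened rows; real rows have more columns.
SAMPLE = """\
  sl  local_address rem_address   st tx_queue rx_queue
   0: 00000000:0016 00000000:0000 0A 00000000:00000000
   1: 0100007F:1F90 0100007F:C350 01 00000000:00000000
   2: 0100007F:C350 0100007F:1F90 06 00000000:00000000
   3: 0100007F:C351 0100007F:1F90 06 00000000:00000000
"""

print(dict(count_states(SAMPLE)))
```

A large and growing TIME_WAIT or CLOSE_WAIT count is the kind of signal worth explaining in an interview: the former often relates to short-lived outbound connections, the latter to an application that is not closing sockets.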

HTTPS/TLS Security & Connection Handling

Understand TLS handshake (ClientHello, ServerHello, key exchange, finished), certificate validation, mutual TLS (mTLS). Know cipher suites and their selection. Understand common TLS issues: certificate expiration, hostname mismatch, weak ciphers, TLS version incompatibility. Know how TLS impacts latency and performance. Understand TLS session resumption.

Onsite Round 1: Systems Internals Deep Dive

75 min · 5 focus topics · technical

What to Expect

This first onsite round (typically virtual for mid-level candidates) dives deep into Linux kernel concepts and complex system behavior. The interviewer presents multi-layered system problems requiring understanding of kernel internals, advanced memory management, process scheduling, and I/O subsystems. You may diagnose complex system hangs, optimize performance under resource constraints, or explain unusual system behavior. Interviewers repeatedly ask 'why' to test understanding of underlying mechanisms, not surface-level knowledge. Expect discussions of kernel tuning, performance implications of different configurations, and tradeoffs in system design.

Tips & Advice

This round goes significantly deeper than phone screens. Review Linux kernel architecture and internals thoroughly. Understand process scheduling algorithms, memory management mechanisms (paging, segmentation, virtual memory), and I/O subsystems in detail. Be prepared for 'why' questions: Why does the kernel make certain design decisions? What are the tradeoffs? Prepare to explain complex scenarios: what happens when system memory is exhausted, how the kernel handles I/O under extreme load, how process scheduling ensures fairness. Practice explaining technical concepts clearly with analogies or diagrams when helpful. For mid-level, interviewers expect understanding of tradeoffs and design principles, not just facts. Bring specific examples: kernel tuning you've performed, performance issues you've diagnosed and solved, reliability improvements from system configuration changes. Be ready to discuss how kernel behavior affects application reliability.

Focus Topics

System Performance Tuning & Kernel Parameters

Know kernel tuning parameters (sysctl): network buffers, TCP timeouts, memory swappiness, process scheduling. Understand performance profiling tools: perf for CPU profiling, flame graphs for visualization, kernel tracing (tracepoints, kprobes). Know when and how to apply tuning for specific workloads. Understand tradeoffs: latency vs throughput, memory usage vs performance.

I/O Subsystem & Storage Reliability

Understand I/O scheduler algorithms: the legacy single-queue schedulers (CFQ, or Completely Fair Queueing; deadline; noop) and the multi-queue schedulers that replaced them in Linux 5.0 and later (mq-deadline, BFQ, Kyber, none). Also understand disk buffering and writeback caches, fsync and O_DIRECT semantics, RAID reliability, and filesystem journaling. Know how I/O errors are handled and reported. Understand implications for data reliability. Know performance characteristics of different I/O patterns.

Advanced Memory Management & Kernel Memory Subsystem

Understand page tables and virtual address translation, memory protection through page table entries, copy-on-write (CoW) optimization, memory reclamation and page eviction, swap mechanics and its performance implications. Know Linux memory pressure handling including kswapd (kernel swapper daemon) and OOM killer. Understand memory fragmentation and its effects. Know kernel memory accounting and cgroup memory limits.

Process Scheduling & CPU Management

Understand the Linux process scheduler: run queues per CPU, scheduling algorithms (CFS, the Completely Fair Scheduler, for normal processes, replaced by EEVDF as of Linux 6.6; real-time scheduling classes), context switching overhead, CPU affinity and NUMA considerations. Know how to interpret scheduler metrics (load average, context switches, runnable queue length). Understand scheduling classes and priority levels. Know how to diagnose CPU-bound system issues.

Linux Kernel Architecture & Core Subsystems

Understand kernel organization: process management subsystem, memory management (virtual memory, paging, segmentation), interrupt handling and exceptions, device drivers interface, filesystem abstraction. Know kernel space vs user space, system call interface, and how applications interact with kernel. Understand kernel protection mechanisms preventing user applications from directly accessing hardware.

Onsite Round 2: SRE Practices & Observability

60 min · 5 focus topics · behavioral|technical

What to Expect

This round evaluates your understanding of core SRE principles, operational practices, and observability architecture. The interviewer discusses monitoring strategy, defining and managing SLOs/SLIs/error budgets, incident response processes, automation priorities, and toil reduction. You'll answer questions like 'How do you measure if a system is reliable?', 'What would you monitor for a new service?', or 'Walk me through your incident response process.' This round includes significant behavioral assessment: collaboration during incidents, communication style, how you approach operational excellence, and your philosophy on reliability. For mid-level, emphasis is on end-to-end ownership: designing observable systems, establishing appropriate SLOs, and leading incident response.

Tips & Advice

Prepare concrete examples: monitoring you've designed and why you chose those metrics, SLOs you've established and how you justified them, incidents you've handled and lessons learned. Be ready to discuss tradeoffs: monitoring overhead vs observability value, alert sensitivity vs alert fatigue, SLO strictness vs development velocity. Understand SRE philosophy: reliability with velocity, using error budgets intelligently to make tradeoff decisions, automating toil. Know the four golden signals (latency, traffic, errors, saturation) and how to apply them. Be prepared to discuss specific observability tools (Prometheus, DataDog, Splunk, ELK) but focus on concepts over implementation details. Discuss automation examples: deployments you've automated, operational tasks you've eliminated, processes you've streamlined. For mid-level, interviewers want strategic thinking about operations: how to scale systems, systematically improve reliability, empower team members. Share examples of mentoring junior team members on SRE practices.

Focus Topics

Toil Identification & Automation Prioritization

Understand toil: repetitive, manual, unrewarding tasks that don't add long-term value. Know how to identify toil in your operations, quantify its impact (hours/week), and prioritize automation efforts. Understand common automation targets: deployments, autoscaling, backup/recovery, health checks. Know infrastructure-as-code and configuration management approaches. Understand ROI of automation: development cost vs time saved.

Observability Tools & Metrics Collection Strategies

Understand industry-standard tools: Prometheus (time-series metrics), ELK/Splunk (logging and analysis), Jaeger/Zipkin (distributed tracing). Know push vs pull metrics collection models, time-series database concepts, query languages (PromQL). Understand performance implications of different observability approaches: collection overhead, storage requirements, query latency. Know cost-benefit tradeoffs of different observability solutions.

Incident Response & Postmortem Culture

Understand incident classification (severity levels), escalation procedures, incident communication, incident command structure. Know effective postmortem processes: document what happened, root cause analysis (not blame), identify systemic improvements, track action items. Understand blameless culture principles and psychological safety in incident reviews. Know how to prevent similar incidents through systemic fixes, not individual blame.

Monitoring, Alerting & Observability Architecture Design

Design comprehensive monitoring: identify key metrics (four golden signals: latency, traffic, errors, saturation), instrument systems appropriately, define meaningful alerts, establish alert routing and escalation. Understand tracing for distributed request paths. Understand logging for detailed investigation. Design for observability: avoid blind spots in monitoring, ensure metrics are actionable, prevent alert fatigue through intelligent alerting.

Service Level Objectives (SLOs), SLIs & Error Budgets

Understand SLO definition: specific, measurable objectives tied to business requirements (e.g., '99.9% availability monthly'). Distinguish between SLOs and SLIs (Service Level Indicators—actual measurements). Know error budget concept: if SLO is 99.9%, you have 0.1% error budget (failures allowed). Use error budgets for tradeoff decisions between reliability investment and feature development. Understand SLO implications on engineering priorities and resource allocation.
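The error budget arithmetic is worth being able to do on the spot. A minimal sketch, assuming a time-based availability SLO over a 30-day window (the function names are illustrative):

```python
def error_budget_minutes(slo_percent, window_days=30):
    """Allowed downtime in minutes for a time-based availability SLO."""
    return (1 - slo_percent / 100) * window_days * 24 * 60


def budget_remaining(slo_percent, downtime_minutes, window_days=30):
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo_percent, window_days)
    return (budget - downtime_minutes) / budget


print(f"99.9% over 30 days: {error_budget_minutes(99.9):.1f} min")
print(f"99.99% over 30 days: {error_budget_minutes(99.99):.2f} min")
print(f"remaining after 10.8 min down: {budget_remaining(99.9, 10.8):.0%}")
```

So 99.9% monthly availability allows roughly 43 minutes of downtime, and each additional nine shrinks the budget tenfold. That is the number to bring into a conversation about whether a risky launch can proceed.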

Onsite Round 3: Coding & Automation

75 min · 5 focus topics · technical

What to Expect

This round combines algorithm problem-solving with SRE-relevant practical scenarios. Expect one standard coding problem (LeetCode Easy to Medium difficulty, often involving data structures like trees or graphs) and/or SRE-specific challenges like log parsing/aggregation, implementing a monitoring system, or automating operational tasks. The focus is on coding proficiency, debugging ability, and ability to write clean, maintainable code. Unlike software engineer interviews, emphasis is less on optimal algorithmic complexity and more on correctness, clarity, practical applicability, and production-readiness.

Tips & Advice

Review LeetCode focusing on tree and graph problems (BFS/DFS). Practice in Python or Go (common SRE languages at Apple). During the interview, clarify requirements before coding, talk through your approach, and write clean, readable code. Test your solution with examples including edge cases. For SRE-specific problems, think about real operational scenarios: handling incomplete data, network timeouts, rate limiting. Discuss tradeoffs: performance vs readability, quick-and-dirty vs production-ready code. For mid-level, write production-quality code and discuss testing, error handling, and monitoring of your own code. Know basic debugging: print statements, logging, understanding error messages. Be comfortable with standard library functions in your chosen language.

Focus Topics

Python/Go & SRE-Relevant Language Proficiency

Build strong proficiency in your primary SRE language (likely Python or Go at Apple). Know the standard library and common packages for everyday tasks: in Python, for example, urllib (or the third-party requests package) for HTTP, json for data handling, subprocess for system interaction, and built-in file I/O. Understand language-specific idioms and best practices. Know performance characteristics and limitations of the language.

Debugging & Systematic Problem-Solving

Demonstrate systematic debugging: identify the problem clearly, isolate the cause, form hypotheses, test them iteratively, validate the fix. Be comfortable with print debugging, understanding error messages and stack traces. Know when to use debuggers vs other approaches. Understand common bugs: off-by-one errors, null pointer dereferences, resource leaks.

Algorithm Implementation & Data Structures Proficiency

Master common data structures (arrays, linked lists, binary trees, graphs, hash tables) and their operations. Implement basic algorithms (sorting, searching, BFS/DFS, tree traversal). Understand time and space complexity implications. Write implementations that are correct, clear, and reasonably efficient. Know when to use different data structures based on use case.
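A typical warm-up in this area is breadth-first search on a graph. The sketch below finds a shortest path by hop count in a directed graph stored as an adjacency dict; the service names in the sample graph are invented for the example.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Return a shortest path (fewest hops) from start to goal, or None."""
    if start == goal:
        return [start]
    parent = {start: None}  # doubles as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt in parent:
                continue
            parent[nxt] = node
            if nxt == goal:
                # Walk parent links back to the start, then reverse.
                path = [goal]
                while parent[path[-1]] is not None:
                    path.append(parent[path[-1]])
                return path[::-1]
            queue.append(nxt)
    return None


# Hypothetical call graph between services.
GRAPH = {
    "lb": ["web1", "web2"],
    "web1": ["cache"],
    "web2": ["db"],
    "cache": ["db"],
}

print(shortest_path(GRAPH, "lb", "db"))
print(shortest_path(GRAPH, "db", "lb"))
```

Time and space are both O(V + E). Using a deque keeps each dequeue O(1), which is the kind of detail interviewers tend to probe.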

Practical SRE Scenarios & Operational Scripting

Ability to solve real SRE problems: parsing and aggregating logs to extract metrics, implementing health checks, writing deployment scripts, automating data processing, rate limiting implementations. Know how to handle common issues: file handling errors, network timeouts, retries with backoff. Write scripts that handle partial failures gracefully.
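A log-aggregation task of the kind described here might look like the sketch below. It assumes a made-up space-separated format (timestamp, method, path, status, latency) and counts malformed lines instead of crashing on them, which is the "partial failure" behavior interviewers usually look for.

```python
from collections import defaultdict


def summarize(lines):
    """Aggregate request and 5xx counts per path; count lines that fail to parse."""
    stats = defaultdict(lambda: {"requests": 0, "errors": 0})
    skipped = 0
    for line in lines:
        parts = line.split()
        try:
            path, status = parts[2], int(parts[3])
        except (IndexError, ValueError):
            skipped += 1  # truncated or corrupt line: record it and move on
            continue
        entry = stats[path]
        entry["requests"] += 1
        if status >= 500:
            entry["errors"] += 1
    return dict(stats), skipped


# Hypothetical log lines; the last one is truncated mid-write.
LOG = [
    "2026-06-15T12:00:01Z GET /api/items 200 0.012",
    "2026-06-15T12:00:02Z GET /api/items 503 1.500",
    "2026-06-15T12:00:02Z POST /api/orders 201 0.040",
    "2026-06-15T12:00:03Z GET /api/items 200 0.010",
    "2026-06-15T12:00:03Z GET /api/it",
]

stats, skipped = summarize(LOG)
for path, s in sorted(stats.items()):
    print(f"{path}: {s['errors']}/{s['requests']} errors")
print(f"skipped {skipped} malformed line(s)")
```

Reporting the skipped count matters: silently dropping bad lines can hide exactly the failure you are investigating.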

Production Code Quality & Maintainability

Write code that is correct, readable, and maintainable: meaningful variable and function names, appropriate comments, error handling for failure cases, edge case consideration, input validation. Write code that others can understand and modify. Think about testing: how would this code be tested? Write code defensively against invalid inputs or unexpected conditions.

Onsite Round 4: System Design

75 min · 5 focus topics · technical

What to Expect

This final onsite round evaluates your ability to design scalable, reliable distributed systems. You'll receive an open-ended design problem (e.g., 'Design a system like GitHub handling repositories, pull requests, and merging for scale' or 'Design a reliable task queue') and discuss the entire architecture. Cover system components, data flow, consistency models, failure handling, monitoring, deployment strategy, and tradeoffs. For mid-level SREs, the unique focus is operational and reliability aspects alongside scalability: How is this system deployed? How is it monitored? How does it recover from failures? What's the disaster recovery strategy? Unlike software engineers who focus on correctness and scalability, mid-level SREs emphasize operability.

Tips & Advice

Prepare by reviewing system design principles: scalability (horizontal vs vertical scaling tradeoffs), consistency models (strong vs eventual consistency), availability and partition tolerance (CAP theorem). Know common architectural patterns: microservices, database replication strategies, load balancing, caching layers, queue-based architectures. Practice structured approach: clarify requirements and constraints, sketch high-level architecture, discuss key components, address failure modes, consider operational aspects. For mid-level SREs, emphasize operational considerations: deployment strategy and rollback procedures, comprehensive monitoring and alerting, incident response and recovery procedures, graceful degradation under failures, limiting blast radius of failures. Discuss how the system would be deployed, monitored, recovered from disaster scenarios. Draw diagrams clearly and explain tradeoffs thoughtfully. Think about end-to-end ownership: a system you'd be responsible for supporting in production.

Focus Topics

Operational Complexity & Deployment Strategy

Think critically about operational burden: how many moving parts, complexity of running and updating the system, dependency management, configuration complexity. Design for operational simplicity where possible: fewer components, clearer dependencies, simpler deployment. Discuss deployment strategy: blue-green deployments, canary releases, rollback procedures, infrastructure-as-code. Discuss how you'd monitor deployments and quickly detect issues.

Data Storage, Consistency & Persistence

Choose appropriate database types (relational, NoSQL, time-series) for different data patterns. Understand consistency models (strong/immediate vs eventual consistency) and their tradeoffs. Discuss replication strategies (master-slave, multi-master), backup and recovery, disaster recovery procedures. Know transaction semantics and their reliability implications. Discuss data durability guarantees.

Scalable System Architecture & Core Components

Design principles for scalability: load balancing strategies, horizontal scaling of stateless services, database scaling (replication, sharding), caching layers (reducing load on databases), asynchronous processing via queues, CDN for static content. Know component interactions, data flow patterns, consistency tradeoffs (immediate vs eventual consistency). Discuss why you chose specific architectural patterns for your use case.

Reliability Through Redundancy & Failure Handling

Design for failures: redundancy (multiple instances, geographic distribution), circuit breakers (preventing cascading failures), retries with exponential backoff, bulkheads (isolating failure blast radius), graceful degradation (reduced functionality under partial failures). Identify critical paths and single points of failure. Discuss failure recovery strategies and system behavior under partial degradation. Know timeout and retry semantics.
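To show how a circuit breaker prevents cascading failures, here is a deliberately small sketch. The class name, thresholds, and injectable clock are assumptions for the example; real implementations usually add per-endpoint state, metrics, and limits on concurrent trial calls.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is rejected immediately."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self._failure_threshold = failure_threshold
        self._reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, let this one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self._failure_threshold:
                self._opened_at = self._clock()
            raise
        self._failures = 0
        self._opened_at = None
        return result


def flaky_dependency():
    raise ConnectionError("dependency down")  # hypothetical failing call


fake_now = [0.0]
breaker = CircuitBreaker(failure_threshold=2, reset_timeout=10, clock=lambda: fake_now[0])
for _ in range(3):
    try:
        breaker.call(flaky_dependency)
    except (ConnectionError, CircuitOpenError) as exc:
        print(type(exc).__name__)
```

After the threshold is reached the third call is rejected without touching the dependency, which is what keeps a struggling backend from being hammered by retries.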

Observability & Monitoring Architecture in System Design

Design systems with observability built in: identify instrumentation points, define key metrics (four golden signals: latency, traffic, errors, saturation), design health checks, plan for alert generation. Discuss distributed tracing across components for request path visibility. Design for operational visibility: structured logging, metrics aggregation, alerting and escalation. Discuss how you'd diagnose common failure modes in this system. Design runbooks for common operational tasks.

Frequently Asked Site Reliability Engineer (SRE) Interview Questions

Incident Management and Response · Hard · Technical

As the SRE lead after a multi-day major incident that caused partial data loss, explain how you would organize the post-incident process to balance rapid learning with careful evidence preservation. Discuss what to publish internally vs externally, how to redact sensitive details, how to prioritize remediation vs compensation, and how to ensure actions are tracked to completion.

Sample Answer

Situation: A multi-day major incident caused partial data loss affecting customers and internal confidence.

Approach (organize the post-incident process to learn fast while preserving evidence):

1. Immediate evidence preservation (first 24–48h)
- Freeze implicated systems (read-only snapshots), preserve logs, DB backups, audit trails, and config versions; capture volatile memory if needed.
- Record chain-of-custody: who accessed what, when, and why; store hashes of artifacts.
- Spin up a gated “forensics” workspace so engineers can experiment without altering preserved artifacts.

2. Rapid learning loop (first 72h)
- Run a focused internal blameless timeline workshop: timeline, hypotheses, quick experiments against copies only.
- Produce two artifacts in parallel: (A) internal detailed technical timeline + raw evidence (access-controlled) and (B) an internal-summary postmortem for broader engineering/org distribution that omits sensitive artifacts.

3. Publish strategy
- Internally publish: full postmortem with detailed timeline, root-cause analysis, supporting logs and queries, remediation plan, and RCA evidence links (restricted access to security/leadership/engineers on a need-to-know basis).
- Externally publish: a concise public postmortem that states impact, root cause at a high level, mitigation steps taken, customer remediation/compensation policy, and steps to prevent recurrence. Keep technical depth sufficient for trust but omit internal identifiers and sensitive implementation details.

4. Redaction & privacy
- Use automated scripts to redact PII, API keys, customer IDs, internal hostnames, and stack traces that reveal account data. Replace values with deterministic placeholders (e.g., <CUSTOMER_ID_x>) and keep a private mapping in a secure vault for audits.
- Have a security reviewer (infosec/legal) sign off on any public content.
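The deterministic-placeholder idea can be sketched as follows. The regular expressions, the "cust-NNN" ID format, and the class name are assumptions for illustration; a real redaction script would cover many more patterns and would persist the mapping in a secure store rather than in memory.

```python
import re
from collections import Counter

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
CUSTOMER_RE = re.compile(r"\bcust-\d+\b")  # hypothetical customer ID format


class Redactor:
    """Replace sensitive values with stable placeholders and remember the mapping."""

    def __init__(self):
        self.mapping = {}         # placeholder -> original value (keep private)
        self._assigned = {}       # (kind, original value) -> placeholder
        self._counts = Counter()  # next index per kind

    def _placeholder(self, kind, value):
        key = (kind, value)
        if key not in self._assigned:
            self._counts[kind] += 1
            placeholder = f"<{kind}_{self._counts[kind]}>"
            self._assigned[key] = placeholder
            self.mapping[placeholder] = value
        return self._assigned[key]

    def redact(self, text):
        text = EMAIL_RE.sub(lambda m: self._placeholder("EMAIL", m.group()), text)
        return CUSTOMER_RE.sub(lambda m: self._placeholder("CUSTOMER_ID", m.group()), text)


redactor = Redactor()
line = "cust-1042 (alice@example.com) lost rows; cust-1042 retried; cust-77 ok"
print(redactor.redact(line))
```

Because the same value always maps to the same placeholder, readers of the public postmortem can still follow one customer through the timeline without learning who it was.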

5. Prioritize remediation vs compensation
- Triage remediation into: Safety (stop-gap fixes to prevent recurrence), Recovery (restore data where possible), and Long-term engineering fixes.
- Run a risk/cost matrix: if remediation prevents further customer harm quickly, prioritize it over compensation. If recovery is impossible or costly and customers were materially impacted, parallelize compensation/credit plans immediately while engineering works on recovery.
- Example: If a partial backup corruption can be scoped and restored for 60% of accounts in 48h, allocate a focused restoration squad while payments/credits are prepared for affected customers.

6. Track to completion
- Convert every action into tracked tickets (Jira/Docs) with clear owner, SLA, priority, acceptance criteria, and verification steps. Use RACI for cross-functional tasks (SRE owner, Dev support, Legal/PM stakeholder).
- Weekly executive dashboard: status, blockers, and verification evidence. Require sign-off on remediation with evidence (tests, chaos/rollback exercises).
- Close the loop: run a follow-up review 30/90 days to verify fixes held, update runbooks, and add SLO adjustments.

Outcome/learning:- Balances urgency with forensic integrity, preserves trust via transparent external communication, mitigates legal/privacy risks through redaction and sign-off, and ensures concrete, tracked remediation until verification and closure.
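The redaction step (item 4) can be sketched as follows. This is a minimal illustration, not a vetted redaction tool: the regular expressions, the example.com hostname suffix, and the `redact` helper are all hypothetical, and a real pattern set would need security review. Placeholders are derived with a keyed hash so the same value always maps to the same placeholder, and the placeholder-to-value mapping is returned for private storage.

```python
import hashlib
import hmac
import re

# Hypothetical patterns for illustration; real ones need review by security.
PATTERNS = {
    "CUSTOMER_ID": re.compile(r"\bcust-\d{6}\b"),
    "API_KEY": re.compile(r"\bsk_[A-Za-z0-9]{16,}\b"),
    "HOST": re.compile(r"\b[a-z0-9-]+\.internal\.example\.com\b"),
}

def redact(text, secret, mapping):
    """Replace sensitive values with deterministic placeholders.

    mapping is updated in place (placeholder -> original value) and is meant
    to be stored privately, e.g. in a vault, for later audits.
    """
    def substitute(kind, match):
        value = match.group(0)
        # Keyed hash: stable per value, not reversible without the secret.
        digest = hmac.new(secret, value.encode(), hashlib.sha256).hexdigest()[:8]
        placeholder = f"<{kind}_{digest}>"
        mapping[placeholder] = value
        return placeholder

    for kind, pattern in PATTERNS.items():
        text = pattern.sub(lambda m, k=kind: substitute(k, m), text)
    return text
```

Because placeholders are deterministic, a reader of the public postmortem can still follow one customer or host through the timeline without learning its identity.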

Automation and Scripting | Hard | System Design

81 practiced

Architect a safe multi-service orchestration system that coordinates deployments across multiple regions. Requirements: support region-level canaries, dependency ordering between services, resilience to partial failures, idempotent orchestration steps, and safe rollback without causing region-wide outages. Describe control plane, agents, state model, and failure handling.

Sample Answer

Requirements clarified:
- Region-level canaries: deploy subset per region, measure metrics, promote or halt.
- Dependency ordering: DAG of service dependencies, per-region orchestration respecting order.
- Resilience & idempotency: retries, checkpoints, and safe rollbacks that avoid region-wide outage.

High-level architecture:
- Control Plane (central): API, Orchestrator service, Coordinator, Policy Engine, Auditor, Metrics Evaluator.
- Agents (regional): Lightweight workers running in each region that execute steps, report state, and can perform traffic-shift, health-checks, and rollback actions.
- Storage & Messaging: Strongly-consistent metadata store (etcd/Consul) for orchestration state and leader election; durable task queue (Kafka/SQS) per region for commands; time-series DB for metrics (Prometheus/TSDB).

Control Plane responsibilities:
- Accept deployment plan (version, target regions, canary config, dependency DAG).
- Expand DAG into per-region execution graphs respecting cross-service dependencies.
- Coordinate canary windows: instruct agents to deploy canary subset, collect metrics, run automated SLO checks via Metrics Evaluator, then decide (promote, pause, rollback).
- Maintain global view and enforce safety policies (circuit-breakers, concurrency limits).

Agents:
- Pull tasks from regional queue; execute idempotent primitives (pull image, apply k8s rollout, traffic-shift weights, run smoke tests).
- Use local health probes and report back status and telemetry.
- Implement step-level idempotency with operation tokens and compare-and-set semantics against control-plane state.

State model:
- Orchestration is a finite-state machine (FSM) persisted in etcd: Plan -> RegionState{Pending, Canary, Verifying, Promoted, RolledBack, Failed}.
- Each step has idempotent command record: {operation_id, desired_state, precondition_hash, result}.
- Dependency edges annotated with ordering constraints (hard/soft) and parallelism limits.
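The state model and the operation-token idea can be sketched as below. This is an in-memory illustration only: `StateStore` stands in for a strongly consistent store such as etcd, and the `TRANSITIONS` table is an assumption about which moves between the named region states are legal, not something the design above specifies.

```python
from dataclasses import dataclass, field

# Assumed legal transitions between the region states named in the text.
TRANSITIONS = {
    "Pending": {"Canary", "Failed"},
    "Canary": {"Verifying", "RolledBack", "Failed"},
    "Verifying": {"Promoted", "RolledBack", "Failed"},
    "Promoted": {"RolledBack"},
    "RolledBack": set(),
    "Failed": {"RolledBack"},
}

@dataclass
class RegionRecord:
    state: str = "Pending"
    version: int = 0                                  # bumped on every successful write
    applied_ops: dict = field(default_factory=dict)   # operation_id -> recorded result

class StateStore:
    """In-memory stand-in for a strongly consistent metadata store."""

    def __init__(self):
        self.regions = {}

    def get(self, region):
        return self.regions.setdefault(region, RegionRecord())

    def transition(self, region, operation_id, expected_version, new_state):
        rec = self.get(region)
        if operation_id in rec.applied_ops:
            return rec.applied_ops[operation_id]   # replayed step: return recorded result
        if rec.version != expected_version:
            return "conflict"                      # compare-and-set failed; caller re-reads
        if new_state not in TRANSITIONS[rec.state]:
            return "invalid"
        rec.state = new_state
        rec.version += 1
        rec.applied_ops[operation_id] = "ok"
        return "ok"
```

An agent that retries a step after a timeout sends the same operation_id, so the second call returns the recorded result instead of applying the transition twice.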

Failure handling and rollback:
- Failure types:
  - Agent transient failure: retry with exponential backoff; control-plane reassigns tasks to other healthy agent replicas in region.
  - Region-wide health degradation: Policy Engine trips region circuit-breaker -> freeze promotion in that region, optionally roll back partial changes there only.
  - Cross-region cascading risk: if automated SLO checks fail in canary region, stop promotion globally and initiate targeted rollback in affected regions only.
- Rollback safety:
  - Rollbacks are scoped to lowest affected scope (instance -> AZ -> region) using dependency-aware reverse DAG to avoid global outage.
  - Rollback actions are also idempotent and staged: first revert traffic (shift away), then revert deployments, then verify.
  - Gradual rollback prevents region-wide outages by throttling concurrent rollback actions and using health gates.
- Human-in-loop: automatic safeties require explicit approval for high-risk operations; emergency abort API exists.
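The dependency-aware ordering for deploys and rollbacks can be illustrated with the standard library's graphlib module. This is a sketch under the assumption that dependencies are given as a mapping from each service to the services it depends on; the function names and service names are hypothetical.

```python
from graphlib import TopologicalSorter

def deploy_order(dependencies):
    """Return services with every dependency listed before its dependents.

    dependencies maps service -> set of services it depends on.
    Raises graphlib.CycleError if the graph is not a DAG.
    """
    return list(TopologicalSorter(dependencies).static_order())

def rollback_order(dependencies, deployed):
    """Reverse of deploy order, limited to services that were actually deployed,
    so dependents are reverted before the services they rely on."""
    return [s for s in reversed(deploy_order(dependencies)) if s in deployed]

deps = {"api": {"auth"}, "auth": {"db"}, "db": set()}
print(deploy_order(deps))                     # dependencies first
print(rollback_order(deps, {"db", "auth"}))   # only what was deployed, reversed
```

Restricting the rollback to the deployed set keeps a partial failure from touching services that never changed.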

Observability & testing:
- End-to-end tracing of orchestration steps, audit logs, realtime dashboards.
- Chaos tests and simulated partial-failure drills validate behavior.

Trade-offs:
- Strong consistency (etcd) simplifies correctness at the cost of latency; acceptable in orchestration control plane.
- Durability + idempotency complexity is justified for safety.

This design provides dependency-aware, region-scoped canaries and rollbacks that tolerate partial failures, with idempotent steps and failure isolation limited to the affected scope.

Deployment and Release Strategies | Hard | Technical

139 practiced

Implement a simplified canary analysis evaluator in Python. Given two numeric time-series arrays (control and canary) representing error rates per minute over a 30-minute window, write a function that computes whether the canary is significantly worse than control using a bootstrap or permutation test (outline pseudo-code and complexity). Assume arrays of equal length.

Sample Answer

Approach: Use a permutation (randomization) or bootstrap test to estimate whether the canary error rate is significantly larger than control. Compute the observed difference in means (canary_mean - control_mean). Under the null (no difference), repeatedly shuffle labels (permutation) or resample with replacement (bootstrap) to build the null distribution and compute a one-sided p-value. Return decision at alpha (e.g., 0.05) plus effect size and CI.

python

import numpy as np

def canary_significance(control, canary, method='permutation', n_iter=10000, alpha=0.05, seed=None):
    """
    control, canary: 1D numeric arrays of equal length (e.g., 30)
    method: 'permutation' or 'bootstrap'
    n_iter: number of resamples
    Returns: dict with p_value, reject (bool), observed_diff, ci (95%), effect_size (Cohen's d)
    """
    rng = np.random.default_rng(seed)
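    control = np.asarray(control, dtype=float)
    canary = np.asarray(canary, dtype=float)
    # Raise instead of assert so validation still runs under python -O
    if control.ndim != 1 or control.shape != canary.shape:
        raise ValueError("control and canary must be 1D arrays of equal length")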
    obs_diff = canary.mean() - control.mean()  # one-sided: is canary worse (higher)?
    pooled = np.concatenate([control, canary])

    null_diffs = np.empty(n_iter)
    n = len(control)

    if method == 'permutation':
        for i in range(n_iter):
            rng.shuffle(pooled)
            null_diffs[i] = pooled[n:].mean() - pooled[:n].mean()
    elif method == 'bootstrap':
        for i in range(n_iter):
            res_c = rng.choice(control, size=n, replace=True)
            res_k = rng.choice(canary, size=n, replace=True)
            null_diffs[i] = res_k.mean() - res_c.mean()
        # For bootstrap, to test null of no diff, center distribution:
        null_diffs -= null_diffs.mean()
    else:
        raise ValueError("method must be 'permutation' or 'bootstrap'")

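    # one-sided p-value: fraction of null diffs >= observed diff, with the usual
    # add-one correction so a finite number of resamples never reports exactly 0
    p_value = (np.sum(null_diffs >= obs_diff) + 1) / (n_iter + 1)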
    reject = p_value < alpha

    # 95% bootstrap CI for the difference in means (using bootstrap resampling)
    # reuse bootstrap idea regardless of method for CI
    boot_diffs = np.empty(n_iter)
    for i in range(n_iter):
        res_c = rng.choice(control, size=n, replace=True)
        res_k = rng.choice(canary, size=n, replace=True)
        boot_diffs[i] = res_k.mean() - res_c.mean()
    # Fixed 95% interval so the result matches the 'ci_95' key regardless of alpha
    ci_lower, ci_upper = np.percentile(boot_diffs, [2.5, 97.5])

    # Cohen's d for effect size (pooled SD)
    pooled_sd = np.sqrt(((control.var(ddof=1) + canary.var(ddof=1)) / 2))
    effect_size = (canary.mean() - control.mean()) / (pooled_sd + 1e-12)

    return {
        'p_value': float(p_value),
        'reject': bool(reject),
        'observed_diff': float(obs_diff),
        'ci_95': (float(ci_lower), float(ci_upper)),
        'effect_size_cohens_d': float(effect_size),
        'method': method,
        'n_iter': n_iter
    }

Key points:
- The permutation test is valid under exchangeability; with n_iter random shuffles it is a Monte Carlo approximation of the exact test. The bootstrap estimates the sampling distribution.
- Use a one-sided test because we care whether the canary is worse (higher error rate).
- Report p-value, CI, and effect size for operational decisions.

Time & Space Complexity:
- Time: O(n_iter * n), where n is the series length (30); practical for n_iter=1e4.
- Space: O(n_iter) for the null/boot arrays, plus O(n) for the pooled sample.

Edge cases:
- Very small n (<=5): tests have low power.
- Non-independent minutes (autocorrelation): both tests assume independence; consider a block bootstrap if autocorrelation is present.
- Heavy-tailed distributions: increase n_iter or use a robust summary (median) and adapt the test accordingly.

Alternatives/Extensions:
- Use medians or a rank-based test (Wilcoxon rank-sum / Mann-Whitney U) if the data are non-normal.
- Use a block bootstrap to respect temporal correlation.
- Bayesian approach to estimate the probability that the canary rate exceeds control by a threshold.
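The block bootstrap mentioned above can be sketched with the standard library alone. This is a minimal moving-block variant under two assumptions that are not in the original answer: minutes in the two series are aligned, so the same time blocks are drawn from both, and a block length of 5 minutes is a reasonable default. The function name and parameters are illustrative.

```python
import random

def block_bootstrap_ci(control, canary, block_len=5, n_iter=2000, seed=None):
    """95% percentile interval for mean(canary) - mean(control).

    Resamples contiguous blocks of minutes so short-range autocorrelation
    inside each block is preserved.
    """
    rng = random.Random(seed)
    n = len(control)
    if len(canary) != n or not 1 <= block_len <= n:
        raise ValueError("series must have equal length and 1 <= block_len <= n")
    n_blocks = -(-n // block_len)                  # ceil(n / block_len)
    diffs = []
    for _ in range(n_iter):
        idx = []
        for _ in range(n_blocks):
            start = rng.randrange(n - block_len + 1)
            idx.extend(range(start, start + block_len))
        idx = idx[:n]                              # trim the last block to length n
        diffs.append(sum(canary[i] - control[i] for i in idx) / n)
    diffs.sort()
    lower = diffs[int(0.025 * (n_iter - 1))]
    upper = diffs[int(0.975 * (n_iter - 1))]
    return lower, upper
```

If the interval lies entirely above zero, the canary's error rate is higher than control's even after accounting for correlated minutes; an interval that straddles zero argues for extending the canary window rather than promoting.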

Error Handling and Code Quality | Medium | Technical

105 practiced

Write a robust Bash deployment script that performs a blue-green switch on Nginx upstreams. Requirements: use set -euo pipefail, validate input parameters, backup current configuration to a timestamped file, test the new config with nginx -t before switching, reload Nginx atomically, and rollback to the backup on failure. Include a trap to clean temporary files on exit.

Sample Answer

Approach: create a safe, idempotent Bash script that validates inputs, backs up the current upstream file to a timestamped copy, stages the new config in a temporary directory, atomically swaps in the new file, tests with nginx -t, reloads nginx, and rolls back on any failure. It uses set -euo pipefail and traps to clean temp files and handle errors.

bash

#!/usr/bin/env bash
set -euo pipefail

usage() {
  cat <<EOF
Usage: $0 <existing_upstream_conf> <new_upstream_conf>
Example: $0 /etc/nginx/conf.d/upstream.conf ./green_upstream.conf
EOF
  exit 2
}

# Validate args
if [ "${#}" -ne 2 ]; then usage; fi
EXISTING="${1}"
NEW="${2}"

if [ ! -f "$EXISTING" ]; then echo "ERROR: existing conf not found: $EXISTING" >&2; exit 1; fi
if [ ! -f "$NEW" ]; then echo "ERROR: new conf not found: $NEW" >&2; exit 1; fi
if ! command -v nginx >/dev/null 2>&1; then echo "ERROR: nginx not installed" >&2; exit 1; fi

TS=$(date +"%Y%m%dT%H%M%S")
BACKUP="${EXISTING}.bak.${TS}"
# Use a private name: TMPDIR is read by mktemp and other tools, so do not overwrite it
WORKDIR=$(mktemp -d)
TMP_NEW="${WORKDIR}/new.conf"

cleanup() {
  rm -rf "$WORKDIR" || true
}
trap cleanup EXIT

rollback() {
  trap - ERR  # do not re-enter the handler if the rollback itself fails
  echo "Rolling back to backup ${BACKUP}" >&2
  if [ -f "$BACKUP" ]; then
    cp -- "$BACKUP" "$EXISTING"
    nginx -t && nginx -s reload
  else
    echo "No backup to rollback to" >&2
  fi
}
trap 'echo "Error encountered" >&2; rollback; exit 1' ERR

# Backup current config
cp -- "$EXISTING" "$BACKUP"
echo "Backed up $EXISTING -> $BACKUP"

# Stage a private copy of the new config so later edits to $NEW cannot race with the switch
cp -- "$NEW" "$TMP_NEW"

# nginx -t validates the whole configuration tree, so the candidate file must be in place
# before it can be tested. Copy it next to the target (same filesystem), then rename it over
# the live file; the rename is atomic, and the backup above allows rollback if the test fails.
cp -- "$TMP_NEW" "${EXISTING}.tmp"
mv -T -- "${EXISTING}.tmp" "$EXISTING"

# Test nginx config
if nginx -t; then
  # Reload nginx gracefully
  nginx -s reload
  echo "Switched upstream and reloaded nginx successfully."
else
  echo "nginx -t failed after applying new config" >&2
  false  # trigger ERR trap -> rollback
fi

# Success: remove backups older than 7 days (optional)
find "$(dirname "$BACKUP")" -maxdepth 1 -type f -name "$(basename "$EXISTING").bak.*" -mtime +7 -delete || true

Key points:
- set -euo pipefail for safer failure modes.
- Validates input files and nginx presence.
- Uses a timestamped backup so rollback can restore the exact previous state.
- Runs nginx -t against the swapped-in file and reloads only if the test passes; a failed test triggers rollback.
- ERR trap performs rollback; EXIT trap cleans temp files.
- Atomic replace uses mv -T (GNU coreutils) so nginx never reads a partially written file.

Edge cases: run with sufficient privileges (typically root); on systemd hosts, systemctl reload nginx can replace nginx -s reload; and the nginx include layout must allow swapping this single upstream file.

Monitoring Tools and Observability | Easy | Technical

89 practiced

Explain the difference between head-based and tail-based sampling for distributed tracing. Provide one scenario where tail-based sampling is strongly preferred, and one where head-based sampling is acceptable.

Linux Process and Service Management | Easy | Technical

18 practiced

The /proc filesystem contains runtime state about processes and the kernel. For a PID you suspect of leaking resources, list which /proc files you would inspect (for example cmdline, environ, status, fd, io, limits, smaps) and explain what each file reveals and how you would use it to diagnose the problem.

Sample Answer

For investigating a PID that may be leaking resources, I’d inspect these /proc entries and use them as follows:

- /proc/<pid>/cmdline: shows the exact invoked command and args (NUL-separated). Confirms the binary and runtime flags; helps match the process to a service and reproduce locally.

- /proc/<pid>/environ: environment variables (PATH, LD_PRELOAD, memory tunables). Useful for detecting unusual env settings or injected libraries.

- /proc/<pid>/status: human-readable summary with VmRSS, VmSize, VmSwap, Threads, voluntary/involuntary context switches, and FDSize (allocated file-descriptor slots, not the open count). Quick snapshot of memory and thread count; count the entries under fd/ for the actual number of open FDs.

- /proc/<pid>/fd/ (symlinks): lists file descriptors and their targets (sockets, files, pipes). Identify leaked FDs, long-lived sockets, or deleted files still held open.

- /proc/<pid>/io: cumulative I/O counters (rchar, wchar, syscr, syscw, read_bytes, write_bytes). Use to spot abnormal read/write patterns; swap usage shows up as VmSwap in status rather than here.

- /proc/<pid>/limits: process resource limits, such as Max open files (RLIMIT_NOFILE) and Max address space (RLIMIT_AS). Check whether limits are low and causing failures, or set so high that they allow runaway growth.

- /proc/<pid>/smaps (or smaps_rollup): detailed per-VMA memory mappings with Rss, Pss, and private/shared dirty pages. Pinpoints which anonymous or mapped regions consume memory and whether memory is shared or truly private (leak).

- /proc/<pid>/maps: address-to-file mappings; correlate with smaps to find which libraries or mmaps correspond to large regions.

- /proc/<pid>/stack and /proc/<pid>/task/*/stack: kernel stacks to see sleeping/hung threads (typically readable only by root).

How I use them together: start with status+limits+cmdline to scope; check fd and maps/smaps to locate leaked handles or large anonymous allocations; use io and status counters over time (polling) to see trends; examine environ and fd targets for resource misconfiguration; if smaps shows growing private anonymous pages, suspect heap leak; if many sockets/FDs persist, trace code path holding descriptors. Combine with lsof, strace or heap profilers for root cause.
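The polling step can be sketched as below. It assumes a Linux host and permission to read the target's /proc entries; `parse_status` and `snapshot` are illustrative helper names, while VmRSS and Threads are real fields of /proc/<pid>/status.

```python
import os

def parse_status(text):
    """Parse the text of /proc/<pid>/status into a dict of field -> raw value."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields

def snapshot(pid):
    """Return (VmRSS in kB, thread count, open FD count) for a PID on Linux."""
    with open(f"/proc/{pid}/status", encoding="utf-8") as f:
        status = parse_status(f.read())
    rss_kb = int(status.get("VmRSS", "0 kB").split()[0])   # kernel threads have no VmRSS
    threads = int(status.get("Threads", "0"))
    open_fds = len(os.listdir(f"/proc/{pid}/fd"))          # actual open descriptors
    return rss_kb, threads, open_fds
```

Calling snapshot(pid) every few seconds and logging the tuple gives a trend line: a steadily rising FD count points to a descriptor leak, while rising VmRSS with a flat FD count points toward memory.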

Database Selection and Trade Offs | Medium | Technical

36 practiced

Explain how you would architect a highly available PostgreSQL deployment for a write-heavy OLTP workload (5k writes/s) with RPO <= 1 minute. Cover primary/standby topology, synchronous vs asynchronous replication trade-offs, failover automation tooling, monitoring alerts, and backup/point-in-time recovery strategy to meet RTO/RPO.

Sample Answer

Requirements & constraints:
- Write-heavy OLTP ~5k writes/s (sustained), RPO ≤ 1 minute, high availability with automated failover and short RTO (minutes).
- Priorities: durability (RPO), availability (fast failover), performance (IO/CPU), operational safety (no split-brain).

High-level topology:
- Single primary (writer) + 2 synchronous-capable standbys in different AZs + 2 or more asynchronous replicas.
  - Synchronous standbys (require 1 acknowledgement, or a majority): at least one standby must acknowledge WAL before a commit returns, so acknowledged commits survive loss of the primary and the RPO ≤ 1 min target is met with margin.
  - Asynchronous replicas for read scaling, backups, and analytics, so they add no commit latency on the writer.

Replication choices & trade-offs:
- Synchronous replication:
  - Pros: lower RPO (commits are acknowledged by a standby), safer for durability.
  - Cons: higher commit latency (depends on network round trip and standby flush/apply speed); throughput drops if the standby is slow.
  - Mitigation: use synchronous_commit = on (or remote_apply when reads on the standby must see every acknowledged commit), place sync standbys on low-latency links, and set wal_sender_timeout / wal_receiver_timeout so a hung standby is detected quickly.
- Asynchronous replication:
  - Pros: no commit latency impact; good for scaling reads and backups.
  - Cons: potential WAL lag and therefore an RPO gap; cannot guarantee < 1 minute without additional measures.
- Recommended: on the primary, require an ACK from one synchronous standby per commit (synchronous_commit = on with synchronous_standby_names = 'FIRST 1 (<node1>, <node2>)'). Keep additional async replicas.

Failure detection & automated failover:
- Use Patroni (with etcd/Consul/ZooKeeper) or repmgr plus a distributed consensus store for leader election and fencing. Patroni is a common choice for automated failover, leader election, and reconfiguration.
- Ensure automatic failover is safe:
  - Use quorum-based election (etcd/Consul) to avoid split-brain.
  - Configure fencing so an isolated old primary cannot keep accepting writes during a network partition; use maintenance mode for planned work.
  - Prefer promoting a synchronous standby, since it holds every acknowledged commit; synchronous_standby_names controls which standbys are synchronous.
- Promote only after the standby has replayed all the WAL it has received, so no acknowledged transaction is lost.

Service routing & RTO:
- Use a virtual IP managed by keepalived or an external TCP load balancer/HAProxy that points to the current primary; or update DNS with short TTL plus automation that updates records on failover.
- Aim RTO: automated failover + load balancer switch < 1–2 minutes. Test regularly.

Backup & PITR to meet RPO/RTO:
- Continuous WAL archiving + periodic base backups:
  - Use pgbackrest or wal-g for reliable base backups and WAL shipping to S3 (or object store).
  - Configure archive_command to push WAL segments immediately; use compression.
- PITR strategy:
  - Retain WAL segments sufficient to restore to any point within retention window (e.g., 7–30 days), but for RPO use replication for recent recovery.
  - To meet RPO ≤ 1 min, rely on sync standby for last-minute durability; WAL archive provides longer-term recovery and recovery from logical errors.
- Regularly test restores and PITR; maintain scripts and runbooks.

Configuration & tuning for 5k writes/s:
- Storage/IO: provision NVMe or very fast disks with predictable IOPS. Monitor WAL write throughput and disk latency.
- Postgres settings:
  - wal_level = replica
  - max_wal_senders >= number of replicas
  - wal_compression = on (if CPU allows)
  - checkpoint_timeout / max_wal_size tuned to avoid heavy checkpoint spikes; raise checkpoint_completion_target and tune the background writer (bgwriter_* settings) to smooth IO.
  - synchronous_commit = on (or remote_apply) for durability
  - use replication slots for async replicas, but monitor slot lag to avoid unbounded WAL growth on disk.
- Connection pooling (PgBouncer) to reduce backend connection pressure.

Monitoring & alerts:
- Key metrics and alert thresholds:
  - replication/WAL lag (bytes and time): alert if > 30s, critical if > 60s.
  - commit latency (p99): alert if it rises beyond baseline.
  - disk latency (read/write ms): alert on sustained high latency (>10–20ms depending on storage).
  - CPU, memory, swap usage.
  - checkpoint duration & frequency (frequent long checkpoints).
  - long-running transactions and open transactions > threshold.
  - WAL archive failures, backlog, and archive latency.
  - replication slot size growth.
  - failover/promotion events and node unreachable.
- Tools: Prometheus with a PostgreSQL exporter (e.g., postgres_exporter) for metrics, Grafana dashboards, Alertmanager for dedup/notification. Integrate runbooks into alerts.
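The lag thresholds above can be expressed as a small evaluation rule. This is only a sketch: the `sustained` parameter (how many consecutive samples must breach before alerting) is an added assumption to avoid flapping, and in practice this logic would usually live in the alerting system's rule language rather than in application code.

```python
def lag_alert(samples, warn_after=30.0, critical_after=60.0, sustained=3):
    """Return 'ok', 'warning', or 'critical' from replication lag samples (seconds).

    Only the most recent `sustained` samples are considered, and every one of
    them must exceed a threshold for that level to fire.
    """
    recent = samples[-sustained:]
    if len(recent) < sustained:
        return "ok"                      # not enough data to call it sustained
    floor = min(recent)
    if floor > critical_after:
        return "critical"
    if floor > warn_after:
        return "warning"
    return "ok"
```

Using the minimum of the recent window means a single spike does not page anyone, while lag that stays above 60 seconds, and therefore threatens the one-minute RPO, escalates to critical.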

Recovery & runbook:
- Automated failover: Patroni promotes sync standby; load balancer updates endpoints. If automatic fails (partition, no quorum), manual intervention steps.
- Post-failover checks: validate replication resume, verify replica apply lag 0, run smoke tests, check data integrity.
- Restore from backup/PITR: use pgbackrest restore or base backup + restore WAL to chosen timestamp. For RTO targets longer than failover window, document expected restore times and test.

Operational practices:
- Regular failure drills (chaos testing), backup/restore drills, and capacity testing for 5k writes/s.
- Maintain runbooks for failover, rollback, and emergency PITR.
- Keep maintenance windows for major upgrades; use logical replication for zero-downtime schema changes where possible.

Summary:
- Use at least one synchronous standby in a separate AZ to meet RPO ≤ 1 minute, with async replicas for scale and backups.
- Automate failover with Patroni + etcd and route traffic via VIP/load balancer.
- Continuous WAL archiving + pgbackrest/wal-g for PITR; tune WAL/config for high write throughput.
- Monitor WAL lag, commit latency, disk I/O, and replication health; run frequent drills to ensure RTO/RPO targets are met.

Incident Management and Response | Medium | Technical

69 practiced

Write a Python 3 script that reads a newline-delimited log file where each line is a JSON object with keys: "timestamp" (ISO 8601), "service", "level", "message". The script should output per-minute error counts (level == "ERROR") for a given service over the last 60 minutes, printing lines like: 2025-03-12 14:05 3. The log may be out-of-order and can be large (~10GB): prioritize streaming and bounded memory.

Sample Answer

Approach:
- Stream the file line-by-line, parse JSON, filter by service and level.
- Normalize timestamps to UTC minute buckets (truncate seconds).
- Maintain a dict of counts keyed by minute; keep only the last 60 minutes relative to the maximum timestamp seen so far to bound memory.
- After processing, print per-minute counts for the 60-minute window in chronological order, including zero counts.

python

#!/usr/bin/env python3
import sys
import json
from collections import defaultdict
from datetime import datetime, timedelta, timezone

WINDOW_MINUTES = 60

def parse_iso8601(s):
    # Accept e.g. "2025-03-12T14:05:23Z" or with offset
    if s.endswith("Z"):
        s = s[:-1] + "+00:00"
    return datetime.fromisoformat(s)

def minute_bucket(dt):
    # Ensure timezone-aware (convert naive to UTC)
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    dt = dt.astimezone(timezone.utc)
    return dt.replace(second=0, microsecond=0)

def purge_old(counts, newest_minute):
    cutoff = newest_minute - timedelta(minutes=WINDOW_MINUTES - 1)
    # remove keys older than cutoff
    to_delete = [k for k in counts if k < cutoff]
    for k in to_delete:
        del counts[k]

def main(log_path, target_service):
    counts = defaultdict(int)
    newest_minute = None

    with open(log_path, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
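            try:
                obj = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip malformed lines
            if not isinstance(obj, dict):
                continue  # valid JSON, but not an object
            if obj.get("service") != target_service:
                continue
            if obj.get("level") != "ERROR":
                continue
            ts = obj.get("timestamp")
            if not isinstance(ts, str):
                continue
            try:
                dt = parse_iso8601(ts)
            except ValueError:
                continue  # skip unparseable timestamps
            mb = minute_bucket(dt)
            if newest_minute is not None:
                cutoff = newest_minute - timedelta(minutes=WINDOW_MINUTES - 1)
                if mb < cutoff:
                    continue  # late event outside the window; dropping it keeps the dict bounded
            counts[mb] += 1
            if (newest_minute is None) or (mb > newest_minute):
                newest_minute = mb
                purge_old(counts, newest_minute)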

    if newest_minute is None:
        # No matching entries; print last 60 minutes relative to now
        newest_minute = minute_bucket(datetime.now(timezone.utc))

    start = newest_minute - timedelta(minutes=WINDOW_MINUTES - 1)
    cur = start
    while cur <= newest_minute:
        print(cur.strftime("%Y-%m-%d %H:%M"), counts.get(cur, 0))
        cur += timedelta(minutes=1)

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: script.py /path/to/logfile service_name", file=sys.stderr)
        sys.exit(2)
    main(sys.argv[1], sys.argv[2])

Key points:
- Streaming parse keeps memory bounded; counts dict holds at most ~60 keys.
- Handles out-of-order logs by tracking the newest minute seen and purging older buckets.
- Uses timezone-aware UTC minute buckets to avoid DST/offset issues.

Complexity:
- Time: O(N) for N lines.
- Space: O(WINDOW_MINUTES) (bounded ~60 keys).

Edge cases:
- Malformed JSON or missing fields are skipped.
- If no matching ERRORs appear, prints 60 minutes ending at current UTC minute.
- If timestamps are from different timezones, they’re normalized to UTC.

Automation and Scripting | Hard | Technical

77 practiced

Explain safe database schema migration strategies suitable for automated deployments: expand-contract patterns, feature flags, online schema change tools, blue-green approaches, and how to coordinate application rollout with schema changes to avoid downtime and incompatible reads/writes.

Sample Answer

Requirements & constraints:
- Zero or minimal downtime, no incompatible reads/writes during rollout, automated deployments, ability to roll back, support for long-lived transactions and replicas.

Safe patterns and tools (high-level):

1) Expand–Contract pattern (recommended)
- Phase 1 (Expand): deploy additive, backward-compatible schema changes (new columns, tables, indexes) and deploy application code that writes both old and new formats, or writes only the new fields while still reading the old ones.
- Phase 2 (Migrate data): backfill/populate new columns asynchronously (online jobs with rate limits and idempotency).
- Phase 3 (Switch reads): update the app to read from the new schema exclusively (toggle via config/feature flag).
- Phase 4 (Contract): once traffic is stable and no reads or writes touch the old fields, remove legacy columns/indexes in a safe deploy.

Why: additive steps avoid breaking old code, and the pattern separates schema change risk from code change risk.

2) Feature flags & phased rollout
- Use feature flags to toggle app behavior (reads/writes) per-service, per-region, or per-instance. Start with a small % (canary), monitor error/latency/SLOs, then ramp.
- Flags decouple deployment of code from the behavioral switch and allow immediate rollback without DB schema operations.

3) Online schema change tools
- Use tools and DDL forms that avoid long blocking locks: pt-online-schema-change and gh-ost (MySQL); pg_repack and CREATE INDEX CONCURRENTLY (Postgres).
- Check that the chosen tool handles foreign keys and constraints the way you need (gh-ost, for example, does not support foreign keys), uses low-impact locks, supports primary key changes via shadow table + cutover, and has pause/abort mechanisms.
- Monitor replication lag, table-level locks, and disk use for shadow tables.

4) Blue-Green / Shadow Traffic approaches
- Deploy the new app version pointing to new schema endpoints in a green environment, and send a subset of traffic (or shadow copies) to validate behavior without affecting production state.
- For writes, use dual-write or publish-subscribe to ensure both schemas see changes; verify idempotency and eventual consistency.
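The percentage ramp described under feature flags can be sketched with a deterministic hash bucket. This is an illustration, not a particular flag library's API: the function name, the flag name, and the choice of 100 buckets are assumptions.

```python
import hashlib

def in_rollout(flag_name, unit_id, percent):
    """Return True if unit_id falls inside the first `percent` of 100 buckets.

    The bucket depends only on (flag_name, unit_id), so a unit that is enabled
    at 10% stays enabled as the percentage is ramped up.
    """
    digest = hashlib.sha256(f"{flag_name}:{unit_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent

# Example: decide whether this instance should read from the new column.
use_new_reads = in_rollout("read_new_column", "instance-42", 10)
```

Hashing the flag name together with the unit keeps different flags from always selecting the same canary population, and setting percent to 0 acts as the immediate rollback switch.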

Coordination & automation best practices
- Automate migration as part of the CI/CD pipeline with explicit migration stages: pre-checks, dry-run, canary, monitor, cutover, cleanup.
- Pre-deployment checks: verify the schema compatibility matrix, estimate table size and migration duration, ensure backups and a fast rollback plan.
- Use migration orchestration: keep schema and code in lock step via migration versioning (e.g., Flyway/Liquibase) and deployment orchestration that gates app rollout until the DB expand phase completes.
- Observability: instrument migrations with metrics (replication lag, DDL duration, row-copy rate, error rates) and create runbooks for abort/rollback.
- Limits and throttling: enforce rate limits on backfill to avoid saturation; keep transactions small to avoid long locks.
- Handle long-lived transactions: schedule maintenance windows for problematic migrations or use application-level migration that tolerates a mixed schema.

Common pitfalls & mitigations
- Dropping columns too early → verify zero usage via metrics/tracing before contract.
- Relying on dual-writes without ensuring idempotency → ensure dedupe and reconciliation jobs.
- High disk usage from shadow tables → estimate and throttle.
- Replication lag causing stale reads → monitor and slow backfill if lag increases.

Example concise flow
1. Add nullable column + index (expand) via online tool.
2. Deploy app version A that can write both old and new fields (dual-write flag off by default), then enable dual-write once the deploy is stable.
3. Backfill historical rows with an asynchronous job.
4. Flip feature flag to read new column in canary fleet; monitor.
5. Ramp reads/writes to new column, disable old-path feature flag.
6. Remove old column in a final safe migration (contract).

This approach minimizes coupling between schema and code, enables fast rollback via feature flags, and uses online tools and observability to avoid downtime while automating deployments.
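The backfill step in the flow above can be sketched as a batched, re-runnable job. SQLite from the standard library is used here only so the example is self-contained; the users table and the legacy_name and display_name columns are hypothetical, and a production job would target the real database with its own throttling signals such as replication lag.

```python
import sqlite3
import time

def backfill(conn, batch_size=500, pause_s=0.0):
    """Copy legacy_name into display_name in small batches; safe to re-run.

    Only rows whose new column is still NULL are touched, so restarting the
    job after a crash does not redo or corrupt finished work.
    """
    total = 0
    while True:
        cur = conn.execute(
            "UPDATE users SET display_name = legacy_name "
            "WHERE id IN (SELECT id FROM users "
            "             WHERE display_name IS NULL AND legacy_name IS NOT NULL "
            "             LIMIT ?)",
            (batch_size,),
        )
        conn.commit()                 # short transactions keep lock time low
        if cur.rowcount == 0:
            return total
        total += cur.rowcount
        time.sleep(pause_s)           # crude rate limit between batches
```

The `legacy_name IS NOT NULL` filter matters: without it, rows with no source value would be selected on every pass and the loop would never finish.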

Deployment and Release Strategies | Easy | Technical

98 practiced

Explain the 'recreate' deployment strategy and compare it to rolling updates. Provide examples of when recreate might still be used and its implications for availability and complexity.

