Cloud & Infrastructure Topics
Cloud platform services, infrastructure architecture, Infrastructure as Code, environment provisioning, and infrastructure operations. Covers cloud service selection, infrastructure provisioning patterns, container orchestration (Kubernetes), multi-cloud and hybrid architectures, infrastructure cost optimization, and cloud platform operations. For CI/CD pipeline and deployment automation, see DevOps & Release Engineering. For cloud security implementation, see Security Engineering & Operations. For data infrastructure design, see Data Engineering & Analytics Infrastructure.
Your SRE Background and Experience
Articulate your hands-on experience with systems administration, monitoring tools, automation scripts, and any incident response involvement. Be specific about technologies (e.g., Prometheus, Grafana, Kubernetes, Docker, Terraform) and concrete examples of what you've built or fixed.
Networking Fundamentals and Troubleshooting
Comprehensive coverage of core computer networking principles and the practical diagnostic and operational skills required to design, operate, and troubleshoot production systems. Fundamental concepts include the Open Systems Interconnection model layers, the Transmission Control Protocol and the Internet Protocol stack, the User Datagram Protocol, socket and port semantics, address notation and subnetting, Network Address Translation, Dynamic Host Configuration Protocol, and the Domain Name System resolution process. Infrastructure and architectural topics include switching and virtual local area networks, routing concepts and routing table behavior including Border Gateway Protocol basics, load balancing strategies and failure modes, firewall and access control, virtual private network technologies, and container and service network communication patterns. Diagnostic and tooling skills cover connectivity testing and path analysis, process and socket inspection, packet capture and analysis, and common command-line tools and utilities used for network investigation. Performance and reliability topics include latency, bandwidth and throughput, packet loss, congestion and congestion control, connection pooling, timeout and retry strategies, and approaches to optimization. Observability, monitoring, and security practices include collecting and interpreting network metrics, logs, and traces, using packet capture tools for root cause analysis, and understanding how network issues surface in distributed applications. At senior levels, expect discussion of network performance tuning, capacity planning, load balancer behavior at scale, and design decisions that affect system reliability and security.
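Subnetting questions often come down to quick arithmetic on a prefix. The following is a minimal sketch, assuming IPv4 and using Python's standard `ipaddress` module; the function names `describe_subnet` and `same_subnet` and the sample addresses are illustrative assumptions, not part of any specific tool.

```python
import ipaddress


def describe_subnet(cidr: str) -> dict:
    """Summarize the IPv4 network that contains the given address/prefix."""
    # strict=False lets us pass a host address such as 10.0.5.77/22
    # and still get the enclosing network back.
    net = ipaddress.ip_network(cidr, strict=False)
    # For prefixes shorter than /31, the network and broadcast addresses
    # are not assignable to hosts.
    if net.prefixlen < 31:
        usable = net.num_addresses - 2
    else:
        usable = net.num_addresses
    return {
        "network": str(net.network_address),
        "broadcast": str(net.broadcast_address),
        "netmask": str(net.netmask),
        "usable_hosts": usable,
    }


def same_subnet(addr_a: str, addr_b: str, prefix: int) -> bool:
    """Return True if both addresses fall inside the same /prefix network."""
    net = ipaddress.ip_network(f"{addr_a}/{prefix}", strict=False)
    return ipaddress.ip_address(addr_b) in net
```

For example, `describe_subnet("10.0.5.77/22")` reports the network as 10.0.4.0 with 1022 usable hosts, which is the kind of answer worth being able to derive by hand as well.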
Scalability and Systems Resource Management
Design and operational practices for managing compute and platform resources as systems scale. Covers autoscaling, resource pooling, orchestration, cost trade-offs between always-on and on-demand provisioning, and architectural choices that affect resource utilization and performance. Candidates should be prepared to discuss capacity planning for infrastructure, metrics and alerts for autoscaling, and cost versus performance decisions for high availability systems.
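To make the autoscaling discussion concrete, here is a minimal sketch of a proportional scaling rule, similar in spirit to the one documented for the Kubernetes Horizontal Pod Autoscaler. The function name, the default bounds, and the 10% tolerance are assumptions chosen for illustration; real autoscalers add stabilization windows and other safeguards that are not modeled here.

```python
import math


def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float,
                     min_replicas: int = 1,
                     max_replicas: int = 10,
                     tolerance: float = 0.1) -> int:
    """Scale the replica count in proportion to metric / target."""
    ratio = current_metric / target_metric
    # Skip scaling when the metric is close enough to the target,
    # which avoids flapping on small fluctuations.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds (assumed values above).
    return max(min_replicas, min(max_replicas, desired))
```

With 4 replicas at 90% average CPU and a 60% target, the rule asks for 6 replicas. The `max_replicas` bound is where the cost versus performance decision becomes explicit.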
Understanding the Company's Infrastructure Context
Research the company's public infrastructure information (engineering blog, tech talks, published case studies, job description). Understand what systems they operate at scale, what problems they likely face, and what your role would contribute.
Infrastructure Automation and Provisioning
Covers designing, implementing, and operating automated infrastructure provisioning and configuration using Infrastructure as Code practices and complementary automation patterns. Candidates should be able to select and author declarative infrastructure definitions with tools such as Terraform, CloudFormation, and Azure Resource Manager templates, and discuss configuration management tools such as Ansible, Puppet, or Chef. Core skills include modular and reusable code organization for multiple environments, variable and output management, remote state management and locking, idempotency and atomicity of operations, and version control integration for infrastructure artifacts. Candidates should understand testing and validation practices including linting, plan or dry-run validation, unit and integration testing of infrastructure changes, and drift detection and remediation. The topic includes strategies for safe changes and rollbacks, change coordination, error handling and recovery, and deployment patterns such as canary and blue-green where applicable. It also encompasses automation and orchestration patterns, immutable infrastructure and self-healing practices, autoscaling and scaling policies, automated patching and updates, secrets handling patterns using secret managers, and integrating observability and monitoring into automated workflows. Finally, candidates should be able to reason about trade-offs between imperative and declarative approaches, scaling Infrastructure as Code across large projects and teams, and security and compliance considerations for automated provisioning.
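The ideas of plan validation, idempotency, and drift detection can be illustrated with a toy model of a declarative tool. This is a sketch under stated assumptions: state is represented as plain dictionaries, and `plan_changes`, `apply_changes`, `has_drift`, and the resource names are hypothetical. It does not reflect the internals or API of Terraform or any other real tool.

```python
def plan_changes(desired: dict, actual: dict) -> dict:
    """Compute a dry-run plan: what must change to make actual match desired."""
    return {
        "create": {k: v for k, v in desired.items() if k not in actual},
        "update": {k: v for k, v in desired.items()
                   if k in actual and actual[k] != v},
        "delete": sorted(k for k in actual if k not in desired),
    }


def apply_changes(desired: dict, actual: dict) -> dict:
    """Return the new state after applying the plan; the input is not mutated."""
    plan = plan_changes(desired, actual)
    new_state = dict(actual)
    new_state.update(plan["create"])
    new_state.update(plan["update"])
    for name in plan["delete"]:
        del new_state[name]
    return new_state


def has_drift(desired: dict, actual: dict) -> bool:
    """Drift exists when the plan is non-empty."""
    return any(plan_changes(desired, actual).values())


# Hypothetical resources used only for illustration.
desired_state = {"vpc": {"cidr": "10.0.0.0/16"}, "web": {"size": "small"}}
actual_state = {"web": {"size": "large"}, "old-db": {"size": "small"}}
```

Applying once converges `actual_state` to `desired_state`, and applying again produces an empty plan. That second, empty plan is what idempotency means in practice.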
Team Infrastructure Challenges and Priorities
Understand the specific infrastructure problems the team is facing, current technical priorities, and the direction of ongoing projects. Topics include the team's roadmap, high-priority infrastructure improvements, common operational pain points, technical debt, team bandwidth constraints, and metrics for early success in the first six to twelve months. Candidates should be able to discuss likely trade-offs, propose pragmatic first steps, and show awareness of organizational and operational factors that affect infrastructure work.
Network Protocol Internals and Edge Cases
Deep understanding of how protocols work internally (TCP congestion control, IP fragmentation, DNS resolution under failure, IPv6 transition), edge cases that cause problems, and protocol troubleshooting.
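As one way to reason about TCP congestion control, the sketch below simulates a deliberately simplified, Reno-like congestion window measured in segments per round trip: slow start doubles the window up to the slow-start threshold, congestion avoidance then adds one segment per round trip, and a loss halves the window. The function name and parameters are assumptions for illustration; production stacks (for example those using CUBIC) behave differently, and timeouts are not modeled.

```python
def simulate_cwnd(rounds: int, loss_rounds: set, ssthresh: int = 16) -> list:
    """Return the congestion window at the start of each round trip."""
    cwnd = 1
    history = []
    for rnd in range(rounds):
        history.append(cwnd)
        if rnd in loss_rounds:
            # Multiplicative decrease: halve the window on loss
            # (floor of 2 segments) and resume from the new threshold.
            ssthresh = max(cwnd // 2, 2)
            cwnd = ssthresh
        elif cwnd < ssthresh:
            # Slow start: exponential growth, capped at the threshold.
            cwnd = min(cwnd * 2, ssthresh)
        else:
            # Congestion avoidance: additive increase of one segment.
            cwnd += 1
    return history
```

Calling `simulate_cwnd(10, {6}, ssthresh=8)` yields `[1, 2, 4, 8, 9, 10, 11, 5, 6, 7]`, which shows the characteristic sawtooth after the loss in round 6.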
Capacity Planning and Forecasting
Covers forecasting demand and planning infrastructure and platform capacity to meet expected business needs reliably and cost-effectively. Candidates should be able to analyze historical usage and growth trends, build and validate capacity models, define capacity metrics and thresholds, estimate headroom and safety margins, and translate business growth scenarios into procurement or cloud provisioning plans and timelines. Includes storage and compute lifecycle planning such as archiving and retention strategies, upgrade and rollout planning to avoid disruption, and trade-offs between overprovisioning and right-sizing. Also addresses design for scale and redundancy, autoscaling and elasticity patterns, load balancing and failover planning, capacity testing and stress testing, monitoring and alerting for capacity signals, and techniques to measure and improve forecast accuracy. Finally, it covers operational governance and decision-making including cross-team resource allocation, capacity reviews, cost optimization and budgeting, runbooks and change control, and alignment of capacity plans with service level objectives and business projections.
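A simple capacity model can be sketched as a linear trend fitted to historical usage, combined with a headroom threshold. The example below is a minimal illustration that assumes evenly spaced samples and linear growth; the function names and the 20% default headroom are assumptions, and real forecasts usually need to account for seasonality and step changes.

```python
def fit_linear_trend(samples: list) -> tuple:
    """Least-squares fit of usage against sample index; returns (slope, intercept)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def periods_until_threshold(samples: list, capacity: float,
                            headroom: float = 0.2):
    """Periods after the last sample until usage reaches capacity minus headroom.

    Returns None when usage is flat or shrinking, and 0.0 when the
    fitted trend has already crossed the threshold.
    """
    slope, intercept = fit_linear_trend(samples)
    if slope <= 0:
        return None
    threshold = capacity * (1 - headroom)
    crossing = (threshold - intercept) / slope
    return max(0.0, crossing - (len(samples) - 1))
```

With monthly usage of 100, 110, 120, and 130 units against a capacity of 200 and 20% headroom, the trend reaches the 160-unit threshold about three months after the last sample, which gives a concrete lead time for a procurement or provisioning plan.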
Linux System Administration
Linux-specific system administration and deep operating system topics. Areas include Linux kernel concepts, process lifecycle and signals, memory management and swap behavior in Linux, Linux file systems and permission models, boot processes and init systems such as systemd, package management and software installation, service management and system daemons, shell and scripting for automation and debugging, performance tuning and profiling, log management and diagnostic techniques, security and access control on Linux, and approaches to investigating and resolving systemic failures in Linux environments. At senior levels, candidates should demonstrate both operational competence and an understanding of internal mechanisms and trade-offs.
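Process lifecycle and signal handling often come up in the form of graceful shutdown, for example when a service manager stops a unit by sending SIGTERM. The following is a minimal sketch using Python's standard `signal` module; the class `GracefulShutdown` and the function `run_worker` are hypothetical names introduced for illustration.

```python
import signal


class GracefulShutdown:
    """Records termination requests so a main loop can exit cleanly."""

    def __init__(self):
        self.requested = False
        self.received = None

    def handle(self, signum, frame):
        # Keep the handler tiny: set a flag and let the main loop react.
        self.requested = True
        self.received = signal.Signals(signum).name

    def install(self, signals=(signal.SIGTERM, signal.SIGINT)):
        # signal.signal must be called from the main thread.
        for sig in signals:
            signal.signal(sig, self.handle)


def run_worker(shutdown, do_work, max_iterations=1000):
    """Run do_work until shutdown is requested; return completed iterations."""
    done = 0
    while not shutdown.requested and done < max_iterations:
        do_work()
        done += 1
    return done
```

In a real service, `install()` would be called once at startup and `run_worker` would finish its current unit of work before exiting. SIGKILL cannot be caught, so cleanup has to complete within whatever stop timeout the supervisor enforces.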