Cloud & Infrastructure Topics
Cloud platform services, infrastructure architecture, Infrastructure as Code, environment provisioning, and infrastructure operations. Covers cloud service selection, infrastructure provisioning patterns, container orchestration (Kubernetes), multi-cloud and hybrid architectures, infrastructure cost optimization, and cloud platform operations. For CI/CD pipelines and deployment automation, see DevOps & Release Engineering. For cloud security implementation, see Security Engineering & Operations. For data infrastructure design, see Data Engineering & Analytics Infrastructure.
Infrastructure Strategy and Platform Decisions
Focuses on making technical infrastructure and platform choices that account for business impact and organizational factors. Topics include build-versus-buy trade-offs, vendor and platform evaluation, scalability and reliability considerations, migration and deprecation planning for legacy systems, total cost of ownership, developer productivity impact, organizational readiness, and stakeholder involvement. Candidates should show how to structure these decisions, evaluate technical and non-technical risks, and communicate a clear rationale and implementation plan.
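Total cost of ownership is often the first quantitative input to a build-versus-buy decision. The following is a minimal sketch, assuming an undiscounted model in which each option has a one-time cost, a recurring run cost, and ongoing engineering effort. The `Option` fields, the hourly rate, and every figure below are hypothetical and only illustrate the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class Option:
    """A hypothetical build-or-buy candidate; all figures are illustrative."""
    name: str
    upfront_cost: float      # one-time cost: implementation, migration, onboarding
    annual_run_cost: float   # recurring cost: hosting, licences, subscription, support
    annual_eng_hours: float  # ongoing engineering effort to operate and maintain it


def total_cost_of_ownership(option: Option, years: int, hourly_rate: float) -> float:
    """Undiscounted TCO: upfront cost plus recurring spend and labour over the horizon."""
    yearly = option.annual_run_cost + option.annual_eng_hours * hourly_rate
    return option.upfront_cost + yearly * years


if __name__ == "__main__":
    build = Option("build", upfront_cost=200_000, annual_run_cost=30_000, annual_eng_hours=1_500)
    buy = Option("buy", upfront_cost=20_000, annual_run_cost=90_000, annual_eng_hours=200)
    for opt in (build, buy):
        print(opt.name, total_cost_of_ownership(opt, years=3, hourly_rate=100))
    # build 740000
    # buy 350000
```

Including engineering hours is what makes the "build" option expensive here, even though its hosting cost is lower. A real comparison would also discount future costs and weigh factors that a single number hides, such as developer productivity impact and organizational readiness.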
Your SRE Background and Experience
Articulate your hands-on experience with systems administration, monitoring tools, automation scripts, and any incident response involvement. Be specific about technologies (e.g., Prometheus, Grafana, Kubernetes, Docker, Terraform) and concrete examples of what you've built or fixed.
Technical Vision and Infrastructure Roadmap
This topic assesses a candidate's ability to define a multi-year technical vision for infrastructure, platforms, and systems and to translate that vision into a practical execution roadmap. Core skills include evaluating technology choices and architecture evolution, planning migration and modernization paths, anticipating scalability and capacity needs, and balancing cost and performance against resilience and operational reliability. Candidates should demonstrate approaches to managing technical debt, sequencing investments across quarters and releases, estimating resources and timelines, establishing measurable infrastructure goals and key performance indicators, and implementing governance and standards. Discussion may also cover reliability and observability, security and compliance considerations, trade-offs between short-term stability and long-term re-architecture, prioritization to enable business outcomes, and communicating technical trade-offs to both technical and non-technical stakeholders.
Platform Architecture for Organizational Scale
Designing internal platforms and infrastructure to support large engineering organizations and evolving teams. Topics include developer experience and self-service platform design, deployment platforms that enable safe, frequent releases for hundreds of engineers, platform automation and observability patterns that provide cross-service visibility, governance and operational policies, service onboarding and lifecycle, and how to evolve platform capabilities as headcount and service count grow. Candidates should discuss trade-offs between centralized platform services and team autonomy, metrics for platform health, and approaches to encourage adoption while minimizing operational friction.
Platform and Infrastructure Strategy
Covers how to design, build, and operate shared platforms and infrastructure that enable multiple product teams. Topics include defining platform scope and charter, developer experience and adoption strategies, API and service contracts, observability and reliability practices, service-level objectives (SLOs) and service-level agreements (SLAs), cost and capacity planning, security and compliance for shared services, platform governance and onboarding, measuring platform health and return on investment, and migration strategies for teams moving onto platform primitives rather than maintaining bespoke implementations.
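Service-level objectives for a shared platform are easier to reason about once they are translated into an error budget. The following is a minimal sketch, assuming an availability SLO expressed as the fraction of good minutes over a rolling 30-day window. The function names and the window length are illustrative choices, not a standard API.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of unavailability an availability SLO tolerates over the window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction such as 0.999")
    window_minutes = window_days * 24 * 60
    return window_minutes * (1.0 - slo)


def budget_remaining(slo: float, bad_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative once the SLO is breached)."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - bad_minutes) / budget


if __name__ == "__main__":
    for slo in (0.99, 0.999, 0.9999):
        print(f"{slo}: {error_budget_minutes(slo):.1f} min per 30 days")
    # prints 432.0, 43.2 and 4.3 minutes respectively
    print(f"{budget_remaining(0.999, bad_minutes=10.8):.2f}")  # 0.75
```

Under a 99.9% SLO, a single 10.8-minute outage consumes a quarter of the monthly budget. That gives platform and product teams a shared, concrete number to track alongside release and reliability decisions.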
Technology and Platform Selection
Evaluation and justification of the technologies, services, and platforms used to implement systems across the stack. Candidates should be able to select compute options, including virtual machines, containers, and serverless platforms, as well as orchestration and workflow engines, messaging systems, batch and streaming processing engines, object and block storage, data warehouses, and other data platforms. The topic encompasses comparing managed services with self-managed deployments and cloud with on-premises hosting, and choosing frameworks, runtimes, and overall stacks based on workload characteristics. Assessment focuses on weighing trade-offs across cost, operational overhead, reliability, latency and throughput, scaling characteristics, vendor lock-in, development velocity, team familiarity and learning curve, maturity and community support, security and compliance, and monitoring and debugging complexity. Candidates should demonstrate how system requirements map to service capabilities, justify build-versus-buy and managed-service decisions, design proof-of-concept experiments, and outline migration and rollout plans, all while making pragmatic choices that balance performance, cost, and operational risk.
Infrastructure Scaling and Capacity Planning
Operational and infrastructure-level planning to ensure systems meet current demand and projected growth. Topics include demand forecasting, headroom planning, and three-to-five-year capacity roadmaps; autoscaling policies and metrics-driven scaling using CPU, memory, and custom application metrics; load testing, benchmarking, and performance validation methodologies; cost modeling and right-sizing in cloud environments, and trade-offs between managed services and self-hosted solutions; designing non-disruptive upgrade and migration strategies; multi-region and availability-zone deployment strategies and their implications for data placement and latency; instrumentation and observability for capacity metrics; and mapping business growth projections to infrastructure acquisition and scaling decisions. Candidates should demonstrate how to translate requirements into capacity plans and how to validate assumptions with experiments and measurements.
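Metrics-driven autoscaling is commonly a proportional rule. The Kubernetes HorizontalPodAutoscaler documents its core calculation as desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue). It skips scaling when that ratio is within a tolerance of 1.0, which defaults to 0.1. The sketch below models only that formula plus min/max clamping. It is not the controller's implementation and ignores stabilization windows, pod readiness, and missing metrics. The function name and its defaults are hypothetical.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float, target_metric: float,
                     min_replicas: int = 1, max_replicas: int = 10,
                     tolerance: float = 0.1) -> int:
    """Proportional replica count, modelled on the documented Kubernetes HPA formula.

    current_metric is the average per-replica value (e.g. CPU utilization in percent)
    and target_metric is the desired per-replica value.
    """
    if current_replicas < 1 or target_metric <= 0:
        raise ValueError("need at least one replica and a positive target")
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: keep the current size to avoid flapping.
        proposed = current_replicas
    else:
        # Scale in proportion to how far the metric is from its target, rounding up.
        proposed = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, proposed))


if __name__ == "__main__":
    print(desired_replicas(4, current_metric=90, target_metric=60))   # 6: 50% over target
    print(desired_replicas(4, current_metric=63, target_metric=60))   # 4: within tolerance
    print(desired_replicas(4, current_metric=15, target_metric=60, min_replicas=2))  # 2: min bound
    print(desired_replicas(8, current_metric=120, target_metric=60))  # 10: max bound
```

Rounding up biases the result toward extra capacity. The tolerance band trades a little precision for stability. Both are knobs a capacity plan would tune against load-test results and observed traffic.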
Infrastructure Operations and Technical Debt
Infrastructure Operations and Technical Debt covers practical ownership of physical and virtual infrastructure and how to manage its long-term maintainability. Topics include server hardware fundamentals, storage and RAID concepts, out-of-band management and hardware monitoring, capacity planning, lifecycle and vendor management, and how to identify, prioritize, and remediate technical debt in infrastructure. Candidates should be able to discuss trade-offs when balancing infrastructure investment against feature work, refactoring or automation strategies to reduce operational burden, and techniques for planning upgrades or migrations with minimal disruption.
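Choosing a RAID level often comes down to usable capacity versus fault tolerance. The following is a simplified sketch. It assumes identical disks, two-way mirrors for RAID 10, and no hot spares or filesystem overhead. `raid_layout` and `RaidLayout` are illustrative names.

```python
from typing import NamedTuple


class RaidLayout(NamedTuple):
    usable_tb: float
    guaranteed_failures: int  # disk failures survivable no matter which disks fail


def raid_layout(level: str, disks: int, disk_tb: float) -> RaidLayout:
    """Usable capacity and guaranteed fault tolerance for a simplified RAID model."""
    if level == "0":
        min_disks, data_disks, failures = 2, disks, 0          # striping, no redundancy
    elif level == "1":
        min_disks, data_disks, failures = 2, 1, disks - 1      # every disk holds a full copy
    elif level == "5":
        min_disks, data_disks, failures = 3, disks - 1, 1      # one disk's worth of parity
    elif level == "6":
        min_disks, data_disks, failures = 4, disks - 2, 2      # two disks' worth of parity
    elif level == "10":
        if disks % 2:
            raise ValueError("RAID 10 needs an even number of disks")
        min_disks, data_disks, failures = 4, disks // 2, 1     # stripe across two-way mirrors
    else:
        raise ValueError(f"unsupported RAID level: {level}")
    if disks < min_disks:
        raise ValueError(f"RAID {level} needs at least {min_disks} disks")
    return RaidLayout(data_disks * disk_tb, failures)


if __name__ == "__main__":
    for level in ("0", "5", "6", "10"):
        layout = raid_layout(level, disks=6, disk_tb=4)
        print(f"RAID {level}: {layout.usable_tb} TB usable, "
              f"survives {layout.guaranteed_failures} failure(s)")
    # RAID 0: 24, RAID 5: 20, RAID 6: 16, RAID 10: 12 TB usable
```

RAID 10 can survive more than one failure when the failed disks sit in different mirror pairs, but only one failure is guaranteed. That is why the model reports the worst case.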
Capacity Planning and Forecasting
Covers forecasting demand and planning infrastructure and platform capacity to meet expected business needs reliably and cost-effectively. Candidates should be able to analyze historical usage and growth trends, build and validate capacity models, define capacity metrics and thresholds, estimate headroom and safety margins, and translate business growth scenarios into procurement or cloud provisioning plans and timelines. Includes storage and compute lifecycle planning, such as archiving and retention strategies, upgrade and rollout planning to avoid disruption, and trade-offs between overprovisioning and right-sizing. Also addresses design for scale and redundancy, autoscaling and elasticity patterns, load balancing and failover planning, capacity and stress testing, monitoring and alerting for capacity signals, and techniques to measure and improve forecast accuracy. Finally, it covers operational governance and decision-making, including cross-team resource allocation, capacity reviews, cost optimization and budgeting, runbooks and change control, and alignment of capacity plans with service-level objectives and business projections.
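A simple trend model is a reasonable starting point for turning historical usage into a provisioning date. The following is a minimal sketch. It assumes usage grows roughly linearly per period and that a fixed fraction of capacity is held back as headroom. Real workloads often need seasonal or compound-growth models. The function names, the 20% headroom, and the sample figures are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def fit_linear_trend(usage: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of usage[t] ~ intercept + slope * t for t = 0, 1, 2, ..."""
    n = len(usage)
    if n < 2:
        raise ValueError("need at least two observations")
    mean_t = (n - 1) / 2
    mean_u = sum(usage) / n
    cov = sum((t - mean_t) * (u - mean_u) for t, u in enumerate(usage))
    var = sum((t - mean_t) ** 2 for t in range(n))
    slope = cov / var
    return mean_u - slope * mean_t, slope


def periods_until_limit(usage: Sequence[float], capacity: float,
                        headroom: float = 0.2) -> Optional[float]:
    """Periods after the last observation until the trend reaches capacity minus headroom.

    Returns None if the trend is flat or declining, and 0.0 if the fitted trend
    is already past the limit at the last observation.
    """
    intercept, slope = fit_linear_trend(usage)
    if slope <= 0:
        return None
    limit = capacity * (1 - headroom)  # keep a safety margin unallocated
    t_cross = (limit - intercept) / slope
    return max(0.0, t_cross - (len(usage) - 1))


if __name__ == "__main__":
    monthly_peak_tb = [40, 42, 44, 46, 48, 50]  # hypothetical monthly peak storage use
    months = periods_until_limit(monthly_peak_tb, capacity=100, headroom=0.2)
    print(f"{months:.1f} months until 80 TB")  # 15.0 months until 80 TB
```

Procurement or provisioning lead time would then be subtracted from this figure to find when an order must be placed. Comparing each period's forecast with the value actually observed is one way to measure forecast accuracy over time.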