InterviewStack.io

Infrastructure Implementation and Operations Questions

Hands-on design, deployment, and operational management of infrastructure components and services. This includes setting up and configuring load balancers, database replication and high availability, caching layers, networking and network security, service discovery and routing, container deployment and orchestration, monitoring and observability, logging and alerting, backup and disaster recovery strategies, and runtime secrets management. Candidates should be able to walk through concrete implementations, explain trade-offs, demonstrate troubleshooting and performance tuning, and show how infrastructure components integrate to meet availability, scalability, and security requirements.

Medium · Technical
Compare at-least-once, at-most-once, and exactly-once delivery semantics in streaming systems. For a deduplication streaming job that writes to a data lake, which semantics would you aim for, and how would you implement it using Kafka and idempotent sinks?
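
One way to reason about the idempotent-sink part of this question is the sketch below. It assumes at-least-once delivery from Kafka, so records may be redelivered, and a sink that is keyed by a business identifier so that replays have no effect. The `Record` and `IdempotentSink` names, the `event_id` field, and the in-memory dictionary standing in for a data lake table are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Record:
    """A consumed message (hypothetical shape, not a Kafka client type)."""
    topic: str
    partition: int
    offset: int
    event_id: str   # assumed unique business key used for deduplication
    payload: str


@dataclass
class IdempotentSink:
    """In-memory stand-in for a data lake table keyed by event_id."""
    rows: dict = field(default_factory=dict)
    # (topic, partition) -> next offset to read, tracked alongside the data
    committed: dict = field(default_factory=dict)

    def write_batch(self, batch):
        for rec in batch:
            # First write wins: a redelivered or duplicate event_id is a no-op.
            self.rows.setdefault(rec.event_id, rec.payload)
            tp = (rec.topic, rec.partition)
            # Offsets only move forward, so replaying an old batch is harmless.
            self.committed[tp] = max(self.committed.get(tp, 0), rec.offset + 1)


batch = [
    Record("events", 0, 0, "a", "x"),
    Record("events", 0, 1, "b", "y"),
    Record("events", 0, 2, "a", "x"),  # duplicate event within the stream
]
sink = IdempotentSink()
sink.write_batch(batch)
sink.write_batch(batch)  # simulated redelivery after a consumer restart
```

Under these assumptions, the sink ends up with one row per `event_id` no matter how many times the batch is delivered, which is what gives an at-least-once pipeline an effectively-once result.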
Hard · Technical
Write a Prometheus recording rule and an alerting rule (YAML) that compute the 99th-percentile processing latency for the metric 'etl_processing_seconds' over a 1h sliding window and fire an alert if the p99 exceeds 5s for three consecutive evaluation intervals. Provide only the two rules.
Hard · Technical
You must convince product and security stakeholders to approve a multi-day read-only maintenance window for a major metadata migration. Prepare a stakeholder communication plan that outlines risk mitigations, rollback procedures, business impact assessment, verification steps post-migration, and KPIs that will demonstrate success.
Easy · Technical
Explain the role of a secrets manager (e.g., AWS Secrets Manager or HashiCorp Vault) in infrastructure. What are best practices for handling DB credentials used by ETL jobs and for rotating secrets with minimal disruption?
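
A minimal sketch of one rotation-tolerant pattern an answer might describe: the ETL job caches the credential for a short time and re-fetches it once if authentication fails, so a rotation does not require a restart. Everything here is an assumption for illustration: `CachedSecret`, `AuthError`, `run_with_rotation_retry`, and the `fetch` callable stand in for a real secrets manager client and database driver, and none of them is an actual AWS Secrets Manager or Vault API.

```python
import time
from typing import Callable, Optional


class AuthError(Exception):
    """Hypothetical error raised when the database rejects a credential."""


class CachedSecret:
    """Caches a secret for ttl_seconds and supports a forced refresh."""

    def __init__(self, fetch: Callable[[], str], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch            # assumed to call the secrets manager
        self._ttl = ttl_seconds
        self._clock = clock
        self._value: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        # Re-fetch only when nothing is cached or the cached value has expired.
        if self._value is None or self._clock() >= self._expires_at:
            self.refresh()
        return self._value

    def refresh(self) -> str:
        self._value = self._fetch()
        self._expires_at = self._clock() + self._ttl
        return self._value


def run_with_rotation_retry(secret: CachedSecret, action: Callable[[str], str]) -> str:
    """Run action with the cached secret; on AuthError, refresh once and retry."""
    try:
        return action(secret.get())
    except AuthError:
        return action(secret.refresh())


# Simulate a rotation: the store returns the old password first, then the new one.
versions = iter(["pw-v1", "pw-v2"])
secret = CachedSecret(fetch=lambda: next(versions), ttl_seconds=300)


def connect(password: str) -> str:
    # Stand-in for a database connection that only accepts the rotated password.
    if password != "pw-v2":
        raise AuthError("stale credential")
    return "connected"


result = run_with_rotation_retry(secret, connect)
```

The single retry keeps a stale cache from failing the job while still surfacing a real misconfiguration, since a second `AuthError` propagates to the caller.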
Hard · Technical
Your primary cloud region experiences a catastrophic outage that corrupts the primary metadata store for the data lake. Walk through a detailed recovery plan to restore data availability and integrity across analytics clusters, including verification steps, timelines, coordination with stakeholders, and how you would prevent recurrence.
