InterviewStack.io

Cost Optimization at Scale Questions

Addresses cost-conscious design and operational practices for systems operating at large scale and high volume. Candidates should discuss measuring and improving unit economics such as cost per request or cost per customer; multi-tier storage strategies and lifecycle management; caching, batching, and request consolidation to reduce resource use; data and model compression; optimizing network and I/O patterns; and minimizing egress and transfer charges. Senior discussions include product-level trade-offs, prioritizing cost reductions against feature velocity, instrumentation and observability for ongoing cost measurement, automation and runbook approaches to enforce cost controls, and organizational practices to continuously identify, quantify, and implement savings without compromising critical service-level objectives. The topic emphasizes measurement, benchmarking, risk assessment, and communicating expected savings and operational impacts to stakeholders.
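Unit economics like cost per 1k requests reduce to simple arithmetic over a component-level bill. A minimal sketch, assuming a hypothetical monthly bill broken down by component (all dollar figures and the `cost_per_1k_requests` helper are invented for illustration):

```python
def cost_per_1k_requests(monthly_costs_usd: dict, monthly_requests: int) -> float:
    """Blended cost per 1,000 requests across all cost components."""
    total = sum(monthly_costs_usd.values())
    return total / (monthly_requests / 1_000)

# Invented example bill: $51,700/month serving 120M requests
bill = {"compute": 42_000.0, "storage": 6_500.0, "egress": 3_200.0}
unit_cost = cost_per_1k_requests(bill, monthly_requests=120_000_000)
```

Tracking this metric per component over time is what lets a team attribute regressions (e.g. an egress spike) rather than only seeing total spend move.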

Hard · System Design
Design a cost-optimized inference platform for a 70B-parameter LLM that must serve 1M requests/day with a 200ms P95 SLO and a target cost of under $0.50 per 1k requests. Discuss model serving strategy (sharding, tensor-slicing, quantization), caching, autoscaling, and hardware selection. Provide estimated trade-offs.
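A back-of-envelope capacity model is a good opening move for this question. The sketch below is a simplification under stated assumptions: steady per-GPU throughput, a flat peak-to-average factor, and invented prices; real sizing must also account for quantization effects on throughput, sharding overhead for a 70B model, and autoscaling lag.

```python
import math

def required_gpus(requests_per_day: int, per_gpu_throughput_rps: float,
                  peak_factor: float = 2.0) -> int:
    """GPUs needed to serve peak load, given sustained per-GPU throughput."""
    avg_rps = requests_per_day / 86_400
    return math.ceil(avg_rps * peak_factor / per_gpu_throughput_rps)

def cost_per_1k(requests_per_day: int, gpus: int, gpu_hour_usd: float) -> float:
    """Blended cost per 1,000 requests for an always-on GPU fleet."""
    daily_cost = gpus * gpu_hour_usd * 24
    return daily_cost / (requests_per_day / 1_000)

# Assumed numbers: 5 req/s per GPU after batching/quantization, $2.50/GPU-hour
gpus = required_gpus(1_000_000, per_gpu_throughput_rps=5.0)   # -> 5
unit = cost_per_1k(1_000_000, gpus, gpu_hour_usd=2.5)         # -> $0.30 per 1k
```

Under these assumed numbers the target of $0.50 per 1k requests is met with headroom; the interesting discussion is which assumption (throughput per GPU, peak factor, GPU price) is most fragile.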
Easy · Technical
Explain differences between Kubernetes HPA (Horizontal Pod Autoscaler), Cluster Autoscaler, and VPA (Vertical Pod Autoscaler). For each, state one cost optimization scenario where it helps and one potential cost pitfall.
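The HPA's core scaling rule is documented as desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric). A minimal sketch of that rule with min/max clamping (the real controller also applies a tolerance band and stabilization windows, which this omits):

```python
import math

def hpa_desired_replicas(current_replicas: int, current_metric: float,
                         target_metric: float, min_replicas: int = 1,
                         max_replicas: int = 10) -> int:
    """Kubernetes HPA scaling rule: scale proportionally to metric ratio,
    then clamp to the configured replica bounds."""
    desired = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, desired))

# 4 replicas at 90% CPU against a 60% target -> scale out to 6
hpa_desired_replicas(4, 90, 60)
```

A cost pitfall visible in the formula itself: a very low CPU target keeps the ratio high, so the fleet is held permanently over-provisioned even at modest load.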
Medium · Technical
Write a Python function simulate_cost(batch_sizes: List[int], qps: int, latency_budget_ms: int, cost_per_gpu_hour: float) that estimates monthly cost for inference given different batch sizes assuming fixed GPU latency per batch. Provide the function signature and brief algorithmic description; you do not need to implement numeric integration of jitter—describe assumptions in comments.
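One possible sketch of the requested function, under deliberately simple assumptions spelled out in the docstring (fixed per-batch latency, sequential batches per GPU, no jitter, a 730-hour month; `base_batch_latency_ms` is an extra parameter added here for illustration):

```python
import math
from typing import Dict, List

def simulate_cost(batch_sizes: List[int], qps: int, latency_budget_ms: int,
                  cost_per_gpu_hour: float,
                  base_batch_latency_ms: float = 50.0) -> Dict[int, float]:
    """Estimate monthly GPU cost for each candidate batch size.

    Assumptions (no jitter modeled):
    - Every batch takes a fixed base_batch_latency_ms regardless of size.
    - A batch size is infeasible if that latency exceeds the budget.
    - One GPU processes batches sequentially, so its throughput is
      batch_size * (1000 / base_batch_latency_ms) requests/sec.
    """
    hours_per_month = 730
    results: Dict[int, float] = {}
    for b in batch_sizes:
        if base_batch_latency_ms > latency_budget_ms:
            continue  # this batch can never meet the latency budget
        per_gpu_rps = b * (1000 / base_batch_latency_ms)
        gpus = math.ceil(qps / per_gpu_rps)
        results[b] = gpus * cost_per_gpu_hour * hours_per_month
    return results
```

The model makes the core trade-off visible: larger batches multiply per-GPU throughput and so shrink the fleet, until latency (which in reality grows with batch size) hits the budget.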
Medium · System Design
Design a multi-tier storage plan for a 5 PB ML dataset used for training and analytics. Requirements: keep last 90 days hot for training, older data accessible within 6 hours for retraining, cost target 50% lower than keeping all on SSD, and minimal operational overhead. List components, lifecycle policies, and expected trade-offs.
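A quick cost model helps verify the 50% target before naming components. The sketch below compares an all-SSD baseline against a two-tier split; the per-TB prices are invented placeholders, not any provider's actual rates:

```python
def tiered_monthly_cost(total_pb: float, hot_fraction: float,
                        hot_usd_per_tb: float, cold_usd_per_tb: float) -> float:
    """Monthly storage cost for a two-tier hot/cold layout."""
    total_tb = total_pb * 1024
    hot = total_tb * hot_fraction * hot_usd_per_tb
    cold = total_tb * (1 - hot_fraction) * cold_usd_per_tb
    return hot + cold

# Invented prices: $100/TB-month SSD, $4/TB-month archive tier.
# Assume the 90-day hot window covers ~20% of the 5 PB dataset.
baseline = tiered_monthly_cost(5, 1.0, 100.0, 0.0)   # all on SSD
tiered = tiered_monthly_cost(5, 0.2, 100.0, 4.0)     # hot/cold split
```

With these assumed numbers the split clears the 50% target easily; the remaining discussion is retrieval latency (the 6-hour SLA rules out the deepest archive tiers for some providers) and lifecycle-transition plus retrieval fees, which this model omits.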
Hard · Technical
Implement a Python function pack_requests(requests: List[int], max_batch_tokens: int, latency_budget_ms: int) that greedily packs variable-length requests (token counts) into batches without exceeding max_batch_tokens. Describe how you would modify the greedy approach to respect a latency budget per request assuming batch processing time grows with total tokens.
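One possible greedy sketch, already folding in the latency modification via an assumed linear cost model (`ms_per_token` is an invented constant; a real system would measure it). Note an oversized single request still gets its own batch rather than being rejected:

```python
from typing import List

def pack_requests(requests: List[int], max_batch_tokens: int,
                  latency_budget_ms: int,
                  ms_per_token: float = 0.1) -> List[List[int]]:
    """Greedily pack per-request token counts into batches.

    Assumption: batch processing time grows linearly with total tokens
    (estimated as total_tokens * ms_per_token). A batch is closed when
    adding the next request would exceed max_batch_tokens or push the
    estimated processing time past the latency budget.
    """
    batches: List[List[int]] = []
    current: List[int] = []
    tokens = 0
    for r in sorted(requests, reverse=True):  # largest-first reduces fragmentation
        fits_tokens = tokens + r <= max_batch_tokens
        fits_latency = (tokens + r) * ms_per_token <= latency_budget_ms
        if current and not (fits_tokens and fits_latency):
            batches.append(current)
            current, tokens = [], 0
        current.append(r)
        tokens += r
    if current:
        batches.append(current)
    return batches
```

A refinement worth raising in the interview: sorting by arrival deadline rather than size, so that requests nearing their latency budget are flushed first instead of waiting for a batch to fill.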
