InterviewStack.io

Performance Engineering and Cost Optimization Questions

Engineering practices and trade-offs for meeting performance objectives while controlling operational cost. Topics include setting latency and throughput targets and latency budgets; benchmarking, profiling, and tuning across the application, database, and infrastructure layers; memory, compute, serialization, and batching optimizations; asynchronous processing and workload shaping; capacity estimation and right-sizing of compute and storage to reduce cost; cloud cost drivers, including network egress and storage tiering; trade-offs between real-time and batch processing; and monitoring to detect and prevent performance regressions. Candidates should describe measurement-driven approaches to optimization and be able to justify trade-offs among cost, complexity, and user experience.

Hard | Technical
Design and implement (in pseudocode or Python) a feedback controller that dynamically adjusts the batching window (max_wait_ms) and the maximum batch size at runtime to meet a target p95 latency SLO under fluctuating traffic. Describe the control loop, the choice of metrics, stability considerations (avoiding oscillations), safety limits, and how to prevent harmful parameter changes during sudden spikes.
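One possible starting point for this question is an AIMD-style (additive increase, multiplicative decrease) loop. The sketch below is illustrative only: the class name `BatchController`, every default value, and the assumption that latency is dominated by batching wait and per-batch compute time (rather than by queueing under overload) are choices made here, not part of the question.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class BatchController:
    """AIMD-style controller for batching parameters (illustrative sketch)."""
    target_p95_ms: float
    # Safety limits: the controller never leaves these ranges.
    min_wait_ms: float = 1.0
    max_wait_ms_limit: float = 50.0
    min_batch: int = 1
    max_batch_limit: int = 64
    # Stability knobs (all values are assumptions to tune per workload).
    deadband: float = 0.10      # no change while within +/-10% of target
    alpha: float = 0.3          # EWMA weight of the newest p95 sample
    spike_factor: float = 2.0   # raw p95 above 2x target counts as a spike
    cooldown_ticks: int = 5     # ticks without increases after a spike
    # Current outputs.
    wait_ms: float = 10.0
    batch_size: int = 8
    _ewma: Optional[float] = None
    _cooldown: int = 0

    def update(self, observed_p95_ms: float) -> Tuple[float, int]:
        """Consume one p95 sample per control interval; return new settings."""
        if self._ewma is None:
            self._ewma = observed_p95_ms
        else:
            self._ewma = (self.alpha * observed_p95_ms
                          + (1.0 - self.alpha) * self._ewma)

        # Spike guard: react to the raw sample, not the smoothed one.
        # Drop the wait to its floor but keep the batch size, since
        # shrinking batches during a surge would reduce throughput.
        if observed_p95_ms > self.spike_factor * self.target_p95_ms:
            self.wait_ms = self.min_wait_ms
            self._cooldown = self.cooldown_ticks
            return self.wait_ms, self.batch_size

        allow_increase = self._cooldown == 0
        if self._cooldown > 0:
            self._cooldown -= 1

        ratio = self._ewma / self.target_p95_ms
        if ratio > 1.0 + self.deadband:
            # Multiplicative decrease: back off quickly when over the SLO.
            self.wait_ms *= 0.7
            self.batch_size = int(self.batch_size * 0.8)
        elif ratio < 1.0 - self.deadband and allow_increase:
            # Additive increase: probe for more batching slowly.
            self.wait_ms += 1.0
            self.batch_size += 1

        # Clamp to the safety limits.
        self.wait_ms = min(self.max_wait_ms_limit,
                           max(self.min_wait_ms, self.wait_ms))
        self.batch_size = min(self.max_batch_limit,
                              max(self.min_batch, self.batch_size))
        return self.wait_ms, self.batch_size
```

The EWMA and the deadband are what damp oscillation here, the clamps act as the safety limits, and the cooldown keeps the controller from growing batches again immediately after a spike. A fuller answer would also discuss the control interval and what to do when queueing, not batching, dominates latency.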
Medium | Technical
Estimate the monthly cost of an image-classification inference service given: a steady average of 500 RPS; an average request upload of 50 KB and a response of 10 KB; 30% of traffic egressing to the EU from us-east-1; GPU instances that cost $2.50/hr and each provide 100 RPS at 70% utilization; and 200 GB of model storage at $0.02/GB-month. Break down the monthly compute, egress, and storage costs, and propose a plan to reduce total cost by ~30% with concrete levers.
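The arithmetic for this estimate can be sketched as follows. Several inputs are assumptions rather than givens: a 730-hour month, an egress price of $0.09/GB (the question does not state one), free ingress for uploads, and reading "100 RPS at 70% utilization" as the sustained per-instance rate. The function name `monthly_cost` is hypothetical.

```python
import math

HOURS_PER_MONTH = 730          # assumption: average month length in hours
EGRESS_PRICE_PER_GB = 0.09     # assumption: the question gives no egress price


def monthly_cost(avg_rps=500, response_kb=10, egress_fraction=0.30,
                 gpu_hourly=2.5, gpu_rps=100, storage_gb=200,
                 storage_price=0.02, egress_price=EGRESS_PRICE_PER_GB):
    """Return a breakdown of monthly compute, egress, and storage cost."""
    # Compute: "100 RPS at 70% utilization" is read as the sustained rate
    # one instance serves, so no further utilization factor is applied.
    instances = math.ceil(avg_rps / gpu_rps)
    compute = instances * gpu_hourly * HOURS_PER_MONTH

    # Egress: only responses leave the region. Uploads are ingress,
    # assumed here to be free.
    seconds = HOURS_PER_MONTH * 3600
    egress_gb = avg_rps * egress_fraction * response_kb * seconds / 1e6
    egress = egress_gb * egress_price

    storage = storage_gb * storage_price
    return {
        "instances": instances,
        "compute": compute,
        "egress_gb": egress_gb,
        "egress": egress,
        "storage": storage,
        "total": compute + egress + storage,
    }


if __name__ == "__main__":
    for key, value in monthly_cost().items():
        print(f"{key}: {value:,.2f}")
```

Under these assumptions compute comes to $9,125, egress to about $355 (roughly 3,942 GB), and storage to $4, so compute is about 96% of the total. A 30% reduction therefore has to come mainly from compute levers such as cheaper instance types, model optimization, or better batching.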
Medium | Technical
Compare pruning, knowledge distillation, and quantization as techniques to reduce model size. For each technique, explain expected impacts on inference latency, memory footprint, training/inference complexity, and accuracy. Provide guidance on which technique to try first given a strict latency target and limited engineering budget.
Medium | System Design
Design an inference service for a binary classification model that must support 10,000 QPS peak, p95 latency <100ms, and a monthly cloud budget of $5,000 for inference compute and egress. Describe system components, autoscaling strategy, batching and caching decisions, model optimization options, and an approach to estimate instance counts and expected cost.
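A quick feasibility check can frame the instance-count estimate: convert the monthly budget into an hourly one, then work out how many QPS each dollar-hour of compute must serve. The sketch below assumes a 730-hour month; the function name and the `avg_to_peak_ratio` and `egress_monthly` parameters are hypothetical, and the question gives values for neither.

```python
HOURS_PER_MONTH = 730  # assumption: average month length in hours


def required_qps_per_dollar_hour(peak_qps, monthly_budget,
                                 avg_to_peak_ratio=1.0, egress_monthly=0.0):
    """QPS that each $1/hr of compute must serve to stay within budget.

    avg_to_peak_ratio models autoscaling: if capacity tracks load, the
    billed capacity is closer to average load than to peak (1.0 means
    provisioning for peak around the clock).
    """
    compute_budget_hourly = (monthly_budget - egress_monthly) / HOURS_PER_MONTH
    if compute_budget_hourly <= 0:
        raise ValueError("egress alone exhausts the budget")
    billed_qps = peak_qps * avg_to_peak_ratio
    return billed_qps / compute_budget_hourly


if __name__ == "__main__":
    # Provisioned for peak all month with no egress cost.
    print(required_qps_per_dollar_hour(10_000, 5_000))
    # Hypothetical: autoscaling keeps billed capacity at 40% of peak.
    print(required_qps_per_dollar_hour(10_000, 5_000, avg_to_peak_ratio=0.4))
```

With the numbers in the question and peak provisioning, each $1/hr of compute has to serve about 1,460 QPS, which indicates how much batching, caching, and model optimization the design needs before any instance type fits the budget.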
Medium | Technical
Write a Python function that selects the cheapest instance type (CPU or GPU) and the required instance count given: per-instance throughput and hourly cost for CPU and GPU, predicted average request rate, and latency SLO (max latency). Assume a per-instance utilization target (e.g., 70%). Include comments describing assumptions and how you handle rounding/over-provisioning.
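A minimal sketch of such a function is shown below. The `InstanceType` record and its `latency_ms` field are hypothetical: the question supplies a latency SLO but no per-instance latency, so the sketch assumes each instance type has a known per-request latency at the target utilization and that this latency does not change with instance count.

```python
import math
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class InstanceType:
    name: str
    throughput_rps: float   # peak sustainable requests/sec per instance
    hourly_cost: float      # USD per instance-hour
    latency_ms: float       # assumed per-request latency at target utilization


def select_cheapest(options: Iterable[InstanceType],
                    request_rate_rps: float,
                    latency_slo_ms: float,
                    utilization_target: float = 0.7
                    ) -> Tuple[InstanceType, int, float]:
    """Return (instance type, instance count, total hourly cost)."""
    if not 0 < utilization_target <= 1:
        raise ValueError("utilization_target must be in (0, 1]")

    best = None
    for opt in options:
        # Skip types that cannot meet the latency SLO at all.
        if opt.latency_ms > latency_slo_ms:
            continue
        # Only a fraction of peak throughput is planned for, leaving
        # headroom for bursts.
        usable_rps = opt.throughput_rps * utilization_target
        # Round up: a fractional instance cannot be provisioned, so this
        # deliberately over-provisions. Always keep at least one instance.
        count = max(1, math.ceil(request_rate_rps / usable_rps))
        cost = count * opt.hourly_cost
        if best is None or cost < best[2]:
            best = (opt, count, cost)

    if best is None:
        raise ValueError("no instance type meets the latency SLO")
    return best


if __name__ == "__main__":
    # Hypothetical instance types for illustration.
    cpu = InstanceType("cpu", throughput_rps=20, hourly_cost=0.2, latency_ms=80)
    gpu = InstanceType("gpu", throughput_rps=100, hourly_cost=2.5, latency_ms=15)
    choice, count, cost = select_cheapest([cpu, gpu], 500, latency_slo_ms=100)
    print(choice.name, count, round(cost, 2))
```

Rounding up with `math.ceil` means the fleet always has at least the planned headroom, at the price of paying for a partly idle last instance. Ties in cost go to whichever option appears first.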
