InterviewStack.io

Performance Engineering and Cost Optimization Questions

Engineering practices and trade-offs for meeting performance objectives while controlling operational cost. Topics include setting latency and throughput targets and latency budgets; benchmarking, profiling, and tuning across the application, database, and infrastructure layers; memory, compute, serialization, and batching optimizations; asynchronous processing and workload shaping; capacity estimation and right-sizing for compute and storage to reduce cost; understanding cost drivers in cloud environments, including network egress and storage tiering; trade-offs between real-time and batch processing; and monitoring to detect and prevent performance regressions. Candidates should describe measurement-driven approaches to optimization and be able to justify trade-offs among cost, complexity, and user experience.

Medium | Technical | 45 practiced
Compare pruning, knowledge distillation, and quantization as techniques to reduce model size. For each technique, explain expected impacts on inference latency, memory footprint, training/inference complexity, and accuracy. Provide guidance on which technique to try first given a strict latency target and limited engineering budget.
Medium | Technical | 45 practiced
Write a Python function that selects the cheapest instance type (CPU or GPU) and the required instance count given: per-instance throughput and hourly cost for CPU and GPU, predicted average request rate, and latency SLO (max latency). Assume a per-instance utilization target (e.g., 70%). Include comments describing assumptions and how you handle rounding/over-provisioning.
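One way to approach this question is sketched below. It is a minimal illustration, not a reference answer. The names `InstanceType` and `select_cheapest_instance` are hypothetical, and the sketch assumes each instance type comes with a known per-request latency (`latency_ms`) that can be compared directly against the SLO, which the question does not specify.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceType:
    name: str
    throughput_rps: float  # sustained requests/sec per instance at full load
    hourly_cost: float     # cost per instance-hour
    latency_ms: float      # ASSUMPTION: known per-request latency on this type


def select_cheapest_instance(options, request_rate_rps, latency_slo_ms,
                             utilization_target=0.7):
    """Return (instance, count, hourly_cost) for the cheapest option that
    meets the latency SLO, or None if no option qualifies."""
    if not 0 < utilization_target <= 1:
        raise ValueError("utilization_target must be in (0, 1]")
    best = None
    for inst in options:
        # Skip types that cannot meet the SLO even when lightly loaded.
        if inst.latency_ms > latency_slo_ms:
            continue
        # Only a fraction of peak throughput is usable, leaving headroom
        # for bursts around the predicted average rate.
        usable_rps = inst.throughput_rps * utilization_target
        # Round up: fractional instances do not exist, so this
        # deliberately over-provisions. Keep at least one instance.
        count = max(1, math.ceil(request_rate_rps / usable_rps))
        cost = count * inst.hourly_cost
        if best is None or cost < best[2]:
            best = (inst, count, cost)
    return best


# Illustrative numbers only; they are not taken from any real price list.
cpu = InstanceType("cpu", throughput_rps=50, hourly_cost=0.25, latency_ms=80)
gpu = InstanceType("gpu", throughput_rps=400, hourly_cost=1.5, latency_ms=15)
choice = select_cheapest_instance([cpu, gpu], request_rate_rps=300,
                                  latency_slo_ms=100)
```

With these illustrative inputs, the CPU fleet wins under a 100 ms SLO, while tightening the SLO below the assumed CPU latency forces the GPU choice. A fuller answer would also account for latency growth as utilization rises, which this sketch treats as constant.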
Medium | Technical | 80 practiced
Given a small Transformer model where the average token generation cost is 5 ms on GPU versus 20 ms on CPU, GPUs cost 5x as much per hour as CPUs, and the expected inference workload is 200 tokens/sec, describe how you would decide whether to serve on GPU or CPU. Include calculations for cost per token, batching effects, latency SLO considerations, and how utilization affects the decision.
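The arithmetic behind this question can be sketched as follows. This is a rough illustration under stated assumptions: tokens are generated one at a time per instance unless `batch_speedup` is set, the CPU hourly cost is normalized to 1.0 so the GPU costs 5.0, and `batch_speedup` is a hypothetical throughput multiplier standing in for whatever gain batching actually delivers. The function name `fleet_cost` is also illustrative.

```python
import math


def fleet_cost(tokens_per_sec, ms_per_token, hourly_cost,
               utilization=1.0, batch_speedup=1.0):
    """Return (instance_count, fleet_hourly_cost, cost_per_million_tokens)."""
    # Unbatched throughput is 1000 / ms_per_token tokens/sec per instance;
    # batch_speedup is an ASSUMED multiplier for batched serving.
    per_instance_tps = 1000.0 / ms_per_token * batch_speedup
    # Only a fraction of peak throughput is planned for; round up.
    count = math.ceil(tokens_per_sec / (per_instance_tps * utilization))
    hourly = count * hourly_cost
    cost_per_million = hourly / (tokens_per_sec * 3600) * 1e6
    return count, hourly, cost_per_million


workload = 200  # tokens/sec, from the question

# 100% utilization, no batching: 1 GPU (cost 5) versus 4 CPUs (cost 4).
gpu_full = fleet_cost(workload, ms_per_token=5, hourly_cost=5.0)
cpu_full = fleet_cost(workload, ms_per_token=20, hourly_cost=1.0)

# 70% utilization target: 2 GPUs (cost 10) versus 6 CPUs (cost 6).
gpu_70 = fleet_cost(workload, 5, 5.0, utilization=0.7)
cpu_70 = fleet_cost(workload, 20, 1.0, utilization=0.7)

# Hypothetical 4x batching gain on GPU: 1 GPU (cost 5) now beats 6 CPUs.
gpu_batched = fleet_cost(workload, 5, 5.0, utilization=0.7, batch_speedup=4.0)
```

Under these assumptions the CPU fleet is cheaper without batching, and the GPU becomes competitive only if batching raises its effective throughput enough. The latency side still favors the GPU at 5 ms per token, so a tight per-token SLO can override the cost result.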
Medium | Technical | 56 practiced
You receive a black-box ML inference pipeline: client serialization → network → server preprocessing → model on GPU → postprocessing. Describe a detailed production profiling plan that measures latency contribution of each stage with minimal overhead. Include sampling, distributed tracing strategy, tools for GPU kernel timing, and how you'll correlate traces to identify the dominant contributor to tail latency.
Hard | Technical | 58 practiced
Design and implement (in pseudocode or Python) a feedback controller that dynamically adjusts the batching window (max_wait_ms) and the maximum batch size at runtime to meet a target p95 latency SLO under fluctuating traffic. Describe the control loop, choice of metrics, stability considerations (avoiding oscillations), safety limits, and how to prevent harmful parameter changes during sudden spikes.
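A possible starting point is an additive-increase, multiplicative-decrease (AIMD) controller with a dead band, sketched below. This is one design among several, not the expected answer. The class name, all default values, and the spike heuristic (arrival rate more than doubling between updates) are assumptions chosen for illustration. The sketch also assumes the caller supplies a p95 already smoothed over a measurement window.

```python
from dataclasses import dataclass


@dataclass
class BatchingController:
    target_p95_ms: float
    max_wait_ms: float = 5.0
    max_batch_size: int = 8
    # Safety limits: parameters are always clamped to these bounds.
    min_wait_ms: float = 0.0
    wait_ceiling_ms: float = 20.0
    min_batch: int = 1
    batch_ceiling: int = 64
    dead_band: float = 0.1     # no change within +/-10% of target (hysteresis)
    wait_step_ms: float = 0.5  # additive increase step
    backoff: float = 0.5       # multiplicative decrease factor
    spike_ratio: float = 2.0   # ASSUMED spike test: rate grew more than 2x
    _prev_rate: float = 0.0

    def update(self, observed_p95_ms, arrival_rate_rps):
        """Run one control step and return (max_wait_ms, max_batch_size)."""
        spike = (self._prev_rate > 0
                 and arrival_rate_rps > self.spike_ratio * self._prev_rate)
        self._prev_rate = arrival_rate_rps
        upper = self.target_p95_ms * (1 + self.dead_band)
        lower = self.target_p95_ms * (1 - self.dead_band)
        if observed_p95_ms > upper:
            # Over the SLO: back off quickly. Shorter waits cut latency directly.
            self.max_wait_ms = max(self.min_wait_ms,
                                   self.max_wait_ms * self.backoff)
            # During a spike, keep the batch size: shrinking it would
            # reduce throughput exactly when queues are building.
            if not spike:
                self.max_batch_size = max(
                    self.min_batch, int(self.max_batch_size * self.backoff))
        elif observed_p95_ms < lower and not spike:
            # Comfortable headroom and stable traffic: probe upward slowly.
            self.max_wait_ms = min(self.wait_ceiling_ms,
                                   self.max_wait_ms + self.wait_step_ms)
            self.max_batch_size = min(self.batch_ceiling,
                                      self.max_batch_size + 1)
        # Inside the dead band nothing changes, which damps oscillation.
        return self.max_wait_ms, self.max_batch_size


controller = BatchingController(target_p95_ms=100.0)
step1 = controller.update(observed_p95_ms=50.0, arrival_rate_rps=100.0)
step2 = controller.update(observed_p95_ms=150.0, arrival_rate_rps=110.0)
```

The asymmetry between slow increases and fast decreases is what keeps the loop from oscillating around the target. A production version would likely add a cooldown between adjustments and a minimum sample count before trusting the p95, both omitted here for brevity.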
