InterviewStack.io

Performance Engineering and Cost Optimization Questions

Engineering practices and trade-offs for meeting performance objectives while controlling operational cost. Topics include setting latency and throughput targets and latency budgets; benchmarking, profiling, and tuning across the application, database, and infrastructure layers; memory, compute, serialization, and batching optimizations; asynchronous processing and workload shaping; capacity estimation and right-sizing of compute and storage to reduce cost; understanding cost drivers in cloud environments, including network egress and storage tiering; trade-offs between real-time and batch processing; and monitoring to detect and prevent performance regressions. Candidates should describe measurement-driven approaches to optimization and be able to justify trade-offs among cost, complexity, and user experience.

Medium · Technical
80 practiced
Given a small Transformer model where average token generation cost is 5ms on GPU versus 20ms on CPU, GPUs cost 5x per hour relative to CPUs, and the expected inference workload is 200 tokens/sec, describe how you would decide whether to serve on GPU or CPU. Include calculations for cost-per-token, batching effects, latency SLO considerations, and how utilization affects the decision.
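
A minimal sketch of the cost-per-token arithmetic this question is after, using only the numbers given above (5ms vs 20ms per token, a 5x hourly price ratio, 200 tokens/sec). The absolute `cpu_hourly` price is a hypothetical baseline; only the ratio matters for the comparison.

```python
# Sketch: GPU vs CPU cost-per-token using the figures from the question.
# cpu_hourly is a hypothetical baseline price; only the 5x ratio matters.

def cost_per_token(ms_per_token, hourly_price, utilization):
    """Dollars per token for an instance kept at the given utilization (0-1)."""
    tokens_per_hour = (3600 * 1000 / ms_per_token) * utilization
    return hourly_price / tokens_per_hour

cpu_hourly = 1.0              # hypothetical $/hr baseline
gpu_hourly = 5.0 * cpu_hourly

# At full utilization the GPU is 4x faster but 5x pricier, so it is
# 5/4 = 1.25x more expensive per token -- utilization decides the winner.
gpu = cost_per_token(5, gpu_hourly, utilization=1.0)
cpu = cost_per_token(20, cpu_hourly, utilization=1.0)

# Workload fit at 200 tokens/sec: 200 * 5ms = 1.0 fully-busy GPU,
# versus 200 * 20ms = 4.0 fully-busy CPU instances.
gpu_instances_needed = 200 * 0.005
cpu_instances_needed = 200 * 0.020
```

The takeaway the calculation surfaces: at equal utilization the CPU is slightly cheaper per token, but if the GPU can be batched to high utilization while CPUs sit partly idle (or if the latency SLO rules out 20ms/token), the GPU wins.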
Hard · Technical
58 practiced
During an A/B experiment, variant B shows 2x p99 latency compared to control. Outline a thorough root-cause analysis plan spanning code, model, infrastructure, and data. Include steps to collect evidence (traces, logs, flame graphs), run targeted canary rollbacks or shadowing, compare request characteristics between variants, and what instrumentation you'd add to isolate the issue.
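
One useful first piece of evidence for this plan is whether the regression is tail-only or uniform across the distribution. A hedged sketch (the latency samples below are hypothetical stand-ins for real trace exports):

```python
# Sketch: compare percentiles of control vs variant B from trace samples.
# A matching p50 with a doubled p99 points at queuing, lock contention, or a
# slow path hit by a subset of requests; a uniform shift points at extra
# per-request work. Sample latencies (ms) are hypothetical.

def percentile(samples, p):
    s = sorted(samples)
    idx = min(len(s) - 1, int(round(p / 100 * (len(s) - 1))))
    return s[idx]

control   = [10, 11, 12, 10, 13, 11, 12, 10, 11, 40]
variant_b = [10, 12, 11, 11, 13, 12, 10, 11, 12, 85]

report = {p: (percentile(control, p), percentile(variant_b, p))
          for p in (50, 99)}
```

In this toy data the medians match while variant B's p99 roughly doubles, which would steer the investigation toward tail causes (GC pauses, cold caches, contention) rather than code on the hot path.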
Hard · Technical
48 practiced
Implement (in Python) a function that estimates cost-per-request for inference given: model_flops_per_inference, instance_flops (FLOPS per second), instance_hourly_price, expected_gpu_utilization (0-1), network_bytes_per_request, egress_price_per_gb, and target_requests_per_second. The function should return dollars per request and recommended instance count to meet the target_requests_per_second at a utilization cap (e.g., 70%). Show your calculations and assumptions in comments.
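
A sketch of one acceptable answer, under simplifying assumptions stated in the docstring (compute is FLOP-bound, egress is the only network cost, and throughput scales linearly with utilization):

```python
import math

def cost_per_request(model_flops_per_inference, instance_flops,
                     instance_hourly_price, expected_gpu_utilization,
                     network_bytes_per_request, egress_price_per_gb,
                     target_requests_per_second, utilization_cap=0.70):
    """Sketch under simple assumptions: compute is FLOP-bound, egress is the
    only network cost, and utilization scales throughput linearly.
    Returns (dollars_per_request, recommended_instance_count)."""
    # Effective throughput of one instance (requests/sec) at the expected
    # utilization: usable FLOPs per second divided by FLOPs per inference.
    rps_per_instance = (instance_flops * expected_gpu_utilization
                        / model_flops_per_inference)
    # Compute cost: hourly price amortized over requests actually served.
    compute_cost = instance_hourly_price / (rps_per_instance * 3600)
    # Network cost: bytes -> GB at the egress price.
    network_cost = (network_bytes_per_request / 1e9) * egress_price_per_gb
    # Instance count sized so each instance stays at or below the cap.
    rps_at_cap = instance_flops * utilization_cap / model_flops_per_inference
    instances = math.ceil(target_requests_per_second / rps_at_cap)
    return compute_cost + network_cost, instances
```

For example, a 1 TFLOP model on a 100 TFLOPS instance at $2/hr, 50% expected utilization, 1 MB egress at $0.09/GB, and a 100 rps target yields roughly $1.0e-4 per request and 2 recommended instances at the 70% cap.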
Easy · Technical
58 practiced
What tools and techniques would you use to profile an end-to-end ML inference request in production? Cover client-side timing, distributed tracing (e.g., OpenTelemetry), server CPU/GPU profiling (cProfile, perf, nvidia-smi, Nsight), and infrastructure metrics. Describe a basic low-overhead workflow for capturing and correlating data across stages (client, network, preprocess, model, postprocess).
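
The per-stage timing step of that workflow can be sketched with a minimal context-manager timer; real deployments would emit these as OpenTelemetry spans, and the `time.sleep` calls here are hypothetical stand-ins for real work:

```python
# Sketch: low-overhead per-stage timing for one inference request.
# Each stage records its wall-clock duration into a dict keyed by name;
# in production these would become trace spans correlated by request ID.
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def stage(name):
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = time.perf_counter() - start

with stage("preprocess"):
    time.sleep(0.01)    # stand-in for tokenization / feature extraction
with stage("model"):
    time.sleep(0.02)    # stand-in for the forward pass
with stage("postprocess"):
    time.sleep(0.005)   # stand-in for decoding / serialization
```

Summing the stage durations and comparing against client-observed latency then isolates how much time is spent in the network and in queuing rather than in the stages themselves.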
Easy · Technical
62 practiced
Describe the difference between throughput and latency in ML inference systems. Give concrete examples where optimizing for throughput (e.g., large batching) harms latency and where optimizing for low latency harms throughput. Which metrics (p50/p95/p99, QPS, concurrency) matter for interactive user-facing services versus batch scoring jobs?
