InterviewStack.io

Performance Engineering and Cost Optimization Questions

Engineering practices and trade-offs for meeting performance objectives while controlling operational cost. Topics include setting latency and throughput targets and latency budgets; benchmarking, profiling, and tuning across application, database, and infrastructure layers; memory, compute, serialization, and batching optimizations; asynchronous processing and workload shaping; capacity estimation and right-sizing for compute and storage to reduce cost; understanding cost drivers in cloud environments, including network egress and storage tiering; trade-offs between real-time and batch processing; and monitoring to detect and prevent performance regressions. Candidates should describe measurement-driven approaches to optimization and be able to justify trade-offs between cost, complexity, and user experience.

Medium, Technical
47 practiced
Design a cost-aware caching strategy for an embeddings retrieval service to reduce network egress and storage I/O costs. Explain how you would select TTLs, cache size, cache key strategy (per-user vs global), and eviction policy to balance hit-rate, staleness, and cost under a fixed memory budget.
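
One way to ground an answer is a small prototype of the cache itself. The following is a minimal sketch, assuming an in-process LRU cache with a per-entry TTL and a fixed byte budget; the class name `TTLLRUCache`, the helper `global_key`, and all parameters are hypothetical and not part of any particular library. LRU is one possible eviction policy among several (LFU or size-aware policies may fit better depending on the access pattern).

```python
import time
from collections import OrderedDict


class TTLLRUCache:
    """Illustrative LRU cache with a per-entry TTL and a fixed byte budget."""

    def __init__(self, max_bytes, ttl_seconds, clock=time.monotonic):
        self.max_bytes = max_bytes
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested deterministically
        self.used_bytes = 0
        self.hits = 0
        self.misses = 0
        self._entries = OrderedDict()  # key -> (value, size_bytes, expires_at)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            self.misses += 1
            return None
        value, size_bytes, expires_at = entry
        if self.clock() >= expires_at:
            # Entry is stale: drop it and count the lookup as a miss.
            del self._entries[key]
            self.used_bytes -= size_bytes
            self.misses += 1
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        self.hits += 1
        return value

    def put(self, key, value, size_bytes):
        if size_bytes > self.max_bytes:
            return False  # would never fit; do not flush the cache for it
        old = self._entries.pop(key, None)
        if old is not None:
            self.used_bytes -= old[1]
        # Evict least recently used entries until the new one fits.
        while self.used_bytes + size_bytes > self.max_bytes:
            _, (_, evicted_size, _) = self._entries.popitem(last=False)
            self.used_bytes -= evicted_size
        expires_at = self.clock() + self.ttl_seconds
        self._entries[key] = (value, size_bytes, expires_at)
        self.used_bytes += size_bytes
        return True


def global_key(model_version, content_hash):
    """Hypothetical global (not per-user) key: identical content shares an entry."""
    return f"{model_version}:{content_hash}"
```

Including the model version in the key means a model upgrade naturally invalidates old embeddings. The `hits` and `misses` counters give the hit rate needed to compare TTL and size choices against the egress and I/O cost of a miss.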
Hard, Technical
49 practiced
Explain data layout and transfer optimizations for large batched tensor transfers between host and GPU to minimize PCIe overhead and maximize throughput. Discuss pinned (page-locked) memory, asynchronous cudaMemcpy, contiguous tensor layouts, tensor strides, and when zero-copy or GPUDirect might be applicable.
Hard, Technical
58 practiced
Explain how to calculate and optimize FLOPS-per-dollar for candidate hardware (e.g., NVIDIA A10 vs A100 vs AWS Inferentia vs CPU) for a particular model and workload. Describe the benchmarking steps, what to measure (latency, batch throughput, power draw), and decision criteria beyond raw FLOPS (e.g., multi-tenancy, software stack maturity).
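
The core arithmetic can be sketched in a few lines. This is a minimal illustration, assuming throughput has already been measured for the specific model and batch size on each candidate; the `Candidate` type, the function names, and the example figures are hypothetical and do not reflect real hardware performance or pricing.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    measured_throughput_per_s: float  # inferences/s from your own benchmark
    hourly_price_usd: float           # on-demand or amortized hourly cost


def inferences_per_dollar(c):
    """Measured inferences completed per dollar spent."""
    return c.measured_throughput_per_s * 3600 / c.hourly_price_usd


def effective_flops_per_dollar(c, flops_per_inference):
    """Useful FLOPs delivered per dollar, based on measured (not peak) throughput."""
    return inferences_per_dollar(c) * flops_per_inference


def cost_per_million_inferences(c):
    return 1e6 / inferences_per_dollar(c)


if __name__ == "__main__":
    # Placeholder numbers for illustration only; substitute measured values.
    candidates = [
        Candidate("hw-a", measured_throughput_per_s=500, hourly_price_usd=2.0),
        Candidate("hw-b", measured_throughput_per_s=1200, hourly_price_usd=6.0),
    ]
    for c in sorted(candidates, key=cost_per_million_inferences):
        print(f"{c.name}: ${cost_per_million_inferences(c):.2f} per 1M inferences")
```

Using measured throughput rather than vendor peak FLOPS keeps the comparison tied to the actual workload; latency targets, power draw, and software stack maturity still need to be weighed separately.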
Easy, Technical
52 practiced
Explain model quantization at a high level. Describe the difference between post-training quantization and quantization-aware training, and summarize how quantization typically affects latency, memory footprint, and accuracy. Provide example scenarios where quantization is a good trade-off and cautionary cases where it can hurt user experience.
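
To make the mechanics concrete, here is a minimal sketch of affine (asymmetric) int8 quantization of a list of float weights in plain Python. It illustrates the idea behind post-training quantization only; the function names are hypothetical, and real frameworks add per-channel scales, calibration data, and fused integer kernels that are not shown here.

```python
def quantize_int8(values):
    """Map floats to int8 codes using one scale and zero point for the whole list."""
    lo = min(min(values), 0.0)  # include 0.0 in the range so zero stays representable
    hi = max(max(values), 0.0)
    scale = (hi - lo) / 255.0
    if scale == 0.0:
        scale = 1.0  # all-zero input; any positive scale works
    zero_point = round(-128 - lo / scale)
    codes = []
    for v in values:
        q = round(v / scale) + zero_point
        codes.append(max(-128, min(127, q)))  # clamp to the int8 range
    return codes, scale, zero_point


def dequantize(codes, scale, zero_point):
    """Recover approximate floats from int8 codes."""
    return [(q - zero_point) * scale for q in codes]


if __name__ == "__main__":
    weights = [-1.0, -0.5, 0.0, 0.25, 1.0]
    codes, scale, zero_point = quantize_int8(weights)
    restored = dequantize(codes, scale, zero_point)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"scale={scale:.5f} zero_point={zero_point} max_error={worst:.5f}")
```

Each weight shrinks from 4 bytes (float32) to 1 byte, and the reconstruction error is bounded by a small multiple of `scale`. A wide value range with a few outliers inflates `scale`, which is one reason accuracy can degrade on some layers.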
Easy, Technical
43 practiced
Describe a practical benchmarking plan to measure inference performance for an image-classification model before deployment. Include what metrics to collect (latency percentiles, throughput, GPU/CPU/memory utilization, warm vs cold starts), how to design the micro-bench (single vs multi-threaded clients), how many runs to do, and which open-source tools you would use (e.g., torch.profiler, nsys, wrk, hey). Explain how to reduce measurement noise and how you'd report p99 tail behavior.
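
A micro-benchmark harness along these lines can anchor the plan. This is a minimal single-threaded sketch using only the Python standard library; `benchmark`, `percentile`, and the default warmup and run counts are hypothetical choices, and `fn` stands in for whatever callable runs one inference. For GPU inference, `fn` would also need to synchronize the device before returning so that the timer covers the full execution.

```python
import math
import statistics
import time


def percentile(sorted_samples, pct):
    """Nearest-rank percentile of an already sorted list (pct in 1..100)."""
    rank = max(1, math.ceil(pct * len(sorted_samples) / 100))
    return sorted_samples[rank - 1]


def benchmark(fn, warmup=20, runs=200):
    """Time fn() after a warmup phase and summarize latency in milliseconds."""
    for _ in range(warmup):
        fn()  # warm caches, JIT, and lazy initialization before measuring

    samples_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - start) * 1e3)

    samples_ms.sort()
    return {
        "p50_ms": percentile(samples_ms, 50),
        "p95_ms": percentile(samples_ms, 95),
        "p99_ms": percentile(samples_ms, 99),
        "mean_ms": statistics.fmean(samples_ms),
        "max_ms": samples_ms[-1],
        "throughput_per_s": runs / (sum(samples_ms) / 1e3),
    }


if __name__ == "__main__":
    # Stand-in workload; replace with a call that runs one model inference.
    report = benchmark(lambda: sum(range(10_000)))
    for name, value in report.items():
        print(f"{name}: {value:.3f}")
```

With only 200 runs, p99 rests on the two slowest samples, so tail estimates need far more runs, repeated across several independent sessions, before they are worth reporting.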
