InterviewStack.io

Performance Engineering and Cost Optimization Questions

Engineering practices and trade-offs for meeting performance objectives while controlling operational cost. Topics include setting latency and throughput targets and latency budgets; benchmarking, profiling, and tuning across the application, database, and infrastructure layers; memory, compute, serialization, and batching optimizations; asynchronous processing and workload shaping; capacity estimation and right-sizing of compute and storage to reduce cost; understanding cost drivers in cloud environments, including network egress and storage tiering; trade-offs between real-time and batch processing; and monitoring to detect and prevent performance regressions. Candidates should describe measurement-driven approaches to optimization and be able to justify trade-offs between cost, complexity, and user experience.

Hard | Technical
46 practiced
A framework upgrade increased throughput but introduced a 10% regression in p95 latency. As the AI Engineer responsible for the service, how would you quantify cost versus benefit, present a recommendation to stakeholders, and structure an SLO-aware canary rollout and rollback plan?
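As a starting point for the SLO-aware canary part of this question, the gate can be sketched as a comparison of canary and baseline p95 latency. This is a minimal sketch, not a definitive implementation: the function names, the three outcomes, and the 10% regression allowance used as the default are assumptions chosen for illustration.

```python
def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    ordered = sorted(samples)
    # Integer form of ceil(0.95 * n), which avoids floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def canary_decision(baseline_ms, canary_ms, slo_ms, max_regression=0.10):
    """Return 'rollback', 'hold', or 'promote' for a canary stage.

    Hypothetical policy:
      - rollback if the canary p95 violates the SLO outright,
      - hold (stop widening traffic) if it regresses beyond the allowance,
      - promote otherwise.
    """
    base, cand = p95(baseline_ms), p95(canary_ms)
    if cand > slo_ms:
        return "rollback"
    if cand > base * (1 + max_regression):
        return "hold"
    return "promote"
```

In practice the same check would run at each traffic step (for example 1%, 10%, 50%), with the throughput gain converted into saved instance-hours so that stakeholders can weigh it against the remaining latency headroom.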
Medium | System Design
57 practiced
Design a scalable model-serving architecture that serves 1,000 requests per second with a 50 ms p95 inference SLO under a strict monthly cost cap. Describe the high-level components (load balancer, autoscaling groups, model servers, feature cache, batching layer), where to apply batching, how to use spot instances or GPU multiplexing, and how you would verify that both the cost cap and the SLO are being met.
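The capacity side of this question can be roughed out with back-of-the-envelope arithmetic before any design work. The sketch below assumes a measured per-replica throughput, a target utilization of 70%, and an hourly price; all three are hypothetical inputs that would come from benchmarking and the provider's price list, not from the question itself.

```python
import math


def in_flight(rps, latency_s):
    """Little's law: average concurrent requests = arrival rate * time in system."""
    return rps * latency_s


def replicas_needed(rps, per_replica_rps, target_utilization=0.7):
    """Replicas required to serve `rps` while keeping headroom per replica."""
    return math.ceil(rps / (per_replica_rps * target_utilization))


def monthly_cost(replicas, hourly_price_usd, hours_per_month=730):
    """Steady-state monthly cost, ignoring autoscaling and spot discounts."""
    return replicas * hourly_price_usd * hours_per_month
```

For example, `in_flight(1000, 0.05)` gives about 50 concurrent requests when the 50 ms budget is used as an upper bound on time in the system, and `replicas_needed(1000, 200)` gives 8 replicas under the assumed 200 requests per second per replica. Comparing `monthly_cost(...)` against the cap shows early whether batching or cheaper capacity is needed.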
Hard | System Design
60 practiced
Create a production-safe plan to implement admission control and graceful degradation for overloaded model-serving endpoints, so that premium customers keep their SLO while best-effort customers receive degraded results. Include request prioritization, throttling rules, serving degraded model variants or cached responses, and operational runbook steps.
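The prioritization and throttling rules asked for here can be sketched as a load-based admission policy. This is a minimal illustration under stated assumptions: the tier names, the `degrade_at` and `shed_at` thresholds, and the use of in-flight request count as the load signal are all hypothetical choices, not a prescribed design.

```python
from dataclasses import dataclass


@dataclass
class AdmissionPolicy:
    """Decide how to serve a request based on tier and current load."""
    max_in_flight: int        # capacity at which even premium traffic is rejected
    degrade_at: float = 0.7   # best-effort gets a degraded variant above this load
    shed_at: float = 0.9      # best-effort is rejected above this load

    def decide(self, tier, in_flight):
        load = in_flight / self.max_in_flight
        if tier == "premium":
            # Premium traffic keeps the full model until capacity is exhausted.
            return "full" if load < 1.0 else "reject"
        if load >= self.shed_at:
            return "reject"
        if load >= self.degrade_at:
            # Caller would route to a smaller model or a cached response.
            return "degraded"
        return "full"
```

With `AdmissionPolicy(max_in_flight=100)`, a best-effort request at 75 in-flight requests is served in degraded mode, while a premium request at 95 is still served by the full model. Keeping the thresholds as plain configuration makes them easy to adjust from a runbook during an incident.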
Hard | Technical
60 practiced
You suspect GPU memory fragmentation is causing intermittent OOMs even though aggregate free memory looks sufficient. Describe the diagnostic steps (profilers, allocator stats, a reproducer), short-term mitigations (torch.cuda.empty_cache, pre-allocation, a restart strategy), and long-term fixes (tuning pooling allocators, deterministic memory planning).
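One cheap first signal for the allocator-stats step is the share of memory the caching allocator has reserved but is not currently handing out. The sketch below is a rough indicator only, and the helper name is hypothetical; in PyTorch the two inputs could be taken from `torch.cuda.memory_reserved()` and `torch.cuda.memory_allocated()`. A high value shortly before an OOM is consistent with fragmentation but does not prove it.

```python
def cached_unused_fraction(reserved_bytes, allocated_bytes):
    """Fraction of reserved GPU memory that is cached but not in live tensors.

    A large fraction alongside allocation failures suggests the free memory
    is split into blocks too small for the failing request.
    """
    if reserved_bytes <= 0:
        return 0.0
    return (reserved_bytes - allocated_bytes) / reserved_bytes
```

Logging this value per step, together with the size of the allocation that failed, helps separate fragmentation from a plain capacity shortfall.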
Hard | Technical
58 practiced
Explain how to calculate and optimize FLOPS-per-dollar for candidate hardware (e.g., NVIDIA A10 vs A100 vs AWS Inferentia vs CPU) for a particular model and workload. Describe the benchmarking steps, what to measure (latency, batch throughput, power draw), and decision criteria beyond raw FLOPS (e.g., multi-tenancy, software stack maturity).
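Because peak FLOPS rarely translates directly into served requests, a practical variant of this calculation ranks hardware by measured inferences per dollar among the candidates that meet the latency SLO. The sketch below assumes each candidate is described by benchmarked throughput, p95 latency, and hourly price; the dictionary keys are hypothetical, and the numbers in the usage note are placeholders, not real benchmark results.

```python
def inferences_per_dollar(throughput_rps, hourly_price_usd):
    """Requests served per dollar at the measured sustained throughput."""
    return throughput_rps * 3600 / hourly_price_usd


def rank_hardware(candidates, slo_ms):
    """Drop candidates that miss the p95 SLO, then sort by inferences per dollar."""
    eligible = [c for c in candidates if c["p95_ms"] <= slo_ms]
    return sorted(
        eligible,
        key=lambda c: inferences_per_dollar(c["rps"], c["usd_per_hour"]),
        reverse=True,
    )
```

For instance, with placeholder entries `{"name": "hw_a", "rps": 400, "usd_per_hour": 1.0, "p95_ms": 30}` and `{"name": "hw_b", "rps": 1200, "usd_per_hour": 4.0, "p95_ms": 20}`, the cheaper device ranks first despite lower raw throughput. Criteria such as multi-tenancy and software stack maturity would then be weighed on top of this ranking.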
