InterviewStack.io

Performance Engineering and Cost Optimization Questions

Engineering practices and trade-offs for meeting performance objectives while controlling operational cost. Topics include setting latency and throughput targets and latency budgets; benchmarking, profiling, and tuning across the application, database, and infrastructure layers; memory, compute, serialization, and batching optimizations; asynchronous processing and workload shaping; capacity estimation and right-sizing for compute and storage to reduce cost; understanding cost drivers in cloud environments, including network egress and storage tiering; trade-offs between real-time and batch processing; and monitoring to detect and prevent performance regressions. Candidates should describe measurement-driven approaches to optimization and be able to justify trade-offs between cost, complexity, and user experience.

Medium · System Design
93 practiced
Design a cache hierarchy for an online feature store to serve 10k req/s with p95 fetch latency under 20ms. Describe where you'd place caches (client, edge, in-memory store), cache key strategies, TTLs, consistency/invalidation mechanisms, and how you'd measure and tune cache hit rate to meet cost targets.
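One way to ground the hit-rate discussion is a minimal sketch of a single in-process cache tier with a TTL, explicit invalidation, and hit/miss counters. Everything here is an illustrative assumption rather than part of the question: the `TTLCache` class, the `(entity_id, feature_name)` key shape, and the `loader` callback that stands in for the next tier (for example a remote in-memory store).

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TTLCache:
    """Hypothetical single cache tier: TTL expiry plus hit/miss counters."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._store: Dict[Hashable, Tuple[float, Any]] = {}
        self.hits = 0
        self.misses = 0

    def get(self, key: Hashable, loader: Callable[[Hashable], Any]) -> Any:
        """Return the cached value, or call `loader` (the next tier) on a miss."""
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            self.hits += 1
            return entry[1]
        self.misses += 1
        value = loader(key)
        self._store[key] = (now + self.ttl, value)
        return value

    def invalidate(self, key: Hashable) -> None:
        """Explicit invalidation, e.g. triggered by a feature-update event."""
        self._store.pop(key, None)

    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


# Example key strategy (an assumption): one entry per entity and feature name.
cache = TTLCache(ttl_seconds=30.0)
value = cache.get(("user:42", "avg_basket_7d"), lambda key: 12.5)
```

In a tuning loop, `hit_rate()` would be exported as a metric per tier and compared against the backend load and cost it displaces; a production tier would also need a size bound and eviction policy, which this sketch omits.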
Medium · Technical
49 practiced
Tail latency (p99.9) is causing SLA violations even though p95 is fine. Describe techniques to diagnose and mitigate tail latency in ML serving, covering queuing/backpressure, request hedging, prioritized scheduling, circuit breakers, and resource isolation, and recommend which to try first, with justification.
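Of the techniques listed, request hedging is the easiest to show in a few lines. The following is a minimal sketch using `asyncio`, assuming the request is idempotent and that sending a duplicate is acceptable; `hedged_call`, `make_request`, and `hedge_after_s` are hypothetical names introduced for illustration.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def hedged_call(make_request: Callable[[], Awaitable[T]],
                      hedge_after_s: float) -> T:
    """Send one request; if it is still pending after `hedge_after_s`,
    send a duplicate and return whichever attempt finishes first."""
    primary = asyncio.ensure_future(make_request())
    done, _ = await asyncio.wait({primary}, timeout=hedge_after_s)
    if done:
        return primary.result()

    # Primary is slow: issue the hedge (assumes the request is idempotent).
    backup = asyncio.ensure_future(make_request())
    done, pending = await asyncio.wait(
        {primary, backup}, return_when=asyncio.FIRST_COMPLETED
    )
    # Cancel the loser so it stops consuming client-side resources.
    for task in pending:
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)
    # If the first attempt to finish failed, its exception propagates here.
    return next(iter(done)).result()
```

A common choice is to set `hedge_after_s` near the observed p95 latency, so that only roughly the slowest 5% of requests are duplicated; that threshold is a tuning assumption to validate against the extra backend load.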
Medium · Technical
61 practiced
When should you use asynchronous (background) processing for user-facing ML tasks versus synchronous inference? Propose an architecture where synchronous inference is used for critical decisions but long-running features are computed asynchronously and blended into the final output later. Include implications for UX, consistency, and monitoring.
Medium · Technical
53 practiced
Describe memory optimization techniques to reduce peak memory usage during batched inference of deep learning models. Discuss zero-copy I/O, memory pooling, tensor memory formats, garbage collection tuning, and model sharding. For each, give one scenario where it's most effective.
Hard · Technical
42 practiced
Describe a measurement-driven approach to optimize inference serving cost: what experiments you'd run (e.g., batch sizing, quantization, spot instances), what instrumentation to add, how to measure success, and acceptance criteria to roll changes into production. Emphasize safe rollback and business KPIs.
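As a rough illustration of what "measure success" and "acceptance criteria" could look like, the sketch below normalizes each experiment to cost per million requests and gates a candidate on both a latency budget and a minimum saving. The `Trial` fields, the 10% default saving threshold, and all numbers in the example are assumptions chosen for illustration, not measured results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    """Hypothetical summary of one serving experiment."""
    name: str
    hourly_cost_usd: float   # instance price for the tested configuration
    throughput_rps: float    # sustained requests per second at target load
    p95_latency_ms: float


def cost_per_million(trial: Trial) -> float:
    """Serving cost in USD per one million requests."""
    requests_per_hour = trial.throughput_rps * 3600
    return trial.hourly_cost_usd / requests_per_hour * 1_000_000


def accept(candidate: Trial, baseline: Trial,
           p95_budget_ms: float, min_saving: float = 0.10) -> bool:
    """Accept only if latency stays within budget and cost drops enough."""
    if candidate.p95_latency_ms > p95_budget_ms:
        return False
    saving = 1 - cost_per_million(candidate) / cost_per_million(baseline)
    return saving >= min_saving


# Illustrative numbers only.
baseline = Trial("fp32, batch=1", hourly_cost_usd=3.6,
                 throughput_rps=100, p95_latency_ms=40)
candidate = Trial("int8, batch=8", hourly_cost_usd=3.6,
                  throughput_rps=200, p95_latency_ms=55)
print(accept(candidate, baseline, p95_budget_ms=60))
```

A real gate would add model-quality and business KPI checks alongside latency and cost, and would be evaluated on canary traffic with an automatic rollback path.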
