Performance Engineering and Cost Optimization Questions

Engineering practices and trade-offs for meeting performance objectives while controlling operational cost. Topics include setting latency and throughput targets and latency budgets; benchmarking, profiling, and tuning across the application, database, and infrastructure layers; memory, compute, serialization, and batching optimizations; asynchronous processing and workload shaping; capacity estimation and right-sizing of compute and storage to reduce cost; cloud cost drivers, including network egress and storage tiering; trade-offs between real-time and batch processing; and monitoring to detect and prevent performance regressions. Candidates should describe measurement-driven approaches to optimization and be able to justify trade-offs among cost, complexity, and user experience.

Hard | Technical

Describe a measurement-driven approach to optimize inference serving cost: what experiments you'd run (e.g., batch sizing, quantization, spot instances), what instrumentation to add, how to measure success, and acceptance criteria to roll changes into production. Emphasize safe rollback and business KPIs.
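
A minimal sketch of the acceptance gate such an experiment might end in. Everything here is an illustrative assumption: the `VariantMetrics` fields, the `should_promote` helper, the choice of business KPI (for example click-through rate), and every threshold. The idea is that a cost change is promoted only if it saves enough per 1k requests without breaching latency, error-rate, or business-KPI guardrails; otherwise traffic is rolled back to the baseline.

```python
from dataclasses import dataclass


@dataclass
class VariantMetrics:
    """Aggregated metrics for one serving variant over the experiment window (hypothetical fields)."""
    cost_usd: float        # serving cost attributed to this variant
    requests: int          # requests served in the window
    p99_latency_ms: float  # observed p99 latency
    error_rate: float      # fraction of failed requests
    kpi: float             # business KPI, e.g. click-through rate (assumed)

    @property
    def cost_per_1k(self) -> float:
        return 1000.0 * self.cost_usd / self.requests


def should_promote(baseline: VariantMetrics, candidate: VariantMetrics,
                   min_cost_saving: float = 0.10,
                   max_p99_regression: float = 0.05,
                   max_error_rate: float = 0.001,
                   max_kpi_drop: float = 0.005) -> tuple:
    """Return (promote, reasons). All thresholds are illustrative defaults."""
    reasons = []
    saving = 1.0 - candidate.cost_per_1k / baseline.cost_per_1k
    if saving < min_cost_saving:
        reasons.append(f"cost saving {saving:.1%} below {min_cost_saving:.0%}")
    if candidate.p99_latency_ms > baseline.p99_latency_ms * (1 + max_p99_regression):
        reasons.append("p99 latency regression exceeds budget")
    if candidate.error_rate > max_error_rate:
        reasons.append("error rate above ceiling")
    if candidate.kpi < baseline.kpi * (1 - max_kpi_drop):
        reasons.append("business KPI dropped beyond tolerance")
    return (not reasons, reasons)


baseline = VariantMetrics(cost_usd=1200.0, requests=10_000_000,
                          p99_latency_ms=80.0, error_rate=0.0004, kpi=0.052)
candidate = VariantMetrics(cost_usd=900.0, requests=10_000_000,
                           p99_latency_ms=82.0, error_rate=0.0005, kpi=0.0519)
promote, reasons = should_promote(baseline, candidate)
if not promote:
    print("roll back:", reasons)  # e.g. shift traffic back to the baseline deployment
```

A real gate would also require statistical confidence (for example, confidence intervals on the KPI difference) rather than comparing point estimates as this sketch does.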

Medium | Technical

Compare serialization formats (JSON, Protocol Buffers, Avro, FlatBuffers, Apache Arrow) for transmitting model inputs and outputs between microservices in a low-latency ML inference pipeline. For each format identify serialization/deserialization cost, schema support, zero-copy capability, and suitability for vectorized data.
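
To make the serialization-cost and zero-copy points concrete, here is a small comparison using only the Python standard library. It is a stand-in, not a benchmark of Protobuf, Avro, FlatBuffers, or Arrow (those need their own libraries): JSON text encoding of a float32 feature vector versus a raw binary buffer that is read back through a `memoryview` without copying. The function name `compare_encodings` is hypothetical, and timings will vary by machine.

```python
import array
import json
import time


def compare_encodings(n: int = 100_000) -> dict:
    """Compare JSON text vs. a raw float32 buffer for a vector of n features."""
    values = array.array("f", (i * 0.001 for i in range(n)))  # float32 vector

    # Text encoding: every float is printed as decimal digits and parsed back.
    t0 = time.perf_counter()
    json_payload = json.dumps(values.tolist()).encode("utf-8")
    decoded = json.loads(json_payload)
    json_seconds = time.perf_counter() - t0

    # Binary encoding: 4 bytes per value. Decoding reinterprets the same
    # bytes as floats through a memoryview instead of copying them.
    t0 = time.perf_counter()
    raw_payload = values.tobytes()
    view = memoryview(raw_payload).cast("f")
    raw_seconds = time.perf_counter() - t0

    assert len(decoded) == len(view) == n
    return {
        "json_bytes": len(json_payload),
        "raw_bytes": len(raw_payload),
        "json_seconds": json_seconds,
        "raw_seconds": raw_seconds,
    }


if __name__ == "__main__":
    for name, value in compare_encodings().items():
        print(f"{name}: {value}")
```

Formats that advertise zero-copy reads, such as FlatBuffers and Arrow, rely on a similar principle: the receiver interprets the received buffer in place instead of parsing it into new objects.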

Medium | System Design

Design a cache hierarchy for an online feature store to serve 10k req/s with p95 fetch latency under 20 ms. Describe where you'd place caches (client, edge, in-memory store), cache key strategies, TTLs, consistency/invalidation mechanisms, and how you'd measure and tune cache hit rate to meet cost targets.
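
One way to sketch the in-process (L1) tier and the keying strategy. All names are hypothetical: `TTLCache` is a simple LRU with a per-entry TTL and hit/miss counters, the shared tier (for example Redis or the online store itself) is represented by a plain dict, and `make_feature_key` uses a versioned key so that redeploying a feature pipeline invalidates old entries by changing the key rather than deleting them. Exporting `hit_rate` per tier is what lets you tune capacity and TTLs against the latency and cost targets.

```python
import time
from collections import OrderedDict
from typing import Callable, Optional


class TTLCache:
    """In-process LRU cache with a per-entry TTL and hit/miss counters."""

    def __init__(self, capacity: int, ttl_s: float,
                 clock: Callable[[], float] = time.monotonic):
        self.capacity = capacity
        self.ttl_s = ttl_s
        self.clock = clock
        self._data: OrderedDict = OrderedDict()  # key -> (expires_at, value)
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> Optional[object]:
        entry = self._data.get(key)
        if entry is not None and entry[0] > self.clock():
            self._data.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return entry[1]
        if entry is not None:
            del self._data[key]  # drop the expired entry
        self.misses += 1
        return None

    def put(self, key: str, value: object) -> None:
        self._data[key] = (self.clock() + self.ttl_s, value)
        self._data.move_to_end(key)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict the least recently used entry

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def make_feature_key(feature_view: str, version: int, entity_type: str, entity_id: str) -> str:
    # Bumping `version` on redeploy makes old entries unreachable;
    # they then age out via TTL or LRU eviction.
    return f"{feature_view}:v{version}:{entity_type}:{entity_id}"


def fetch_features(key: str, l1: TTLCache, shared_store: dict) -> Optional[object]:
    value = l1.get(key)
    if value is None:
        value = shared_store.get(key)  # stand-in for a network round trip
        if value is not None:
            l1.put(key, value)
    return value


now = [0.0]  # fake clock for a deterministic demo
l1 = TTLCache(capacity=1000, ttl_s=5.0, clock=lambda: now[0])
key = make_feature_key("user_stats", 3, "user", "42")
store = {key: {"ctr_7d": 0.031}}
fetch_features(key, l1, store)  # miss, fills L1
fetch_features(key, l1, store)  # hit
now[0] = 10.0                   # past the 5 s TTL
fetch_features(key, l1, store)  # expired, so a miss again
print(key, l1.hits, l1.misses)  # user_stats:v3:user:42 1 2
```

Lookups for absent keys are not cached here; in production you might add negative caching with a short TTL to protect the backing store from repeated lookups of unknown entities.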

Medium | Technical

You receive a monthly bill breakdown for an ML service: compute 60%, storage 20%, network egress 15%, and managed services 5%. The business asks to cut monthly costs by 30% without changing model accuracy. Propose at least five concrete changes (with expected qualitative impact) that could achieve this and explain potential risks.
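
Because savings are weighted by each category's share of the bill, it helps to check a proposal's arithmetic before committing to it. A tiny sketch, where `overall_saving` and the per-category reduction targets are hypothetical planning numbers, not measured results:

```python
def overall_saving(shares: dict, reductions: dict) -> float:
    """Total fractional bill reduction = sum over categories of share * reduction."""
    assert abs(sum(shares.values()) - 1.0) < 1e-9, "shares must sum to 1"
    return sum(share * reductions.get(cat, 0.0) for cat, share in shares.items())


# Bill breakdown from the question.
bill_shares = {"compute": 0.60, "storage": 0.20, "network_egress": 0.15, "managed_services": 0.05}

# Hypothetical per-category targets (e.g. spot/right-sizing, storage tiering, compression/caching).
targets = {"compute": 0.35, "storage": 0.25, "network_egress": 0.30}

saving = overall_saving(bill_shares, targets)
print(f"overall saving: {saving:.1%}")  # overall saving: 30.5%
```

The weighting shows why compute has to carry most of the cut: eliminating egress entirely would save only 15 points, while compute alone would need a 50% reduction (0.30 / 0.60) to reach the target.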

Medium | Technical

Tail latency (p99.9) is causing SLA violations even though p95 is within target. Describe techniques to diagnose and mitigate tail latency in ML serving, covering queuing/backpressure, request hedging, prioritized scheduling, circuit breakers, and resource isolation, and recommend which to try first, with justification.
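
Request hedging is the easiest of these techniques to show in a few lines. A minimal asyncio sketch, assuming the inference call is an async function that is safe to issue twice (idempotent); `hedged_call` and the simulated `fake_inference` replica are hypothetical. Setting `hedge_after_s` near the observed p95 means only about 5% of requests send a backup, which bounds the extra load.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def hedged_call(call: Callable[[], Awaitable[T]], hedge_after_s: float) -> T:
    """Issue `call`; if it has not finished after `hedge_after_s`, issue one
    backup and return whichever finishes first, cancelling the other."""
    primary = asyncio.ensure_future(call())
    done, _ = await asyncio.wait({primary}, timeout=hedge_after_s)
    if done:
        return primary.result()
    backup = asyncio.ensure_future(call())
    done, pending = await asyncio.wait({primary, backup},
                                       return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()  # stop the straggler so it does not keep consuming capacity
    return done.pop().result()


async def demo() -> float:
    delays = iter([0.5, 0.02])  # first replica straggles, the backup is fast

    async def fake_inference() -> float:
        delay = next(delays)
        await asyncio.sleep(delay)
        return delay

    return await hedged_call(fake_inference, hedge_after_s=0.05)


print(asyncio.run(demo()))  # 0.02
```

Exceptions from the first task to finish are re-raised as-is; a production version would usually fall back to the other task on error and cap the hedge rate with a budget.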
