InterviewStack.io

Performance Engineering and Cost Optimization Questions

Engineering practices and trade-offs for meeting performance objectives while controlling operational cost. Topics include setting latency and throughput targets and latency budgets; benchmarking, profiling, and tuning across the application, database, and infrastructure layers; memory, compute, serialization, and batching optimizations; asynchronous processing and workload shaping; capacity estimation and right-sizing of compute and storage to reduce cost; understanding cost drivers in cloud environments, including network egress and storage tiering; trade-offs between real-time and batch processing; and monitoring to detect and prevent performance regressions. Candidates should describe measurement-driven approaches to optimization and be able to justify trade-offs between cost, complexity, and user experience.

Hard · System Design
Design an autoscaling policy for GPU-backed model-serving under highly bursty and heavy-tailed traffic. Include metrics (queue length, p95 latency, GPU utilization), cool-down periods, scale-up vs scale-down strategies, predictive scaling approaches, and how to incorporate spot instances without violating SLOs.
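One way a candidate might sketch the scale-up/scale-down and cool-down portion of this question is a reactive policy with asymmetric cool-downs: scale up fast on SLO pressure, scale down slowly on sustained idleness. The `GpuAutoscaler` class, thresholds, and multipliers below are hypothetical illustrations under assumed metrics, not a definitive answer (predictive scaling and spot handling would layer on top):

```python
import time
from dataclasses import dataclass

@dataclass
class Metrics:
    queue_length: int        # requests waiting per replica
    p95_latency_ms: float    # rolling-window p95 latency
    gpu_utilization: float   # 0.0 - 1.0

class GpuAutoscaler:
    """Reactive policy: scale up aggressively on SLO pressure,
    scale down conservatively with a much longer cool-down."""
    def __init__(self, min_replicas=2, max_replicas=64, p95_slo_ms=200.0,
                 scale_up_cooldown_s=60, scale_down_cooldown_s=600):
        self.min_replicas = min_replicas
        self.max_replicas = max_replicas
        self.p95_slo_ms = p95_slo_ms
        self.scale_up_cooldown_s = scale_up_cooldown_s
        self.scale_down_cooldown_s = scale_down_cooldown_s
        self.last_scale_up = 0.0
        self.last_scale_down = 0.0

    def desired_replicas(self, current: int, m: Metrics, now=None) -> int:
        now = time.monotonic() if now is None else now
        # Scale up when latency or queue pressure threatens the SLO:
        # multiplicative step so bursty traffic is caught quickly.
        if m.p95_latency_ms > 0.8 * self.p95_slo_ms or m.queue_length > 4:
            if now - self.last_scale_up >= self.scale_up_cooldown_s:
                self.last_scale_up = now
                return min(self.max_replicas,
                           max(current + 1, int(current * 1.5)))
        # Scale down one replica at a time, only after sustained low load.
        elif m.gpu_utilization < 0.3 and m.queue_length == 0:
            if now - self.last_scale_down >= self.scale_down_cooldown_s:
                self.last_scale_down = now
                return max(self.min_replicas, current - 1)
        return current
```

The asymmetry (multiplicative up, additive down, 10x longer down cool-down) is what prevents oscillation under heavy-tailed traffic; spot instances would be incorporated by keeping a floor of on-demand replicas at `min_replicas` so preemptions cannot break the SLO.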
Hard · Technical
Estimate the cost per inference and break down compute, storage, network, and control-plane costs for serving an LLM with 6B parameters at 200 qps. Assume average input+output tokens per request is 512 and provide a sample calculation using hypothetical instance pricing and bandwidth costs. Then propose three optimizations that would reduce cost per inference most effectively.
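A sample calculation for this question might look like the following. Every constant here is a hypothetical assumption for illustration (instance price, per-GPU throughput, egress rate, bytes per token), not a real cloud quote:

```python
import math

# All figures below are hypothetical assumptions for illustration only.
QPS = 200                      # sustained request rate
TOKENS_PER_REQ = 512           # average input + output tokens per request
GPU_PRICE_PER_HR = 2.50        # assumed on-demand price per GPU instance ($/hr)
REQS_PER_GPU_PER_S = 25        # assumed per-GPU throughput at this token length
EGRESS_PER_GB = 0.09           # assumed network egress price ($/GB)
BYTES_PER_TOKEN = 4            # rough text payload size per token
STORAGE_PER_HR = 0.05          # model weights + logs, amortized ($/hr)
CONTROL_PLANE_PER_HR = 0.10    # load balancing, metrics, orchestration ($/hr)

gpus = math.ceil(QPS / REQS_PER_GPU_PER_S)          # GPUs needed for 200 qps
compute_per_hr = gpus * GPU_PRICE_PER_HR            # dominant cost component
reqs_per_hr = QPS * 3600

egress_gb_per_hr = reqs_per_hr * TOKENS_PER_REQ * BYTES_PER_TOKEN / 1e9
network_per_hr = egress_gb_per_hr * EGRESS_PER_GB

total_per_hr = (compute_per_hr + network_per_hr
                + STORAGE_PER_HR + CONTROL_PLANE_PER_HR)
cost_per_inference = total_per_hr / reqs_per_hr
```

Under these assumptions compute is well over 95% of the total, so the three highest-leverage optimizations would all target GPU cost: quantization (fit the 6B model on cheaper hardware), continuous batching (raise per-GPU throughput), and cheaper capacity such as spot or reserved instances.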
Hard · System Design
Design a multi-region, low-latency model-serving system for a global user base with 100k requests per second and a 50ms p95 SLO for interactive features, while minimizing inter-region egress charges. Describe model distribution, feature-store placement, replication strategies, and how you would handle model updates and consistency across regions.
Medium · Technical
You observe weekly spikes in p99 latency during nightly batch jobs that share the same GPU nodes as online inference. Describe mitigation strategies including workload shaping, scheduling, node isolation, workload priorities, and admission control. Explain how you'd detect and prevent these interference issues proactively.
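The admission-control part of this question can be sketched as a token bucket that only refills while the online SLO has headroom, so batch work is starved whenever it starts interfering. The class, thresholds, and rates below are hypothetical illustrations under assumed metrics:

```python
import time

class BatchAdmissionController:
    """Token-bucket admission control for batch jobs sharing GPU nodes
    with online inference: the bucket refills only while online p99
    latency has headroom, so batch work is starved under interference."""
    def __init__(self, p99_slo_ms=100.0, rate=10.0, burst=20.0):
        self.p99_slo_ms = p99_slo_ms
        self.rate = rate          # batch tasks/second admitted when healthy
        self.burst = burst        # maximum stored tokens
        self.tokens = burst
        self.last = time.monotonic()

    def admit_batch(self, current_p99_ms: float, now=None) -> bool:
        now = time.monotonic() if now is None else now
        # Refill only while online latency is comfortably under the SLO.
        if current_p99_ms < 0.8 * self.p99_slo_ms:
            self.tokens = min(self.burst,
                              self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

This complements, rather than replaces, the other mitigations in the question: node isolation and priority scheduling remove interference structurally, while admission control bounds the blast radius when colocated work remains.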
Hard · Technical
A framework upgrade increased throughput but introduced a 10% latency regression in p95. As the AI Engineer responsible for the service, how would you quantify cost vs benefit, present a recommendation to stakeholders, and structure a canary rollout or rollback plan that is SLO-aware?
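The SLO-aware part of the canary decision can be reduced to a simple gate: promote only if the latency regression stays inside an explicit budget, regardless of the cost savings. The function and the 5% budget below are a hypothetical sketch, not a prescribed policy:

```python
def canary_verdict(baseline_p95_ms: float, canary_p95_ms: float,
                   baseline_cost_per_req: float, canary_cost_per_req: float,
                   latency_budget_pct: float = 5.0):
    """Return ('promote'|'rollback', latency delta %, cost delta %).
    Latency budget acts as a hard gate; cost only informs the trade-off
    discussion when the canary is within budget."""
    latency_delta_pct = (100.0 * (canary_p95_ms - baseline_p95_ms)
                         / baseline_p95_ms)
    cost_delta_pct = (100.0 * (canary_cost_per_req - baseline_cost_per_req)
                      / baseline_cost_per_req)
    if latency_delta_pct > latency_budget_pct:
        return ("rollback", latency_delta_pct, cost_delta_pct)
    return ("promote", latency_delta_pct, cost_delta_pct)
```

With the regression from the question (10% p95 increase) this gate recommends rollback even if the throughput gain cuts cost per request, which is exactly the kind of explicit, numbers-first recommendation stakeholders can debate.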
