
Performance Engineering and Cost Optimization Questions

Engineering practices and trade-offs for meeting performance objectives while controlling operational cost. Topics include setting latency and throughput targets and latency budgets; benchmarking, profiling, and tuning across the application, database, and infrastructure layers; memory, compute, serialization, and batching optimizations; asynchronous processing and workload shaping; capacity estimation and right-sizing of compute and storage to reduce cost; understanding cost drivers in cloud environments, including network egress and storage tiering; trade-offs between real-time and batch processing; and monitoring to detect and prevent performance regressions. Candidates should describe measurement-driven approaches to optimization and be able to justify trade-offs between cost, complexity, and user experience.

Hard · System Design
Design a production monitoring and alerting system to detect performance regressions for a model-serving platform in near real-time. Include which signals to monitor, how to compute baselines, anomaly-detection approaches (statistical thresholds, change-point detection), alert prioritization, and automated mitigations (circuit-breakers, throttling).
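One way to approach the statistical-threshold part of an answer is a rolling-baseline z-score check over a stream of latency percentiles. Below is a minimal Python sketch; the class name, window size, and threshold are illustrative assumptions, not part of the question.

```python
from collections import deque
from statistics import mean, stdev


class LatencyRegressionDetector:
    """Flag p99 samples that sit far above a rolling baseline."""

    def __init__(self, window: int = 60, z_threshold: float = 3.0):
        self.window = window            # recent samples that form the baseline
        self.z_threshold = z_threshold  # std deviations above baseline that count as a regression
        self.samples = deque(maxlen=window)

    def observe(self, p99_ms: float) -> bool:
        """Record one p99 sample; return True if it looks like a regression."""
        is_anomaly = False
        if len(self.samples) >= max(2, self.window // 2):  # wait for a usable baseline
            mu = mean(self.samples)
            sigma = stdev(self.samples) or 1e-9            # guard against a flat baseline
            is_anomaly = (p99_ms - mu) / sigma > self.z_threshold
        if not is_anomaly:
            # Only non-anomalous points extend the baseline, so a sustained
            # regression keeps alerting instead of being absorbed into it.
            self.samples.append(p99_ms)
        return is_anomaly
```

Change-point detectors such as CUSUM follow the same shape: maintain a statistic over the stream and alert when it crosses a threshold. The detector's verdict would then feed alert routing and automated mitigations such as throttling or circuit-breaking.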
Medium · Technical
A backend service issues several small synchronous calls to an inference microservice for each user request, causing high client-side tail latency. Propose a refactor to reduce tail latency, including batching/coalescing requests, asynchronous processing, and merging model calls. Explain the trade-offs in complexity and consistency.
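For the coalescing piece specifically, one common pattern is to let concurrent callers that need the same result share a single in-flight call. A minimal asyncio sketch, assuming a hypothetical fetch coroutine that wraps the inference client:

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict


class RequestCoalescer:
    """Collapse concurrent identical requests into one downstream call."""

    def __init__(self, fetch: Callable[[str], Awaitable[Any]]):
        self._fetch = fetch                           # hypothetical inference-client call
        self._inflight: Dict[str, asyncio.Task] = {}

    async def get(self, key: str) -> Any:
        task = self._inflight.get(key)
        if task is None:
            # First caller for this key issues the real request...
            task = asyncio.create_task(self._fetch(key))
            self._inflight[key] = task
            task.add_done_callback(lambda _t, k=key: self._inflight.pop(k, None))
        # ...and every later caller awaits the same task instead of adding
        # another synchronous round trip to the tail.
        return await task
```

Batching distinct calls into one request and moving non-critical calls off the request path attack the same tail, at the cost of extra queueing logic and weaker consistency for anything computed out of band.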
Medium · Technical
You observe weekly spikes in p99 latency during nightly batch jobs that share the same GPU nodes as online inference. Describe mitigation strategies including workload shaping, scheduling, node isolation, workload priorities, and admission control. Explain how you'd detect and prevent these interference issues proactively.
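Admission control is the part that lends itself to a small sketch: admit batch work only while the online path has latency headroom, and rate-limit whatever is admitted. The 80% headroom rule and the token-bucket parameters below are illustrative assumptions:

```python
import time


class BatchAdmissionController:
    """Gate batch work behind online latency headroom plus a token bucket."""

    def __init__(self, current_p99_ms, p99_budget_ms: float,
                 rate_per_s: float = 1.0, burst: int = 5):
        self._current_p99_ms = current_p99_ms  # callable returning the live online p99
        self._p99_budget_ms = p99_budget_ms    # latency objective for the online path
        self._rate = rate_per_s                # batch-job starts allowed per second
        self._capacity = burst
        self._tokens = float(burst)
        self._last = time.monotonic()

    def admit_batch_job(self) -> bool:
        # Shed batch load first whenever the online path nears its budget.
        if self._current_p99_ms() > 0.8 * self._p99_budget_ms:
            return False
        # Token bucket: refill, then spend one token per admitted job.
        now = time.monotonic()
        self._tokens = min(self._capacity, self._tokens + (now - self._last) * self._rate)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False
```

Node isolation, priorities, and scheduling windows are complementary controls applied at the cluster level rather than in application code.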
Medium · System Design
Design a set of 'golden signals' and instrumentation for monitoring model-serving systems so teams can detect and triage performance regressions quickly. Include specific metrics (latency percentiles, error rates, queue depth, resource saturation), tracing spans to capture, log structure for aggregations, and alerting thresholds to avoid alert fatigue.
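A minimal instrumentation sketch, assuming the prometheus_client Python library; the metric names, label sets, and bucket boundaries are illustrative choices rather than a prescribed standard:

```python
import time

from prometheus_client import Counter, Gauge, Histogram

REQUEST_LATENCY = Histogram(
    "inference_request_latency_seconds",
    "End-to-end latency of inference requests",
    ["model", "status"],
    buckets=(0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5),
)
REQUEST_ERRORS = Counter(
    "inference_request_errors_total",
    "Failed inference requests",
    ["model", "error_type"],
)
QUEUE_DEPTH = Gauge(
    "inference_queue_depth",
    "Requests waiting for a batch slot",  # set from the batching queue elsewhere
    ["model"],
)


def handle_request(model: str, run_inference, payload):
    """Wrap a (hypothetical) run_inference call with golden-signal metrics."""
    start = time.perf_counter()
    try:
        result = run_inference(payload)
        REQUEST_LATENCY.labels(model=model, status="ok").observe(time.perf_counter() - start)
        return result
    except Exception as exc:
        REQUEST_ERRORS.labels(model=model, error_type=type(exc).__name__).inc()
        REQUEST_LATENCY.labels(model=model, status="error").observe(time.perf_counter() - start)
        raise
```

Alerting on these signals is typically tied to sustained threshold breaches or SLO burn rates rather than single samples, which keeps alert volume manageable.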
Hard · Technical
Implement a dynamic batching component (in Python pseudo-code) for an async inference server. The component should collect incoming requests into batches up to max_batch_size, respect per-request deadlines (max_wait_ms), and avoid starvation of older requests. Show how concurrency and timers are handled and describe the complexity.
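One possible shape for an answer, as an asyncio sketch (the run_batch coroutine and the default parameters are assumptions): requests are queued FIFO, and a batch is flushed either when it reaches max_batch_size or when its oldest request has waited max_wait_ms, which is also what prevents starvation.

```python
import asyncio
from typing import Any, Awaitable, Callable, List


class DynamicBatcher:
    """Collect requests into batches bounded by size and by the oldest request's wait."""

    def __init__(self, run_batch: Callable[[List[Any]], Awaitable[List[Any]]],
                 max_batch_size: int = 32, max_wait_ms: float = 5.0):
        self._run_batch = run_batch  # hypothetical coroutine: inputs -> outputs, same order
        self._max_batch_size = max_batch_size
        self._max_wait = max_wait_ms / 1000.0
        self._queue: asyncio.Queue = asyncio.Queue()
        # Must be constructed inside a running event loop.
        self._worker = asyncio.create_task(self._loop())

    async def submit(self, item: Any) -> Any:
        """Enqueue one request and await its result."""
        fut = asyncio.get_running_loop().create_future()
        await self._queue.put((item, fut))
        return await fut

    async def _loop(self) -> None:
        loop = asyncio.get_running_loop()
        while True:
            item, fut = await self._queue.get()      # block until the first request arrives
            batch, futures = [item], [fut]
            deadline = loop.time() + self._max_wait  # deadline keyed to the oldest request
            while len(batch) < self._max_batch_size:
                timeout = deadline - loop.time()
                if timeout <= 0:
                    break
                try:
                    item, fut = await asyncio.wait_for(self._queue.get(), timeout)
                except asyncio.TimeoutError:
                    break  # oldest request hit max_wait_ms: flush what we have
                batch.append(item)
                futures.append(fut)
            try:
                results = await self._run_batch(batch)
                for f, r in zip(futures, results):
                    f.set_result(r)
            except Exception as exc:
                for f in futures:
                    f.set_exception(exc)
```

Per-request overhead is O(1) amortized; the main trade-off is up to max_wait_ms of added latency under light load in exchange for higher throughput under heavy load.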
