InterviewStack.io

Large Language Model Observability and Evaluation Questions

Covers end-to-end product and technical considerations for monitoring, evaluating, and troubleshooting large language model systems. Topics include what observability means for model-driven features; which signals to capture, such as input provenance, token usage, latency, error modes, and outcome quality; and how to design instrumentation and data contracts that ensure consistent, auditable telemetry. It also covers evaluation approaches and metrics such as relevance, accuracy, hallucination rate, calibration, and cost, along with the trade-offs among human labeling, automated metrics, and model-based judges. Product design aspects include dashboards, alerts, logging, tracing, debugging interfaces, and developer workflows that make investigation and root-cause analysis efficient. Finally, the topic addresses operational concerns for an observability platform: storage and cost trade-offs, scaling telemetry pipelines, privacy and compliance constraints, and how evaluation and observability feed back into model improvement cycles.
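The signals listed above (input provenance, token usage, latency, error modes, outcome quality) can be pinned down in a telemetry data contract. Below is a minimal sketch of such a contract as a Python dataclass; all field names are illustrative assumptions, not a standard schema.

```python
# Hypothetical telemetry data contract for a single LLM inference call.
# Field names are illustrative assumptions, not a standard schema.
import json
import time
import uuid
from dataclasses import asdict, dataclass, field
from typing import Optional

@dataclass
class InferenceEvent:
    request_id: str
    model_version: str
    input_source: str                 # input provenance: "user", "retrieval", "system"
    prompt_tokens: int
    completion_tokens: int
    latency_ms: int
    error_mode: Optional[str]         # e.g. "timeout", "content_filter"; None on success
    outcome_quality: Optional[float]  # 0-1 judge score, if one has been computed
    ts: float = field(default_factory=time.time)

    def to_log_line(self) -> str:
        """Serialize to a JSON line for the telemetry pipeline."""
        return json.dumps(asdict(self), sort_keys=True)

event = InferenceEvent(
    request_id=str(uuid.uuid4()),
    model_version="m-2024-06",
    input_source="user",
    prompt_tokens=412,
    completion_tokens=88,
    latency_ms=930,
    error_mode=None,
    outcome_quality=0.82,
)
line = event.to_log_line()
```

Freezing the schema like this is what makes downstream dashboards and audits consistent: every producer emits the same fields, and missing values are explicit rather than silently absent.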

Medium · Technical
23 practiced
Design an automated anomaly detection solution (statistical + ML-based) to surface sudden quality drops such as increased hallucination or degraded relevance without dense labels. Describe features you would compute, model types, training/evaluation approach, false positive handling, alerting integration, and how to surface explainability to engineers and PMs.
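A minimal statistical baseline for this question is to track a quality proxy (for example, a judged hallucination rate on a sampled slice) and flag points that deviate sharply from an exponentially weighted moving average. The sketch below uses an EWMA with an incremental variance estimate; the alpha, threshold, and warmup values are illustrative assumptions.

```python
# Flag points where a quality proxy deviates > z_threshold standard
# deviations from an exponentially weighted moving average (EWMA).
from math import sqrt

def ewma_anomalies(series, alpha=0.3, z_threshold=3.0, warmup=5):
    """Return indices where the value is an outlier relative to the EWMA."""
    mean, var, anomalies = series[0], 0.0, []
    for i, x in enumerate(series[1:], start=1):
        std = sqrt(var)
        if i >= warmup and std > 0 and abs(x - mean) / std > z_threshold:
            anomalies.append(i)
        # Incremental EWMA updates for mean and variance.
        diff = x - mean
        mean += alpha * diff
        var = (1 - alpha) * (var + alpha * diff * diff)
    return anomalies

# Stable ~2% hallucination rate, then a sudden jump after a model update.
rates = [0.02, 0.021, 0.019, 0.02, 0.022, 0.018, 0.02, 0.09]
flagged = ewma_anomalies(rates)  # the jump at index 7 is flagged
```

In a real pipeline this would run per segment (model version, customer, intent), and the EWMA-based score would be one feature among several feeding the alerting layer, which helps with the false-positive handling the question asks about.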
Medium · Technical
21 practiced
A major enterprise customer reports that after a recent model update the assistant returned hallucinated personal data that appears tied to their users. As TPM, describe immediate mitigation actions you would take, the short-term rollback/containment plan, stakeholder communications, and telemetry you would collect for a thorough root cause analysis.
Hard · System Design
28 practiced
Design a developer-facing debug UI that lets engineers replay a request, inspect the exact prompt and response, step through token-level metadata (for example, model probabilities/confidences or retrieval matches), and run local replays with instrumentation. Discuss backend requirements for storage, retrieval latency, RBAC, audit logs, and cost trade-offs when storing token-level detail.
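One way to frame the RBAC and audit-log requirements in this question is to gate every trace read behind a permission check that also writes an audit record. The sketch below shows that access path with an in-memory store; all names are hypothetical, and a production system would use tiered storage and an append-only audit sink.

```python
# Sketch of the debug-UI backend access path: an RBAC check and an
# audit-log write gate every trace read. All names are hypothetical.
from dataclasses import dataclass

@dataclass
class Trace:
    request_id: str
    prompt: str
    response: str
    token_metadata: list  # per-token dicts, e.g. {"token": ..., "logprob": ...}

class TraceStore:
    def __init__(self):
        self._traces = {}   # production: hot DB + cold object store for token detail
        self.audit_log = [] # production: append-only, immutable audit sink

    def put(self, trace: Trace):
        self._traces[trace.request_id] = trace

    def get(self, request_id: str, user: str, roles: set) -> Trace:
        if "debug.read" not in roles:
            self.audit_log.append(("DENIED", user, request_id))
            raise PermissionError(f"{user} lacks debug.read")
        self.audit_log.append(("READ", user, request_id))
        return self._traces[request_id]

store = TraceStore()
store.put(Trace("r1", "Summarize...", "Summary...",
                [{"token": "Sum", "logprob": -0.1}]))
trace = store.get("r1", user="alice", roles={"debug.read"})
```

Logging denials as well as reads matters for the audit requirement: investigators need to see attempted access, not just successful access, and the audit sink must not be writable by the same roles it records.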
Easy · Technical
24 practiced
Write a single standard SQL query to compute average latency, median (P50), 95th percentile (P95), and request count per model_version over the last 24 hours. Assume an inference_logs table with columns: id STRING, start_ts TIMESTAMP, end_ts TIMESTAMP, model_version STRING, latency_ms INT. Use SQL features common to BigQuery/Postgres.
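A plain-Python reference implementation of the same metrics (average latency, P50, P95, count per model_version) is useful for validating a candidate's SQL against a small fixture. Note the percentile convention here is nearest-rank; SQL engines may interpolate (e.g. Postgres's percentile_cont), so results on tiny fixtures can differ slightly.

```python
# Reference computation of per-model latency stats, for cross-checking
# SQL query output on a small fixture. Uses nearest-rank percentiles.
from collections import defaultdict

def percentile(sorted_vals, p):
    """Nearest-rank percentile on a pre-sorted list."""
    k = max(0, min(len(sorted_vals) - 1, round(p * (len(sorted_vals) - 1))))
    return sorted_vals[k]

def latency_stats(rows):
    """rows: iterable of (model_version, latency_ms) tuples."""
    by_model = defaultdict(list)
    for version, latency in rows:
        by_model[version].append(latency)
    stats = {}
    for version, vals in by_model.items():
        vals.sort()
        stats[version] = {
            "avg": sum(vals) / len(vals),
            "p50": percentile(vals, 0.50),
            "p95": percentile(vals, 0.95),
            "count": len(vals),
        }
    return stats

rows = [("v1", 100), ("v1", 200), ("v1", 300), ("v2", 50)]
stats = latency_stats(rows)
```

The same fixture can be loaded into the inference_logs table to confirm that the candidate's GROUP BY and percentile clauses agree with these numbers.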
Hard · Technical
28 practiced
Describe a taxonomy for classifying hallucination types (for example: fabricated facts, hallucinated entities, incorrect dates, misleading paraphrase). For each class, explain detection strategies, product impact, monitoring metrics, and typical mitigation approaches (prompting, retrieval, calibration, or model changes).
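For one class in such a taxonomy ("incorrect dates"), a common detection strategy is a grounding check: flag any date claimed in the answer that never appears in the retrieved context. The toy sketch below matches only four-digit years; a real detector would normalize date formats and resolve relative references, so this is only an illustration of the pattern.

```python
# Toy grounding check for the "incorrect dates" hallucination class:
# flag four-digit years in the answer that are absent from the context.
import re

YEAR = re.compile(r"\b(1[89]\d{2}|20\d{2})\b")

def ungrounded_years(answer: str, context: str) -> list:
    """Return years claimed in the answer but not present in the context."""
    context_years = set(YEAR.findall(context))
    return [y for y in YEAR.findall(answer) if y not in context_years]

context = "The treaty was signed in 1998 and ratified in 1999."
answer = "The treaty, signed in 1998, took effect in 2003."
flags = ungrounded_years(answer, context)  # -> ['2003']
```

The rate of flagged answers per model version makes a natural monitoring metric for this class, and the same pattern (extract claims, check against retrieved evidence) extends to entities and quantities.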
