Vaishali Vinay, Microsoft
Most LLM evaluation pipelines quietly assume something we rarely have in production: a stable ground truth. Benchmarks treat correctness as singular. Human annotation treats disagreement as noise. Metrics assume yesterday's distribution still represents today's reality. In practice, these assumptions often fail. Real-world deployments operate in environments where multiple answers may be valid, labels evolve faster than evaluation datasets, and models that improve on benchmarks can still regress in the decisions they influence.
This talk draws from production LLM evaluation workflows to examine recurring failure modes: annotator disagreement misinterpreted as model error, metric drift mistaken for model regression, benchmark overfitting presented as progress, and evaluations that lose meaning as tasks, users, and environments evolve.
Rather than proposing another benchmark, we argue that evaluation should be treated as decision-making under uncertainty. We present practical patterns that help organizations assess model quality when ground truth is incomplete, contested, or constantly changing.

Vaishali Vinay is a Senior Data & Applied Scientist at Microsoft, where she works on the design, deployment, and evaluation of large-scale AI systems for security and decision support. With over a decade of experience in applied AI and cybersecurity, she has led and contributed to production systems for threat detection, alert correlation, and AI-driven efficacy analysis. Her work involves human-in-the-loop evaluation, longitudinal quality measurement, and assessing model behavior in real-world operational settings. She has authored multiple patents and publications and has delivered invited talks on applied AI and security. Her work has directly informed enterprise-scale security operations and product decisions.
