Evals
Engineering
Building Eval Harnesses That Matter
Salvatan
December 8, 2024
8 min read
If your eval suite does not catch real issues before production, it is just theater. Good evals require thought about what actually matters for your use case.
Common Eval Mistakes
1. Testing only happy paths
2. Using generic benchmarks (MMLU, HellaSwag) for specialized domains
3. No adversarial cases
4. Ignoring cost and latency
5. LLM-as-judge without validation (see the sketch below)
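On that last mistake: before gating releases on an LLM judge, measure how often it agrees with human labels on a sample you have graded yourself. Here is a minimal sketch, assuming labels are stored as JSONL records; the field names `judge_label` and `human_label` are illustrative, not a fixed schema:

```python
import json

def load_labeled_cases(path: str) -> list[dict]:
    # One JSON object per line, e.g. {"judge_label": "pass", "human_label": "fail"}
    with open(path) as f:
        return [json.loads(line) for line in f]

def judge_agreement(cases: list[dict]) -> float:
    # Fraction of cases where the LLM judge matches the human grader.
    matches = sum(c["judge_label"] == c["human_label"] for c in cases)
    return matches / len(cases)

if __name__ == "__main__":
    # Tiny inline sample standing in for a real graded set.
    cases = [
        {"judge_label": "pass", "human_label": "pass"},
        {"judge_label": "pass", "human_label": "fail"},
    ]
    print(f"judge/human agreement: {judge_agreement(cases):.0%}")
```

If agreement is low, judge scores are a noisy signal to investigate, not a gate to ship against.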
What Works
- Golden test sets from real production failures
- Rubrics tied to user-facing outcomes
- Edge cases and refusal scenarios
- Tool use validation (did it call the right function?)
- Cost per query regression tracking (see the sketch after this list)
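To make the golden-set, tool-use, and cost items concrete, here is a minimal runner sketch. Everything in it is an assumption standing in for your own stack: `run_model`, the case schema, and the cost budgets are illustrative, not a real API:

```python
from dataclasses import dataclass

@dataclass
class GoldenCase:
    prompt: str                 # ideally sourced from a real production failure
    expected_tool: str | None   # None = model should answer directly, no tool call
    max_cost_usd: float         # per-query budget; exceeding it is a regression

def run_model(prompt: str) -> dict:
    """Hypothetical adapter around your model or agent stack -- swap in your real call.
    Assumed to report which tool (if any) was invoked and the measured cost."""
    return {"tool_called": "issue_refund", "cost_usd": 0.015, "text": "Refund issued."}

def run_golden_suite(cases: list[GoldenCase]) -> None:
    for case in cases:
        out = run_model(case.prompt)
        tool_ok = out["tool_called"] == case.expected_tool   # did it call the right function?
        cost_ok = out["cost_usd"] <= case.max_cost_usd       # did cost regress?
        status = "PASS" if tool_ok and cost_ok else "FAIL"
        print(f"{status}  tool_ok={tool_ok}  cost=${out['cost_usd']:.3f}  {case.prompt!r}")

if __name__ == "__main__":
    run_golden_suite([GoldenCase("Refund order #1234", "issue_refund", 0.02)])
```

The point of keeping the runner this plain is that each failing line maps directly to a user-facing outcome: a wrong tool call or a blown cost budget, not an abstract benchmark score.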
Your eval suite should be opinionated about your product. Generic benchmarks are a starting point, not the goal.