Evals
Engineering
Building Eval Harnesses That Matter
Salvatan
December 8, 2024
8 min read
If your eval suite does not catch real issues before production, it is just theater. Good evals require thought about what actually matters for your use case.
Common Eval Mistakes
1. Testing only happy paths
2. Using generic benchmarks (MMLU, HellaSwag) for specialized domains
3. No adversarial cases
4. Ignoring cost and latency
5. LLM-as-judge without validation (see the sketch below)
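On that last mistake: before gating releases on an LLM judge, measure how often it agrees with human labels on a sample you have graded yourself. Here is a minimal sketch, assuming labels are stored as JSONL records; the field names `judge_label` and `human_label` are illustrative, not a fixed schema:

```python
import json

def load_labeled_cases(path: str) -> list[dict]:
    # One JSON object per line, e.g. {"judge_label": "pass", "human_label": "fail"}
    with open(path) as f:
        return [json.loads(line) for line in f]

def judge_agreement(cases: list[dict]) -> float:
    # Fraction of cases where the LLM judge matches the human grader.
    matches = sum(c["judge_label"] == c["human_label"] for c in cases)
    return matches / len(cases)

if __name__ == "__main__":
    # Tiny inline sample standing in for a real graded set.
    cases = [
        {"judge_label": "pass", "human_label": "pass"},
        {"judge_label": "pass", "human_label": "fail"},
    ]
    print(f"judge/human agreement: {judge_agreement(cases):.0%}")
```

If agreement is low, judge scores are a noisy signal to investigate, not a gate to ship against.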
What Works
- Golden test sets from real production failures
- Rubrics tied to user-facing outcomes
- Edge cases and refusal scenarios
- Tool use validation (did it call the right function?)
- Cost per query regression tracking (see the sketch after this list)
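To make the golden-set, tool-use, and cost items concrete, here is a minimal runner sketch. Everything in it is an assumption standing in for your own stack: `run_model`, the case schema, and the cost budgets are illustrative, not a real API:

```python
from dataclasses import dataclass

@dataclass
class GoldenCase:
    prompt: str                 # ideally sourced from a real production failure
    expected_tool: str | None   # None = model should answer directly, no tool call
    max_cost_usd: float         # per-query budget; exceeding it is a regression

def run_model(prompt: str) -> dict:
    """Hypothetical adapter around your model or agent stack -- swap in your real call.
    Assumed to report which tool (if any) was invoked and the measured cost."""
    return {"tool_called": "issue_refund", "cost_usd": 0.015, "text": "Refund issued."}

def run_golden_suite(cases: list[GoldenCase]) -> None:
    for case in cases:
        out = run_model(case.prompt)
        tool_ok = out["tool_called"] == case.expected_tool   # did it call the right function?
        cost_ok = out["cost_usd"] <= case.max_cost_usd       # did cost regress?
        status = "PASS" if tool_ok and cost_ok else "FAIL"
        print(f"{status}  tool_ok={tool_ok}  cost=${out['cost_usd']:.3f}  {case.prompt!r}")

if __name__ == "__main__":
    run_golden_suite([GoldenCase("Refund order #1234", "issue_refund", 0.02)])
```

The point of keeping the runner this plain is that each failing line maps directly to a user-facing outcome: a wrong tool call or a blown cost budget, not an abstract benchmark score.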
Your eval suite should be opinionated about your product. Generic benchmarks are a starting point, not the goal.