[12:08] Jason Wei’s blog
- “key job of the team lead is to dictate what eval to optimize”
- “a big paper claims some victory using the eval … a good score on the eval must mean something significant and easily understandable”
- Jason goes on to be specific about what a good eval should be, listing 7 characteristics
- Enough eval examples: at least 1,000
- High quality, with few wrongly annotated examples
- Simple to understand and compare, e.g. a single metric; counter-example, HELM
- Simple to carry out and use; counter-example, BIG-Bench
- Test examples should involve meaningful tasks; counter-examples in BIG-Bench include asking the model to recommend movies or close parentheses
- The grading in the eval should be extremely correct, so for LLM evals the parsing of the natural-language answer into a gradable form should be designed carefully
- Good evals should not saturate too quickly
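To make the careful-parsing point concrete, here is a minimal sketch (my own illustration, not code from Jason's post) of an answer extractor plus exact-match grader. The marker patterns and the fallback to the last number are assumptions about a typical short-answer eval, not a standard:

```python
import re
from typing import Optional

def extract_answer(response: str) -> Optional[str]:
    """Pull the final answer out of a free-form model response.

    Tries explicit markers first ("Answer: X", "the answer is X"),
    then an equation result at the end ("= 42"), then falls back
    to the last number anywhere in the text.
    """
    patterns = [
        r"[Aa]nswer\s*(?:is|:)\s*\(?([A-Za-z0-9.,-]+)\)?",  # "Answer: X"
        r"=\s*([0-9.,-]+)\s*$",                              # trailing "= 42"
    ]
    for pat in patterns:
        matches = re.findall(pat, response)
        if matches:
            # Use the last match: models often restate the answer at the end.
            return matches[-1].strip().rstrip(".,")
    numbers = re.findall(r"-?\d[\d,]*\.?\d*", response)
    if numbers:
        return numbers[-1].rstrip(".,")
    return None

def normalize(ans: str) -> str:
    """Canonicalize so "1,024." and "1024" grade as equal."""
    return ans.replace(",", "").strip().lower()

def grade(response: str, gold: str) -> bool:
    """Exact-match grading after extraction and normalization."""
    pred = extract_answer(response)
    return pred is not None and normalize(pred) == normalize(gold)
```

A parser like this is exactly where "extremely correct" grading lives or dies: a missed pattern silently marks correct answers wrong and quietly skews the eval score.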