[12:08] Jason Wei’s blog
- “key job of the team lead is to dictate what eval to optimize”
- “a big paper claims some victory using the eval … a good score on the eval must mean something significant and easily understandable”
- Jason goes on to be specific about what a good eval should be, listing 7 characteristics
- Enough eval examples: at least 1,000
- High quality, with few wrongly annotated examples
- Simple to understand and compare, e.g. a single metric; counter-example, HELM
- Simple to carry out and use; counter-example, BIG-Bench
- Test examples should involve meaningful tasks; counter-examples in BIG-Bench include asking the model to recommend movies or close parentheses
- The grading in the eval should be extremely correct, so for LLM evals the parsing of the natural-language answer into a gradable form should be designed carefully
- Good evals should not saturate too quickly
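To make the careful-parsing point concrete, here is a minimal sketch (my own illustration, not code from Jason's post) of an answer extractor plus exact-match grader. The marker patterns and the fallback to the last number are assumptions about a typical short-answer eval, not a standard:

```python
import re
from typing import Optional

def extract_answer(response: str) -> Optional[str]:
    """Pull the final answer out of a free-form model response.

    Tries explicit markers first ("Answer: X", "the answer is X"),
    then an equation result at the end ("= 42"), then falls back
    to the last number anywhere in the text.
    """
    patterns = [
        r"[Aa]nswer\s*(?:is|:)\s*\(?([A-Za-z0-9.,-]+)\)?",  # "Answer: X"
        r"=\s*([0-9.,-]+)\s*$",                              # trailing "= 42"
    ]
    for pat in patterns:
        matches = re.findall(pat, response)
        if matches:
            # Use the last match: models often restate the answer at the end.
            return matches[-1].strip().rstrip(".,")
    numbers = re.findall(r"-?\d[\d,]*\.?\d*", response)
    if numbers:
        return numbers[-1].rstrip(".,")
    return None

def normalize(ans: str) -> str:
    """Canonicalize so "1,024." and "1024" grade as equal."""
    return ans.replace(",", "").strip().lower()

def grade(response: str, gold: str) -> bool:
    """Exact-match grading after extraction and normalization."""
    pred = extract_answer(response)
    return pred is not None and normalize(pred) == normalize(gold)
```

A parser like this is exactly where "extremely correct" grading lives or dies: a missed pattern silently marks correct answers wrong and quietly skews the eval score.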