paper-buf
LLM safety
Towards scalable oversight: Meta-evaluation of LLMs as evaluators via agent debate, the arXiv version, and code
Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space
NOTES
LLM
Evaluating the ripple effects of knowledge editing in language models
NOTES
From understanding to utilization: A survey on explainability for large language models
A practical review of mechanistic interpretability for transformer-based language models