paper-buf
Top-10 papers of the day to read
LLM safety related
What's in your safe data? Identifying benign data that breaks safety
Language models learn to mislead humans via RLHF
SaLoRA: Safety-alignment preserved low-rank adaptation
Layer-level self-exposure and patch: Affirmative token mitigation for jailbreak attack defense
Turning logic against itself: Probing model defenses through contrastive questions
Others
Predicting the performance of black-box LLMs through self-queries
Dynamic skill adaptation for large language models
Grokking at the edge of numerical stability
A statistical theory of contrastive pre-training and multimodal generative AI