1. Introduction

I want to firstly outline the contents of this article. It is a reflection of my understanding of general reflection of my survey of of several.

Based on my unfinished survey on the topic LLM safety, I am motivated to write down some summaries and thoughts on this both old and infant topic.

My logic of structuring this article is as follows:

2. The definition issue

To google “large language model safety”, “”

xxx

Toxicity, truthfulness, the start

I think LLM safety, as a growing interest for both research community and industry, started from a few researchers who realized and advocated that: with ever-growing size but unfathomable training data, transformer-based language models, though with benchmark-killing performances, are prone to generate toxic and hallucinated contents.

After the launches of growing size gpt1/2/3, researchers from UW and AI2 curated and released a dataset named RealToxicityPrompts of 100K prompts. Each prompt is paired with a TOXICITY score (range of $[0, 1]$) from Perspective API, the higher the more toxic. They found the selected models could generate toxic contents even with unconditional sampling, and generate higher toxic score responses much more frequently conditioned on toxic prompts. They also tried domain-adaptive pre-training non-toxic corpus, attribute conditioning training, classifier-guided decoding to reduce model’s tendency to toxicity.