0. Updates & motivations

0.1. Updates

| Date     | Updates                                            |
| -------- | -------------------------------------------------- |
| 25/02/16 | Add the new version of OpenAI’s Model Spec          |
| 25/02/06 | Add discussion of the ART paper in Sec. 5.1         |

0.2. Motivations

1. Introduction

Based on my unfinished survey on the topic of LLM safety, I am motivated to write down some summaries and thoughts on this topic, which is at once old and still in its infancy.

The logic behind the structure of this article is as follows:

  1. Firstly (Sec. 2), I introduce a few works on the toxicity and truthfulness issues of LLMs. They mark the start of the general field of LLM safety.
  2. Secondly (Sec. 3), I try to define LLM safety by describing its extension and intension.
  3. Then (Sec. 4), I discuss a number of papers that attempt to explain and understand the origins of LLM unsafety from statistical, mechanistic, and data-centric perspectives.
  4. Fourthly (Sec. 5), I summarize and exemplify several jailbreaking methods.
  5. Lastly (Sec. 6), I list and discuss several directions for defense.
  6. As a bonus (Sec. 7), I discuss, based on my understanding, the path towards LLM safety from the inside out.

2. Toxic and untruthful, the start

This section motivates LLM safety by reflecting on early works that study the toxicity and untruthfulness problems of early LLMs like GPT-3. I think LLM safety, now a growing interest for both the research community and industry, started with a few researchers who realized and advocated that: