In this article, I review and summarise several papers on the development of foundation models for natural language.

Contextualized Representations

ELMo → BERT → ELECTRA/XLNet/RoBERTa/ALBERT

xxx
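
As a toy illustration of the masked language modelling (MLM) objective that BERT popularised (and that ELECTRA and XLNet later replace with replaced-token detection and permutation language modelling, respectively), here is a minimal PyTorch sketch. The vocabulary layout, model size and 15% masking rate are illustrative assumptions, the model is untrained, and BERT's 80/10/10 corruption rule and next-sentence prediction are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB_SIZE, PAD_ID, MASK_ID = 1000, 0, 1  # hypothetical vocabulary layout


class TinyMaskedLM(nn.Module):
    """A very small bidirectional Transformer encoder with an LM head."""

    def __init__(self, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, VOCAB_SIZE)

    def forward(self, token_ids):
        # Bidirectional self-attention: every position attends to the whole
        # sequence, unlike a left-to-right language model.
        return self.lm_head(self.encoder(self.embed(token_ids)))


def mlm_loss(model, token_ids, mask_prob=0.15):
    # Pick ~15% of non-pad positions, replace them with [MASK], and compute
    # cross-entropy only at those positions.
    mask = (torch.rand(token_ids.shape) < mask_prob) & (token_ids != PAD_ID)
    corrupted = token_ids.masked_fill(mask, MASK_ID)
    logits = model(corrupted)
    targets = token_ids.masked_fill(~mask, -100)  # -100 is ignored by the loss
    return F.cross_entropy(logits.reshape(-1, VOCAB_SIZE), targets.reshape(-1),
                           ignore_index=-100)


if __name__ == "__main__":
    batch = torch.randint(2, VOCAB_SIZE, (8, 32))  # a toy batch of token ids
    print("MLM loss:", mlm_loss(TinyMaskedLM(), batch).item())
```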

Few-shot and Zero-shot Learning

T5 and the GPT series.

xxx
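
To make the few-shot idea concrete, the sketch below builds a two-shot prompt and ranks candidate labels by their conditional log-likelihood under an autoregressive LM, which is the basic mechanic behind GPT-style in-context classification. It assumes the Hugging Face transformers package and uses the small public gpt2 checkpoint purely as a stand-in; the strong few-shot behaviour reported for GPT-3 only appears at much larger scale, so this shows the mechanics rather than the results.

```python
import torch
import torch.nn.functional as F
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()


def continuation_logprob(prompt, continuation):
    # Sum log p(token | preceding tokens) over the continuation tokens only.
    # Assumes appending the continuation does not change how the prompt itself
    # is tokenised (true for the example below; align token ids in general).
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = F.log_softmax(model(full_ids).logits, dim=-1)
    return sum(log_probs[0, pos - 1, full_ids[0, pos]].item()
               for pos in range(prompt_len, full_ids.shape[1]))


# A two-shot prompt: the "training data" lives entirely in the context window.
few_shot_prompt = (
    "Review: the film was an absolute delight. Sentiment: positive\n"
    "Review: a tedious, joyless slog. Sentiment: negative\n"
    "Review: I would happily watch it again. Sentiment:"
)
scores = {label: continuation_logprob(few_shot_prompt, " " + label)
          for label in ("positive", "negative")}
print(max(scores, key=scores.get), scores)
```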

Sequence-to-Sequence Pretraining

XLM, T5 and MASS.

xxx
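
MASS and T5 both pre-train an encoder-decoder by corrupting the input and training the decoder to regenerate what was removed (XLM adds cross-lingual variants such as translation language modelling). The dependency-free Python sketch below shows the two corruption schemes on whitespace tokens; the sentinel names follow T5's <extra_id_*> convention, but the span sampling is deliberately simplified compared with the papers.

```python
import random

MASK = "[MASK]"


def mass_style_example(tokens, span_frac=0.5, seed=0):
    # MASS masks one contiguous fragment (roughly half the sentence) on the
    # encoder side; the decoder must regenerate exactly that fragment.
    rng = random.Random(seed)
    span_len = max(1, int(len(tokens) * span_frac))
    start = rng.randint(0, len(tokens) - span_len)
    encoder_input = tokens[:start] + [MASK] * span_len + tokens[start + span_len:]
    decoder_target = tokens[start:start + span_len]
    return encoder_input, decoder_target


def t5_style_example(tokens, corrupt_prob=0.15, seed=0):
    # Simplified T5 "span corruption": drop roughly corrupt_prob of the tokens,
    # merge adjacent dropped tokens into spans, replace each span with a
    # sentinel, and target each sentinel followed by the tokens it replaced.
    rng = random.Random(seed)
    drop = [rng.random() < corrupt_prob for _ in tokens]
    encoder_input, decoder_target, sentinel, i = [], [], 0, 0
    while i < len(tokens):
        if drop[i]:
            sent = f"<extra_id_{sentinel}>"
            sentinel += 1
            encoder_input.append(sent)
            decoder_target.append(sent)
            while i < len(tokens) and drop[i]:
                decoder_target.append(tokens[i])
                i += 1
        else:
            encoder_input.append(tokens[i])
            i += 1
    return encoder_input, decoder_target


if __name__ == "__main__":
    sentence = "pretraining teaches the decoder to reconstruct missing spans".split()
    print(mass_style_example(sentence))
    print(t5_style_example(sentence, corrupt_prob=0.3, seed=1))
```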

References

[ELMo] Deep Contextualized Word Representations, NAACL 2018.

[BERT] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, NAACL 2019; first posted on arXiv on Oct. 11, 2018.

[MASS] MASS: Masked Sequence to Sequence Pre-training for Language Generation, ICML 2019.

[XLNet] XLNet: Generalized Autoregressive Pretraining for Language Understanding, NeurIPS 2019.

[XLM] Cross-lingual Language Model Pretraining, NeurIPS 2019.

[ALBERT] ALBERT: A Lite BERT for Self-supervised Learning of Language Representations, ICLR 2020.

[RoBERTa] RoBERTa: A Robustly Optimized BERT Pretraining Approach, arXiv preprint 2019 (rejected from ICLR 2020).

[ELECTRA] ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators, ICLR 2020.
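
[T5] Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, JMLR 2020.

[GPT-3] Language Models are Few-Shot Learners, NeurIPS 2020.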