This part summarises research on a) scaling laws and other interesting observations and findings about scaling models up, and b) engineering solutions for scaling up.

Scaling Law

Data Governance

Engineering Scaling-up

Here, we give a summary of the following blog post by OpenAI:

The post introduces four main techniques (a toy sketch of each follows the list):

  1. Data parallelism — run different subsets of a batch on different GPUs
  2. Pipeline parallelism — run different layers of the model on different GPUs
  3. Tensor parallelism — split the math of a single operation, such as a matrix multiplication, across GPUs
  4. Mixture-of-Experts (MoE) — process each example with only a fraction of each layer

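To make the first technique concrete, below is a minimal NumPy sketch of data parallelism (a toy setup of our own, not code from the post): every device keeps a full copy of the weights, computes gradients on its own shard of the batch, and the gradients are averaged the way an all-reduce would be on real hardware.

```python
# Toy sketch of data parallelism (illustrative only, not code from the post).
import numpy as np

rng = np.random.default_rng(0)
n_devices, batch, d_in, d_out = 4, 32, 16, 8
W = rng.normal(size=(d_in, d_out))      # weights, replicated on every device
X = rng.normal(size=(batch, d_in))      # full batch of inputs
Y = rng.normal(size=(batch, d_out))     # targets

def shard_grad(x, y, W):
    """Gradient of the mean-squared error on one device's shard."""
    err = x @ W - y
    return (2.0 / len(x)) * (x.T @ err)

# Each device processes batch // n_devices examples independently.
grads = [shard_grad(x, y, W)
         for x, y in zip(np.split(X, n_devices), np.split(Y, n_devices))]

# "All-reduce": average the per-device gradients and update every replica.
W -= 0.1 * np.mean(grads, axis=0)
```

The average of the per-shard gradients equals the full-batch gradient, so at a fixed global batch size data parallelism changes throughput, not the optimisation problem.
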
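For pipeline parallelism, here is a similarly hypothetical sketch: the model's layers are grouped into stages, one stage per device, and the batch is cut into micro-batches that flow through the stages so that, in steady state, different devices work on different micro-batches at the same time. The schedule is only simulated sequentially here.

```python
# Toy sketch of pipeline parallelism (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
n_stages, n_micro, d = 3, 4, 8
stage_weights = [rng.normal(size=(d, d)) for _ in range(n_stages)]  # one stage per device
micro_batches = list(rng.normal(size=(n_micro, 5, d)))              # the batch, split up

def stage_fn(s, x):
    """Forward pass of the layers assigned to stage (device) s."""
    return np.maximum(x @ stage_weights[s], 0.0)

in_flight = {}                              # micro-batch index -> latest activation
outputs = [None] * n_micro
for t in range(n_stages + n_micro - 1):     # one "tick" of the schedule
    for s in reversed(range(n_stages)):     # deeper stages handle older micro-batches
        m = t - s                           # micro-batch seen by stage s at tick t
        if 0 <= m < n_micro:
            x = micro_batches[m] if s == 0 else in_flight[m]
            y = stage_fn(s, x)
            if s == n_stages - 1:
                outputs[m] = y              # left the last stage: done
            else:
                in_flight[m] = y            # handed to the next stage
```
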
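For tensor parallelism, a minimal sketch under the same toy assumptions: a single matrix multiply X @ W is split column-wise, so each device stores only a slice of W, computes the matching slice of the output, and the slices are concatenated (an all-gather on real hardware).

```python
# Toy sketch of tensor parallelism (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
n_devices, batch, d_in, d_out = 4, 32, 16, 8
X = rng.normal(size=(batch, d_in))
W = rng.normal(size=(d_in, d_out))

W_shards = np.split(W, n_devices, axis=1)   # d_out // n_devices columns per device
partials = [X @ w for w in W_shards]        # each device's slice of the output
Y = np.concatenate(partials, axis=1)        # "all-gather" the slices
assert np.allclose(Y, X @ W)                # identical to the unsplit matmul
```

Splitting W along rows instead would make each device produce a partial sum of the full output, which needs an all-reduce rather than an all-gather.
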
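Finally, a toy sketch of a Mixture-of-Experts layer (again our own illustration, not the post's code): a router scores each token and sends it to its single best expert (top-1 routing), so every example touches only a fraction of the layer's parameters.

```python
# Toy sketch of a mixture-of-experts layer with top-1 routing (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
n_experts, n_tokens, d = 4, 10, 16
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]  # per-expert weights
router_w = rng.normal(size=(d, n_experts))                     # router weights
x = rng.normal(size=(n_tokens, d))                             # token activations

choice = (x @ router_w).argmax(axis=1)      # top-1 expert per token

y = np.empty_like(x)
for e in range(n_experts):
    mask = choice == e
    if mask.any():
        # Only this expert's weights are applied to the tokens routed to it.
        y[mask] = np.maximum(x[mask] @ experts[e], 0.0)
```
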
It also discusses a bunch of memory-saving tricks: