This part summarises research on a) scaling laws and other interesting observations and findings about scaling models up, and b) engineering solutions for scaling up.

Scaling Law

Data Governance

Engineering Scaling-up

Here, we give a summary of the following blog post by OpenAI:

The post introduces four main techniques (a toy sketch of each follows the list):

  1. Data parallelism — run different subsets of a batch on different GPUs
  2. Pipeline parallelism — run different layers of the model on different GPUs
  3. Tensor parallelism — split the math of a single operation, such as a matrix multiplication, across GPUs
  4. Mixture-of-Experts (MoE) — process each example with only a fraction of each layer

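To make the first technique concrete, below is a minimal NumPy sketch of data parallelism (a toy setup of our own, not code from the post): every device keeps a full copy of the weights, computes gradients on its own shard of the batch, and the gradients are averaged the way an all-reduce would be on real hardware.

```python
# Toy sketch of data parallelism (illustrative only, not code from the post).
import numpy as np

rng = np.random.default_rng(0)
n_devices, batch, d_in, d_out = 4, 32, 16, 8
W = rng.normal(size=(d_in, d_out))      # weights, replicated on every device
X = rng.normal(size=(batch, d_in))      # full batch of inputs
Y = rng.normal(size=(batch, d_out))     # targets

def shard_grad(x, y, W):
    """Gradient of the mean-squared error on one device's shard."""
    err = x @ W - y
    return (2.0 / len(x)) * (x.T @ err)

# Each device processes batch // n_devices examples independently.
grads = [shard_grad(x, y, W)
         for x, y in zip(np.split(X, n_devices), np.split(Y, n_devices))]

# "All-reduce": average the per-device gradients and update every replica.
W -= 0.1 * np.mean(grads, axis=0)
```

The average of the per-shard gradients equals the full-batch gradient, so at a fixed global batch size data parallelism changes throughput, not the optimisation problem.
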
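For pipeline parallelism, here is a similarly hypothetical sketch: the model's layers are grouped into stages, one stage per device, and the batch is cut into micro-batches that flow through the stages so that, in steady state, different devices work on different micro-batches at the same time. The schedule is only simulated sequentially here.

```python
# Toy sketch of pipeline parallelism (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
n_stages, n_micro, d = 3, 4, 8
stage_weights = [rng.normal(size=(d, d)) for _ in range(n_stages)]  # one stage per device
micro_batches = list(rng.normal(size=(n_micro, 5, d)))              # the batch, split up

def stage_fn(s, x):
    """Forward pass of the layers assigned to stage (device) s."""
    return np.maximum(x @ stage_weights[s], 0.0)

in_flight = {}                              # micro-batch index -> latest activation
outputs = [None] * n_micro
for t in range(n_stages + n_micro - 1):     # one "tick" of the schedule
    for s in reversed(range(n_stages)):     # deeper stages handle older micro-batches
        m = t - s                           # micro-batch seen by stage s at tick t
        if 0 <= m < n_micro:
            x = micro_batches[m] if s == 0 else in_flight[m]
            y = stage_fn(s, x)
            if s == n_stages - 1:
                outputs[m] = y              # left the last stage: done
            else:
                in_flight[m] = y            # handed to the next stage
```
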
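For tensor parallelism, a minimal sketch under the same toy assumptions: a single matrix multiply X @ W is split column-wise, so each device stores only a slice of W, computes the matching slice of the output, and the slices are concatenated (an all-gather on real hardware).

```python
# Toy sketch of tensor parallelism (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
n_devices, batch, d_in, d_out = 4, 32, 16, 8
X = rng.normal(size=(batch, d_in))
W = rng.normal(size=(d_in, d_out))

W_shards = np.split(W, n_devices, axis=1)   # d_out // n_devices columns per device
partials = [X @ w for w in W_shards]        # each device's slice of the output
Y = np.concatenate(partials, axis=1)        # "all-gather" the slices
assert np.allclose(Y, X @ W)                # identical to the unsplit matmul
```

Splitting W along rows instead would make each device produce a partial sum of the full output, which needs an all-reduce rather than an all-gather.
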
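Finally, a toy sketch of a Mixture-of-Experts layer (again our own illustration, not the post's code): a router scores each token and sends it to its single best expert (top-1 routing), so every example touches only a fraction of the layer's parameters.

```python
# Toy sketch of a mixture-of-experts layer with top-1 routing (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
n_experts, n_tokens, d = 4, 10, 16
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]  # per-expert weights
router_w = rng.normal(size=(d, n_experts))                     # router weights
x = rng.normal(size=(n_tokens, d))                             # token activations

choice = (x @ router_w).argmax(axis=1)      # top-1 expert per token

y = np.empty_like(x)
for e in range(n_experts):
    mask = choice == e
    if mask.any():
        # Only this expert's weights are applied to the tokens routed to it.
        y[mask] = np.maximum(x[mask] @ experts[e], 0.0)
```
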
It also discusses a bunch of memory-saving tricks: