This part summarises research on a) scaling laws and other interesting observations and findings about scaling up models, and b) engineering solutions for scaling up.
Scaling Law
Data Governance
Engineering Scaling-up
Here, we give a summary of the following blog post by OpenAI:
This article introduces four main parallelism techniques:
- Data parallelism — run different subsets of a batch on different GPUs
- Pipeline parallelism — run different layers of the model on different GPUs
- Tensor parallelism — split the math of a single operation, such as a matrix multiplication, across GPUs (see the sketch after this list)
- Mixture-of-Experts — process each example by only a fraction of a layer
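To make the tensor-parallel idea concrete, below is a minimal single-process sketch in PyTorch: the weight of one linear layer is split column-wise across two hypothetical workers, each worker multiplies the input by its own shard, and concatenating the shard outputs recovers the full-layer result. The shapes, the two-way split, and the variable names are illustrative assumptions rather than anything from the blog post; a real implementation (e.g. Megatron-style tensor parallelism) would place the shards on different GPUs and use collective communication to gather the outputs.

```python
# Single-process illustration of tensor (model) parallelism:
# a column-wise split of one linear layer across two simulated workers.
import torch

torch.manual_seed(0)

batch, d_in, d_out = 4, 8, 6
x = torch.randn(batch, d_in)   # the same activations go to every shard
W = torch.randn(d_in, d_out)   # full weight of a single linear layer

# Split the weight column-wise into two shards (one per hypothetical GPU).
W_shard_0, W_shard_1 = W.chunk(2, dim=1)

# Each "GPU" multiplies the full input by only its shard of the weight.
y_shard_0 = x @ W_shard_0      # shape: (batch, d_out // 2)
y_shard_1 = x @ W_shard_1      # shape: (batch, d_out // 2)

# Gathering (concatenating) the shard outputs recovers the full result.
y_parallel = torch.cat([y_shard_0, y_shard_1], dim=1)
y_reference = x @ W

print(torch.allclose(y_parallel, y_reference))  # True
```

The same splitting trick applies to the large matrix multiplications inside attention and MLP blocks; the cost moved off each device is paid for with an extra gather of the partial outputs.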
It also discusses a bunch of memory-saving tricks: