
patrickstar's Introduction

PatrickStar: Parallel Training of Large Language Models via Chunk-based Memory Management

logo

Recent Progress

See CHANGE_LOG.md.

Meeting PatrickStar

Pre-Trained Models (PTMs) are becoming the hotspot of both NLP research and industry applications. However, training PTMs requires enormous hardware resources, making it accessible to only a small portion of people in the AI community. Now, PatrickStar will make PTM training available to everyone!

Out-of-memory (OOM) errors are the nightmare of every engineer training PTMs. To prevent them, we often have to introduce more GPUs just to store the model params. PatrickStar brings a better solution to this problem. With heterogeneous training (which DeepSpeed Zero Stage 3 also uses), PatrickStar can fully use both CPU and GPU memory, so you can use fewer GPUs to train larger models.

System Design

The idea of PatrickStar is as follows. The non-model data (mainly activations) varies during training, but current heterogeneous training solutions split the model data statically between the CPU and GPU. To make better use of the GPU, PatrickStar proposes dynamic memory scheduling with the help of a chunk-based memory management module. The memory management of PatrickStar supports offloading everything except the part of the model currently being computed to the CPU, to save GPU memory. In addition, chunk-based memory management is efficient for collective communication when scaling to multiple GPUs. See the paper and this doc for the ideas behind PatrickStar.
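
To make the idea concrete, here is a toy sketch of chunk-based management (Chunk and ChunkManager below are illustrative names, not PatrickStar's actual API): parameters are packed back to back into fixed-size chunks, and a whole chunk moves between CPU and GPU in one copy right before and after the submodule that needs it.

import math
import torch

class Chunk:
    """A fixed-size flat buffer that stores many parameter tensors back to back."""
    def __init__(self, numel, dtype=torch.half):
        self.payload = torch.zeros(numel, dtype=dtype, device="cpu")
        self.offset = 0

    def append(self, tensor):
        """Copy `tensor` into the chunk; return (start, shape) so it can be viewed later."""
        n = tensor.numel()
        assert self.offset + n <= self.payload.numel(), "chunk is full"
        self.payload[self.offset:self.offset + n].copy_(tensor.detach().flatten())
        location = (self.offset, tensor.shape)
        self.offset += n
        return location

    def view(self, location):
        start, shape = location
        return self.payload[start:start + math.prod(shape)].view(shape)

    def move(self, device):
        """Move the whole chunk in one copy; far cheaper than moving tensors one by one."""
        self.payload = self.payload.to(device)

class ChunkManager:
    """Keep only the chunks needed by the current submodule on the GPU."""
    def __init__(self, chunk_numel=64 * 1024 * 1024):
        self.chunk_numel = chunk_numel
        self.chunks = [Chunk(chunk_numel)]

    def register(self, tensor):
        if self.chunks[-1].offset + tensor.numel() > self.chunk_numel:
            self.chunks.append(Chunk(self.chunk_numel))
        return self.chunks[-1], self.chunks[-1].append(tensor)

    def access(self, chunk):
        chunk.move("cuda")   # bring the working set in before compute

    def release(self, chunk):
        chunk.move("cpu")    # offload everything but the current computing part

In the real system, a parameter's .data would be re-pointed to a view of the chunk payload whenever its chunk is on the computing device; the toy code above only shows the packing and the one-copy movement.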

Results

In experiments, PatrickStar v0.4.3 is able to train an 18-billion-parameter (18B) model on a WeChat data center node with 8x Tesla V100 GPUs and 240 GB of GPU memory, whose network topology is like this. That model scale is over twice as large as DeepSpeed's, and PatrickStar's performance is also better for models of the same size. pstar denotes PatrickStar v0.4.3; deeps denotes DeepSpeed v0.4.3 running the official DeepSpeed example with ZeRO stage 3 and activation optimizations enabled by default.

alt perf

We also evaluated PatrickStar v0.4.3 on a single node of an A100 SuperPod. It can train a 68B model on 8x A100 GPUs with 1 TB of CPU memory, over 6x larger than DeepSpeed v0.5.7. Besides the larger model scale, PatrickStar is also far more efficient than DeepSpeed. The benchmark scripts are here.

alt perf

Detailed benchmark results on the WeChat AI data center and NVIDIA SuperPod are posted on this Google Doc.

Scaling PatrickStar to multiple machines (nodes) on SuperPod, we succeeded in training GPT3-175B on 32 GPUs. As far as we know, this is the first work to run GPT3 on such a small GPU cluster. Microsoft used 10,000 V100s to pretrain GPT3. Now you can fine-tune it, or even pretrain your own model, on 32 A100 GPUs. Amazing!

alt perf

We've also trained the CLUE-GPT2 model with PatrickStar; the loss and accuracy curves are shown below:

CLUE-GPT2

Installation

pip install .

Note that PatrickStar requires gcc version 7 or higher. You can also use NVIDIA NGC images; the following image has been tested:

docker pull nvcr.io/nvidia/pytorch:21.06-py3

Usage

PatrickStar is based on PyTorch, which makes it easy to migrate an existing PyTorch project. Here is an example of using PatrickStar:

from patrickstar.runtime import initialize_engine

config = {
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 0.001,
            "betas": (0.9, 0.999),
            "eps": 1e-6,
            "weight_decay": 0,
            "use_hybrid_adam": True,
        },
    },
    "fp16": {  # loss scaler params
        "enabled": True,
        "loss_scale": 0,
        "initial_scale_power": 2 ** 3,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1,
    },
    "default_chunk_size": 64 * 1024 * 1024,
    "release_after_init": True,
    "use_cpu_embedding": False,
    "client": {
        "mem_tracer": {
            # set to True (or wire it to your own flag, e.g. args.with_async_mem_monitor)
            # to sample memory usage in a background thread during warmup
            "use_async_mem_monitor": True,
        }
    },
}

def model_func():
    # MyModel is a derived class of torch.nn.Module
    return MyModel(...)

model, optimizer = initialize_engine(model_func=model_func, local_rank=0, config=config)

...

for data in dataloader:
    optimizer.zero_grad()

    loss = model(data)
    # note: call model.backward(loss) rather than loss.backward()
    model.backward(loss)
    optimizer.step()

We use the same config format as the DeepSpeed configuration JSON, which mainly includes params of the optimizer, the loss scaler, and some PatrickStar-specific configuration.

For a detailed explanation of the above example, please check the guide here.

For more examples, please check here.

A quick-start benchmark script is here. It runs with randomly generated data, so you do not need to prepare any real data. It also demonstrates all of the optimization techniques of PatrickStar. For more optimization tricks for running the benchmark, see Optimization Options.
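
Since the benchmark runs on random data, its input pipeline can be as simple as the following sketch (illustrative; VOCAB, SEQ_LEN and BATCH are made-up sizes, not the script's actual settings):

import torch

VOCAB, SEQ_LEN, BATCH = 50257, 1024, 4   # made-up sizes for illustration

def random_batches(num_batches, device="cuda"):
    """Yield random token ids so no real dataset needs to be prepared."""
    for _ in range(num_batches):
        input_ids = torch.randint(0, VOCAB, (BATCH, SEQ_LEN), device=device, dtype=torch.long)
        yield input_ids, input_ids.clone()   # (inputs, labels) for a causal LM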

License

BSD 3-Clause License

Cite Us

@article{fang2021patrickstar,
  title={PatrickStar: Parallel Training of Pre-trained Models via a Chunk-based Memory Management},
  author={Fang, Jiarui and Yu, Yang and Zhu, Zilin and Li, Shenggui and You, Yang and Zhou, Jie},
  journal={arXiv preprint arXiv:2108.05818},
  year={2021}
}
@article{fang2022parallel,
  title={Parallel Training of Pre-Trained Models via Chunk-Based Dynamic Memory Management},
  author={Fang, Jiarui and Zhu, Zilin and Li, Shenggui and Su, Hui and Yu, Yang and Zhou, Jie and You, Yang},
  journal={IEEE Transactions on Parallel and Distributed Systems},
  volume={34},
  number={1},
  pages={304--315},
  year={2022},
  publisher={IEEE}
}

Contact Us

{jiaruifang, zilinzhu, josephyu}@tencent.com

Powered by WeChat AI Team, Tencent NLP Oteam.


patrickstar's Issues

[idea] Further reduce memory consumption by fusing FWD+BWD+ADAM

Here is an idea: we could actually shrink the memory footprint further.
We could keep only param fp32. During FWD, the param fp16 needed by a submodule (e.g. Linear) is allocated temporarily and filled by copying from param fp32; it is released as soon as the computation finishes.
During BWD, the submodule again converts from param fp32 when needed, produces grad fp16, and immediately runs the ADAM computation to update param fp32, so grad fp16 can also be thrown away.
The total memory consumption drops from 14M to 12M, i.e. equal to the size of the optimizer states (M is the number of parameters).
In other words, fuse FWD, BWD and ADAM.
A paper supports this idea:
OPTIMIZER FUSION: EFFICIENT TRAINING WITH BETTER LOCALITY AND PARALLELISM
https://arxiv.org/pdf/2104.00237.pdf
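
A single-layer toy sketch of this fusion (illustrative; it assumes a CUDA device, uses a hand-rolled Adam with made-up hyperparameters, and involves no chunk management): the gradient hook updates the fp32 master copy the moment the fp16 gradient is produced, refreshes the fp16 working weights, and the gradient never needs to be kept around.

import torch
import torch.nn as nn

# Toy setup: one fp16 working layer on the GPU, fp32 master weights + Adam states on the CPU.
device = "cuda"
model = nn.Linear(1024, 1024).half().to(device)
master = {n: p.detach().float().cpu().clone() for n, p in model.named_parameters()}
exp_avg = {n: torch.zeros_like(m) for n, m in master.items()}
exp_avg_sq = {n: torch.zeros_like(m) for n, m in master.items()}
step, lr, beta1, beta2, eps = 0, 1e-3, 0.9, 0.999, 1e-6

def fused_adam_hook(name, param):
    def hook(grad_fp16):
        # The moment grad fp16 is produced in BWD, update the fp32 master copy
        # and refresh the fp16 working copy; the gradient is no longer needed.
        g = grad_fp16.detach().float().cpu()
        m, v = exp_avg[name], exp_avg_sq[name]
        m.mul_(beta1).add_(g, alpha=1 - beta1)
        v.mul_(beta2).addcmul_(g, g, value=1 - beta2)
        bc1, bc2 = 1 - beta1 ** step, 1 - beta2 ** step
        master[name].addcdiv_(m / bc1, (v / bc2).sqrt().add_(eps), value=-lr)
        param.data.copy_(master[name])   # refresh the fp16 working weights
        # In a chunk-based setting, the grad fp16 chunk could be released here.
    return hook

for name, p in model.named_parameters():
    p.register_hook(fused_adam_hook(name, p))

x = torch.randn(8, 1024, dtype=torch.half, device=device)
for _ in range(3):
    step += 1
    model(x).float().pow(2).mean().backward()   # BWD triggers the fused Adam updates
    model.zero_grad(set_to_none=True)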

Accelerate Chunk List Construction Speed.

The chunk list construction in PreprocessCtx (excluding the memory copies) is very time-consuming. It prevents us from quickly testing large models and slows down development iteration.

Analyze memory usage at the model-scale limit

env MODEL_NAME="GPT3_12B" CPU_EBD=0 CS=128 ACT_OFFLOAD=0 GPU_NUM=8 BS=8 AW=1 bash run_bert.sh
commit 6b4739a

From the observation, our chunk data usage is somewhat excessive; if chunk data used less memory, we should be able to run an even larger model.

Skipping ADAM in warmup affects the overall performance.

On a SuperPod node with 4x A100. The schema is:
log.GPT_DS_20B_gpu_4_cs_384_bs_8_cpueb_0_lightseq_0_offload_0_SP_0_AMM_1_MSC_1_CACHE_1
With the loss scaler disabled:

91.16946903509981 TFlops
CLIENT_fetch_remote_chunks_broadcast .. 0.008783340454101562, 0.045844479770245986 %
CHUNK_LIST_prepare_device ............. 1.2182786464691162, 6.358782407949941 %
CHUNK_allocate_payload_cuda ........... 2.8260343074798584, 14.750432744387322 %
CLIENT_fetch_remote_chunks ............ 3.865243434906006, 20.174565176495836 %
CLIENT_access_dist .................... 4.45051383972168, 23.229373011158142 %
CLIENT_release_dist ................... 0.5840051174163818, 3.04820369095599 %
chunk_cpu_gpu_move .................... 1.315248727798462, 6.864915917752167 %
FWD ................................... 3.430629253387451, 17.90610465665054 %
chunk_gpu_cpu_move .................... 1.211526870727539, 6.323541641863795 %
CHUNK_LIST_chunk_move ................. 1.2118439674377441, 6.325196722159518 %
CLIENT_release_dist_reduce ............ 0.004155874252319336, 0.021691507244168236 %
BWD ................................... 9.724532127380371, 50.757011949891286 %
ADAM_prepare_data_grad_copy ........... 0.7624289989471436, 3.9794837739847733 %
CLIENT_access ......................... 0.4744589328765869, 2.4764294477411446 %
ADAM_prepare_data ..................... 1.2415502071380615, 6.480247879758143 %
ADAM_compute .......................... 2.4682326316833496, 12.88289364880856 %
ADAM_param_fp32_to_fp16 ............... 2.1865973472595215, 11.41290359584114 %
CLIENT_release ........................ 0.02169013023376465, 0.11321122549126299 %
ADAM_release_data ..................... 0.024057865142822266, 0.1255695731730847 %
ADAM .................................. 6.003831148147583, 31.33688339345817 %
TOTAL ................................. 19.158992528915405
chunk_cpu_gpu_move: 165120.0 MB, 214 times, 1472599.3546247077 MB/s
chunk_gpu_cpu_move: 148992.0 MB, 194 times, 195513.34532889986 MB/s
ADAM_prepare_data_grad_copy: 49831.09375 MB, 525 times, 39540.773513608205 MB/s
ADAM_param_fp32_to_fp16: 99662.1875 MB, 525 times, 58856.407915226984 MB/s

With the loss scaler enabled:
69.97870338199793 TFlops
CLIENT_fetch_remote_chunks_broadcast .. 0.007257938385009766, 0.049356369721460944 %
CHUNK_LIST_prepare_device ............. 0.7660794258117676, 5.209592224489411 %
CHUNK_allocate_payload_cuda ........... 2.468181848526001, 16.784448888027352 %
CLIENT_fetch_remote_chunks ............ 3.210313558578491, 21.8311887637781 %
CLIENT_access_dist .................... 3.4126784801483154, 23.207336831979323 %
CLIENT_release_dist ................... 0.131638765335083, 0.8951869286978318 %
FWD ................................... 2.59273362159729, 17.63144193688909 %
chunk_gpu_cpu_move .................... 0.7620553970336914, 5.182227504426379 %
CHUNK_LIST_chunk_move ................. 0.7620978355407715, 5.182516100241714 %
chunk_cpu_gpu_move .................... 0.11212825775146484, 0.7625090559096998 %
CLIENT_release_dist_reduce ............ 0.0038404464721679688, 0.026116299962988396 %
BWD ................................... 6.16964864730835, 41.95564133156423 %
ADAM_prepare_data_grad_copy ........... 1.2602458000183105, 8.570086207136955 %
CLIENT_access ......................... 0.010293245315551758, 0.0699974558171156 %
ADAM_prepare_data ..................... 1.2735180854797363, 8.660342116395686 %
ADAM_compute .......................... 2.7817044258117676, 18.916505598856098 %
ADAM_param_fp32_to_fp16 ............... 1.6933107376098633, 11.5150703113443 %
CLIENT_release ........................ 0.010694026947021484, 0.07272290281474308 %
ADAM_release_data ..................... 0.011899471282958984, 0.0809203210300938 %
ADAM .................................. 5.942788362503052, 40.41291673154668 %
TOTAL ................................. 14.705170631408691

chunk_cpu_gpu_move: 195072.0 MB, 244 times, 148315.6728283042 MB/s
chunk_gpu_cpu_move: 179712.0 MB, 225 times, 148335.13341068564 MB/s
ADAM_prepare_data_grad_copy: 39864.875 MB, 420 times, 52286.67201149269 MB/s
ADAM_param_fp32_to_fp16: 79729.75 MB, 420 times, 36462.93182415404 MB/s

Error when installing under Python 3.6

pip install . --user hit the error below, because from __future__ import annotations is not supported on Python 3.6. Commenting out patrickstar/core/eviction_policy.py#L30 solved this; it seems the future import is not necessary?

    ERROR: Command errored out with exit status 1:
     command: /mnt/cache/share/spring/conda_envs/miniconda3/envs/s0.3.4/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-9pb26brm/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-9pb26brm/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-req-build-9pb26brm/pip-egg-info
         cwd: /tmp/pip-req-build-9pb26brm/
    Complete output (14 lines):
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-req-build-9pb26brm/setup.py", line 32, in <module>
        from patrickstar.ops.op_builder import CPUAdamBuilder
      File "/tmp/pip-req-build-9pb26brm/patrickstar/__init__.py", line 30, in <module>
        from .core import PatrickStarClient
      File "/tmp/pip-req-build-9pb26brm/patrickstar/core/__init__.py", line 31, in <module>
        from .chunk_list import ChunkList
      File "/tmp/pip-req-build-9pb26brm/patrickstar/core/chunk_list.py", line 43, in <module>
        from patrickstar.core.eviction_policy import ChunkEvictionPolicyBase
      File "/tmp/pip-req-build-9pb26brm/patrickstar/core/eviction_policy.py", line 30
        from __future__ import annotations
        ^
    SyntaxError: future feature annotations is not defined

Reorganize logic of manager.

For historical reasons, the Manager is a singleton that mixes together the Metronome, the training stage, and the memory tracer.
I am going to split these functions out of the Manager and make it a runtime tracker belonging to the Client.

More Accurate Memory Statistics Sampling

Currently, during training, we sample CUDA/CPU memory usage before and after each submodule (operator in the paper) computation. However, this cannot accurately capture the maximum memory consumption of a submodule, which can easily lead to OOM during the submodule's computation!
This issue advocates a more accurate sampling method. During the warmup iteration, we launch a thread that concurrently samples CPU and GPU memory usage every 0.01 s. In this way, we know the peak memory of a submodule more accurately.
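
A minimal sketch of such a concurrent sampler (illustrative, not the actual mem_tracer implementation; it assumes psutil is available):

import threading
import time
import psutil
import torch

class AsyncMemoryMonitor:
    """Sample CPU and GPU memory in a background thread during the warmup iteration."""
    def __init__(self, interval=0.01):
        self.interval = interval
        self.peak_cpu = 0
        self.peak_gpu = 0
        self._stop = threading.Event()
        self._thread = None

    def _run(self):
        proc = psutil.Process()
        while not self._stop.is_set():
            self.peak_cpu = max(self.peak_cpu, proc.memory_info().rss)
            if torch.cuda.is_available():
                self.peak_gpu = max(self.peak_gpu, torch.cuda.memory_allocated())
            time.sleep(self.interval)

    def start(self):
        self._stop.clear()
        self.peak_cpu = self.peak_gpu = 0
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def finish(self):
        self._stop.set()
        self._thread.join()
        return self.peak_cpu, self.peak_gpu

# usage around one submodule during warmup:
# monitor = AsyncMemoryMonitor(); monitor.start(); submodule(x); peak_cpu, peak_gpu = monitor.finish()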

Support managing buffers

Currently PatrickStar does not manage buffers at all. As a result, all buffers end up on the CPU, which causes runtime errors. There are several ways to manage buffers:

  1. If the total buffer size is small, buffers can stay outside the chunk management system.
    1.1 Put all buffers directly on the GPU. This may conflict with some CPU operations, e.g. embedding.
    1.2 Use hooks to manage buffers at runtime, treating buffers like torch-type params (see the sketch below).
  2. If the total buffer size is large, use a dedicated BUFFER chunk list; it is listed separately because buffers are not updated by gradients.
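
A minimal sketch of option 1.2 above (illustrative; it assumes a buffer is only needed during its owning submodule's forward pass):

def manage_buffers_with_hooks(module, compute_device="cuda"):
    """Keep buffers on the CPU and move them around each submodule's forward pass."""
    def pre_hook(mod, inputs):
        for name, buf in mod.named_buffers(recurse=False):
            setattr(mod, name, buf.to(compute_device, non_blocking=True))

    def post_hook(mod, inputs, output):
        for name, buf in mod.named_buffers(recurse=False):
            setattr(mod, name, buf.to("cpu"))

    for sub in module.modules():
        if next(sub.named_buffers(recurse=False), None) is not None:
            sub.register_forward_pre_hook(pre_hook)
            sub.register_forward_hook(post_hook)

# usage: manage_buffers_with_hooks(model) keeps e.g. BatchNorm running statistics
# on the CPU except while their owning module is computing.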

Memory-centric tiling

Memory-centric tiling (MCT) can split a model data tensor into pieces that do not need to be stored in contiguous memory. This helps reduce the chunk size.
DeepSpeed MCT

This technique is a trick and should not be implemented in PatrickStar's core functionality.
It is helpful for improving our benchmark results and should therefore be placed in the ./example directory.
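
A toy sketch of the tiling idea (illustrative; not DeepSpeed's TiledLinear implementation nor anything shipped in PatrickStar):

import torch
import torch.nn as nn

class TiledLinearSketch(nn.Module):
    """Split one big Linear into `tiles` smaller Linears along the output dimension,
    so each weight piece is small enough to fit in a chunk without contiguous storage."""
    def __init__(self, in_features, out_features, tiles=4):
        super().__init__()
        assert out_features % tiles == 0
        self.parts = nn.ModuleList(
            nn.Linear(in_features, out_features // tiles) for _ in range(tiles)
        )

    def forward(self, x):
        return torch.cat([part(x) for part in self.parts], dim=-1)

# e.g. replace a 12288 -> 4 * 12288 FFN projection with TiledLinearSketch(12288, 4 * 12288, tiles=8)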

TODO before release

  1. Analyze why models larger than 12B fail to run, and give a theoretical analysis of the framework's model-scale limit.
    At 12B, a pin-memory allocation error appears, which we suspect is caused by insufficient CPU memory. We would like an accurate CPU memory profiler that reports CPU memory usage.
  2. Rerun the benchmark.
  3. Support sharing parameters among different layers of a model.

Compile extensions during installation instead of JIT

We ran into a strange problem in production: the business docker image does not allow writing the compiled results (all kinds of errors are reported), so we need to compile the corresponding .so files at installation time. That way all write operations are finished when the image is built, and nothing goes wrong at runtime.

Search the best chunk size.

Chunk size is a critical hyperparameter in PatrickStar. An appropriate chunk size can

  1. reduce fragmentation within chunks;
  2. improve memory utilization.

We intend to develop a script that chooses the best chunk size before the actual training starts.
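
A brute-force sketch of what such a script might do (illustrative; it only models greedy packing of parameter tensors and ignores PatrickStar's separate fp16/fp32/optimizer-state chunk groups):

def chunk_waste(param_numels, chunk_numel):
    """Total padding wasted when packing tensors greedily into fixed-size chunks."""
    assert chunk_numel >= max(param_numels), "every tensor must fit in one chunk"
    used, chunks = 0, 1
    for n in param_numels:
        if used + n > chunk_numel:   # tensor does not fit, open a new chunk
            chunks += 1
            used = 0
        used += n
    return chunks * chunk_numel - sum(param_numels)

def search_chunk_size(param_numels, candidates):
    """Pick the candidate chunk size that wastes the least memory."""
    return min(candidates, key=lambda cs: chunk_waste(param_numels, cs))

# e.g. search_chunk_size([p.numel() for p in model.parameters()],
#                        [32 * 1024 * 1024, 64 * 1024 * 1024, 128 * 1024 * 1024])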

Support communication config before training

Currently, training starts whether or not the configs of the two nodes are the same. This may cause weird results during benchmarking. We should consider communicating the config among nodes to make sure they are running the same program.
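
A minimal sketch of such a consistency check (illustrative; it assumes torch.distributed is already initialized and the config is JSON-serializable):

import hashlib
import json
import torch.distributed as dist

def assert_same_config(config, rank):
    """Broadcast rank 0's config hash and abort if any node runs a different config."""
    digest = hashlib.md5(json.dumps(config, sort_keys=True, default=str).encode()).hexdigest()
    obj = [digest if rank == 0 else None]
    dist.broadcast_object_list(obj, src=0)
    assert obj[0] == digest, f"rank {rank} has a different config than rank 0"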

CPU Embedding

Because the embedding parameters are much larger than those of other layers, we do not hand the embedding params over to chunk management, and we pin their computation to the CPU.
Before each computation, a hook copies the input from GPU to CPU; after the CPU embedding layers finish, the output activations are copied back to the GPU for the subsequent computation.
However, some PyTorch versions do not support CPU embedding computation in torch.half (e.g. 1.4.0+cu100 does not, while 1.7.1+cu110 does).
Currently the CPU embedding also keeps two copies of the parameters, param fp16 and param fp32, but param fp16 is actually stored as torch.float and used for FWD and BWD, while param fp32 is used for ADAM. Moreover, every process stores the full parameters.
This wastes a huge amount of memory: in fact only one torch.float copy of the params is needed, and it could be distributed across processes in a model-parallel fashion.
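
A minimal sketch of the CPU embedding idea described above (illustrative toy code, not PatrickStar's implementation behind the use_cpu_embedding option):

import torch
import torch.nn as nn

class CPUEmbeddingSketch(nn.Module):
    """Keep the large embedding weight on the CPU in torch.float and run the lookup there;
    only the input ids and the resulting activations travel between CPU and GPU."""
    def __init__(self, num_embeddings, embedding_dim):
        super().__init__()
        self.embed = nn.Embedding(num_embeddings, embedding_dim)  # stays on CPU, fp32

    def forward(self, input_ids):
        out_device = input_ids.device
        out = self.embed(input_ids.cpu())      # lookup on CPU in fp32
        return out.to(out_device).half()       # ship activations back to the GPU as fp16

A hook-based variant could instead wrap an existing nn.Embedding and move inputs and outputs in forward pre/post hooks.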

Support both dynamic model data partition and static model data partition.

Two partitions for model data.

  1. static: DeepSpeed Zero-Offload is one kind of static partition, in which the param fp16 tensors live on the GPU and the rest of the model data lives on the CPU. Generalized to broader scenarios, a fixed amount of GPU memory is reserved for model data; the size is assigned in the config.
  2. dynamic: a runtime sampling method collects non-model data statistics, and based on these stats the model data is dynamically placed in GPU memory. However, the sampling may not be accurate. Designing an accurate sampler is an interesting topic (we are working on this).

We are going to support both static and dynamic partitions.

Optimize chunk allocate and release

Running a 40B model on an 8x A100 SuperPod node, the timing details are as follows.
allocate payload takes too much time. This comes from frequently creating and deleting chunks for the communication buffer.
If that is true, we should cache (world_size - 1) chunks in GPU memory (a sketch of such a cache follows the profile below).

Step 4 elaspe 63.65251803398132 s, 41.79747933716272 Tflops
CHUNK_LIST_prepare_device ............. 69.95204448699951, 16.157479393621628 %
CHUNK_allocate_payload ................ 170.51438927650452, 39.38530676631084 %
CLIENT_access ......................... 89.209885597229, 20.605643463536712 %
CLIENT_release ........................ 2.333692789077759, 0.5390348977945169 %
chunk_cpu_gpu_move .................... 123.20745587348938, 28.458380938192935 %
CLIENT_fetch_remote_chunks_broadcast .. 2.685667037963867, 0.620333689204676 %
CLIENT_fetch_remote_chunks ............ 130.1259322166443, 30.056406267825142 %
CLIENT_access_dist .................... 263.52323508262634, 60.86843167792419 %
CLIENT_release_dist ................... 18.868732452392578, 4.358287995999212 %
chunk_gpu_cpu_move .................... 76.71733140945435, 17.720121126871 %
CHUNK_LIST_chunk_move ................. 76.74013209342957, 17.725387614565417 %
FWD ................................... 79.00622844696045, 18.24880913004294 %
CLIENT_release_dist_reduce ............ 0.033098697662353516, 0.007645116441658537 %
HOOK_torch_allreduce .................. 5.245239019393921, 1.2115420212804222 %
BWD ................................... 238.41187143325806, 55.068224640575735 %
ADAM_prepare_data_grad_copy ........... 11.999637842178345, 2.771668828091965 %
ADAM_prepare_data ..................... 65.50793719291687, 15.130982276149549 %
ADAM_compute .......................... 27.286763429641724, 6.302679515178464 %
ADAM_param_fp32_to_fp16 ............... 21.466453075408936, 4.958307877399151 %
ADAM_release_data ..................... 0.21362972259521484, 0.04934405943401391 %
ADAM .................................. 115.520991563797, 26.68296622938133 %
CHUNK_LIST_make_room .................. 7.30447244644165, 1.6871824676488498 %
TOTAL ................................. 432.9390914440155
------------- DATA MOVE RESULTS --------------
chunk_cpu_gpu_move: 903168.0 MB, 1176 times, 7330.465462474785 MB/s
chunk_gpu_cpu_move: 962304.0 MB, 1253 times, 12543.50200040208 MB/s
ADAM_prepare_data_grad_copy: 99757.8125 MB, 1045 times, 8313.401938628052 MB/s
ADAM_param_fp32_to_fp16: 199515.625 MB, 1045 times, 9294.29861091289 MB/s
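
A minimal sketch of the chunk cache mentioned above (illustrative; ChunkPayloadCache is a made-up name, not PatrickStar's API):

import torch

class ChunkPayloadCache:
    """Reuse (world_size - 1) pre-allocated GPU buffers for remote chunks instead of
    allocating and freeing a fresh payload for every fetch."""
    def __init__(self, chunk_numel, world_size, dtype=torch.half, device="cuda"):
        self.free = [torch.empty(chunk_numel, dtype=dtype, device=device)
                     for _ in range(world_size - 1)]

    def acquire(self):
        return self.free.pop() if self.free else None   # None -> fall back to a fresh allocation

    def release(self, payload):
        self.free.append(payload)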

Compatibility with DeepSpeed

PatrickStar's mission is to democratize PTM training through open source, so we must keep the interface simple enough and make sure the accuracy matches a widely recognized training framework. The widely recognized candidates are:

  1. PyTorch ZeroRedundancyOptimizer
  2. the Zero family in DeepSpeed

We could choose to merge into the DeepSpeed ecosystem, i.e. align accuracy with DeepSpeed and port PatrickStar into DeepSpeed with as few changes as possible, for the following reasons:

  1. DeepSpeed is the baseline for PTM training that cannot be bypassed; it has several top-conference publications, and users looking to train PTMs will try DeepSpeed first.
  2. DeepSpeed is developed by a large team inside Microsoft and covers not only parallel training but also quantization/sparsity, inference acceleration and other tooling. Joining this ecosystem means we can also benefit from those advances and reuse DeepSpeed's components.

September Work Plan

PatrickStar's core functionality is now in good enough shape to open source. In September we need to focus on how to maximize the impact of the open-source release, mainly in two respects: application results and performance results.

Application results

  1. Release a large language model trained by ourselves together with PatrickStar; this does not depend on any business team and is fully under our own control.
    #97
  2. Work with the PRC algorithm colleagues to improve training scenarios such as longformer/xlnet/bert. These collaborations help broaden PatrickStar's adoption, but the applications may not involve models that are large enough.
  3. Invite some external users to try it out and form a feedback loop.
    #57

Performance results

  1. Keep improving the benchmark numbers against DeepSpeed and Megatron.
    Subtasks include a memory profiler, which can help us profile the current performance bottlenecks.

Do we really need model parallelism (MP)?

The MP trend was introduced into PTM training by Megatron-LM, which implements model partitioning by inserting customized collective communication operations into the transformer implementation.
Model parallelism has many drawbacks:

  1. Both FWD and BWD involve heavy global communication of activations, with volume proportional to the batch size. Not only is the communication volume larger than DP's, it also limits the batch size and therefore the computational workload of MP training, hurting compute performance (larger batches are more compute-efficient).
  2. MP requires customized changes to the model definition code. That is why the DeepSpeed examples are also built on top of Megatron-LM. Some works try to simplify these changes, e.g. Mesh-TensorFlow and Alibaba's Whale; PyTorch seems to have no comparable work. From the perspective of chasing benchmark numbers this is fine, but from a usability perspective, algorithm engineers will not accept it, because the inference-side code still has to convert the custom parallel operators back to serial PyTorch.
  3. Under combinations of HP (heterogeneous parallelism), MP, PP and DP, the use of MP is already very limited and tends to be replaced. DeepSpeed schedules MP inside a node, with PP and DP across nodes. The introduction of HP+DP further breaks the GPU memory wall, and MP's main advantage is being taken over by HP and ZeroDP; it is not even clear whether MP will keep being used inside a node.

MP and PatrickStar

In PatrickStar, the peak GPU memory consumption is related to the chunk size. Even without heterogeneous storage, if all chunks are kept on the GPU, the model data per GPU is 1/N of the original, similar to the cost of MP. It is enough for PatrickStar to be compatible with PP; MP compatibility is not needed.
Zero-Offload used to support MP, which is strange. Reading the code, I think the reason is that Zero3's communication uses a very poor design: it needs to temporarily allocate a GPU buffer of size world_size * tensor_numel, and with prefetching, several such buffers may exist at the same time. For layers with very large parameters, such as the embedding layer, this can blow up the memory, so MP is needed to reduce the per-process tensor_numel.

Current limitations of chunk reuse

The current chunk reuse scheme reduces the overall memory footprint from DeepSpeed's 18M to 14M (M is the number of parameters). However, the current PatrickStar implementation has a limitation: it designs the reuse plan statically. Before training starts, it decides that the chunk memory holding param fp16 will be reused by grad fp16. This relies on the assumption that a parameter is not updated twice during BWD. That holds for BERT and GPT, but does not work for LSTM and seq2seq transformers.

For the latter, we could switch to a dynamic reuse scheme: during BWD, monitor the free slots inside chunks in real time (param fp16 leaves gaps in chunks once it is no longer needed and can be released) and allocate grad fp16 memory in those gaps.
However, there is currently little demand for very large LSTM or seq2seq models, so we can shelve this requirement until it becomes necessary.

Open-source roadmap

The following features need to be completed before open-sourcing internally:

  • Refactor the PatrickStar engine so that the ChunkMgr, Chunk-Tensor-Index, Client and other modules are less coupled; ideally each module can be used on its own.
  • Align convergence: add loss scaling, grad clipping, etc., so that training yields convergence curves similar to PyTorch.
  • Integrate with TencentPretrain.
  • Checkpointing:
    • model checkpoint, single node
    • model checkpoint, multi-node
    • optimizer checkpoint, single node
    • optimizer checkpoint, multi-node
  • Manage buffers
  • Profiler
  • Translate comments and documentation

New feature development

  • Test longformer, GPT and other huggingface models with PatrickStar. In earlier tests, DeepSpeed could not support longformer.
    • single-GPU and multi-GPU BERT and GPT, with load from pretrained
    • longformer
  • Multi-node parallelism
  • Pipeline parallelism support

Exploratory features

  • Model parallelism
  • Can we also manage the memory allocation of activations, and then use some optimizations (offloading or reuse) to reduce their memory footprint?

Runtime error

When I try to run train_simple_net.py from the provided examples, I get an error:
Traceback (most recent call last):
File "/home/somnus/Learn_deeplearing/PatrickStar-master/examples/train_simple_net.py", line 90, in <module>
model, optim = initialize_engine(model_func=model_func, local_rank=0, config=config)
File "/home/somnus/Learn_deeplearing/PatrickStar-master/patrickstar/runtime/__init__.py", line 70, in initialize_engine
client = PatrickStarClient(
File "/home/somnus/Learn_deeplearing/PatrickStar-master/patrickstar/core/client.py", line 53, in __init__
tracer_config = config.get("mem_tracer", None)
AttributeError: 'NoneType' object has no attribute 'get'
I have not modified the code at all. What could be the cause?

C++ Adam speed

Performance results from Aug 10:
log.GPT2small_gpu_1_cs_64_bs_128_cpueb_1_margin_0.8_warmup_0.2_gpu_0.8_adamcvt_1

2021-08-10:14:34:53,509 INFO [memory_monitor.py:65] CPU Virtual Memory: used = 15.08 GB, percent = 96.6%
605 2021-08-10:14:34:53,509 INFO [test_bert.py:223] ckp True fp16 True ps True: step elapse 5.177955627441406 sec/iter, 18.463766371092152 Tflops
606 2021-08-10:14:34:53,509 INFO [test_bert.py:225] model 0.72940493
607 2021-08-10:14:34:53,509 INFO [global_timer.py:45] *********** PROFILE RESULTS *************
608 2021-08-10:14:34:53,509 INFO [global_timer.py:50] CHUNK_LIST_prepare_device, 0, 0.0 %
609 2021-08-10:14:34:53,509 INFO [global_timer.py:50] CHUNK_allocate_payload, 0, 0.0 %
610 2021-08-10:14:34:53,509 INFO [global_timer.py:50] CLIENT_access, 0.019408226013183594, 0.338427821424322 %
611 2021-08-10:14:34:53,509 INFO [global_timer.py:50] CLIENT_release, 0.014924049377441406, 0.2602357121256555 %
612 2021-08-10:14:34:53,509 INFO [global_timer.py:50] chunk_cpu_gpu_move, 0, 0.0 %
613 2021-08-10:14:34:53,509 INFO [global_timer.py:50] CLIENT_access_dist, 0.03873419761657715, 0.6754213447995139 %
614 2021-08-10:14:34:53,509 INFO [global_timer.py:50] CLIENT_release_dist, 0.3606679439544678, 6.289089298897653 %
615 2021-08-10:14:34:53,509 INFO [global_timer.py:50] chunk_gpu_cpu_move, 0, 0.0 %
616 2021-08-10:14:34:53,509 INFO [global_timer.py:50] CHUNK_LIST_chunk_move, 0, 0.0 %
617 2021-08-10:14:34:53,509 INFO [global_timer.py:50] FWD, 0.28232502937316895, 4.9229973187357 %
618 2021-08-10:14:34:53,509 INFO [global_timer.py:50] BWD, 2.9886157512664795, 52.1135067722565 %
619 2021-08-10:14:34:53,509 INFO [global_timer.py:50] ADAM_prepare_data_fp16_grad_to_fp32_grad_copy, 0.2039637565612793, 3.5565852198787224 %
620 2021-08-10:14:34:53,509 INFO [global_timer.py:50] ADAM_prepare_data, 0.22702884674072266, 3.958779022397416 %
621 2021-08-10:14:34:53,509 INFO [global_timer.py:50] ADAM_compute, 0.013135433197021484, 0.2290470049819615 %
622 2021-08-10:14:34:53,509 INFO [global_timer.py:50] ADAM_param_fp32_to_fp16, 0.5844182968139648, 10.190700111226695 %
623 2021-08-10:14:34:53,509 INFO [global_timer.py:50] ADAM_release_data, 0.016661882400512695, 0.29053889612597344 %
624 2021-08-10:14:34:53,509 INFO [global_timer.py:50] ADAM, 0.9849364757537842, 17.174671477149886 %
625 2021-08-10:14:34:53,509 INFO [global_timer.py:76] *********** DATA MOVE RESULTS *************
626 2021-08-10:14:34:53,509 INFO [global_timer.py:86] chunk_cpu_gpu_move: 0.0 MB
627 2021-08-10:14:34:53,509 INFO [global_timer.py:86] chunk_gpu_cpu_move: 0.0 MB
628 2021-08-10:14:34:53,509 INFO [global_timer.py:83] ADAM_prepare_data_fp16_grad_to_fp32_grad_copy: 2782.4589920043945 MB, 393 times, 13641.92854120348 MB/s
629 2021-08-10:14:34:53,509 INFO [global_timer.py:83] ADAM_param_fp32_to_fp16: 2782.4589920043945 MB, 393 times, 4761.0744002597885 MB/s

Add CI

We would like to have CI that runs the unit tests each time an MR is proposed to the develop or master branch. However, we currently have no idea how to find a GPU to run the unit tests on. Does anyone have ideas?

Support MoE

Mixture of Experts (MoE) is a popular PTM structure. We hope to support MoE training in PatrickStar.

So far, there are few MoE implementations in PyTorch, e.g. laekov/fastmoe, and none of them support huggingface. Therefore, we hope to:

  • Create a MoE submodule for huggingface.
  • Use CPU offloading in the MoE submodule.

Notice that we may not be able to put the experts in chunks, as they may be visited randomly.

Proposal: overlap NVMe read and write with computing.

Definitions: P = param fp16 (grad fp16); OS = optimizer state.
Assumption 1: the GPU cannot hold P; CPU+GPU can hold P; CPU+GPU cannot hold OS+P.
Assumption 2: within access_chunk, the CPU-GPU-NVMe communication cannot overlap with computation.

ADAM:
direction of access_chunk
GPU (P) -(P)-> CPU (P, OS) <-(OS)- NVMe (OS)
direction of offload_chunk
GPU (P) <-(P)- CPU (P, OS) -(OS)-> NVMe (OS)

Corollary 1: after the ADAM computation, the CPU is full.
Corollary 2: before FWD, the GPU is filled with P.

FWD:
direction of access_chunk
GPU (P) <-(P)- CPU (P, OS) <-(None)- NVMe (OS)
direction of offload_chunk
GPU (P) -(P)-> CPU (P, OS) -(OS)-> NVMe (OS)
The amount of P on the GPU decreases -> the amount of P on the CPU increases.

BWD:
direction of access_chunk
GPU (P) <-(P)- CPU (P, OS) <-(None)- NVMe (OS)
direction of offload_chunk
GPU (P) -(P)-> CPU (P, OS) -(None)-> NVMe (OS)
The amount of P on the GPU increases -> the amount of P on the CPU decreases.

Corollary 3: during FWD+BWD, OS needs to move to NVMe at chunk granularity (the OS part of the FWD offload_chunk direction above), and these moves are interleaved with the FWD computation.
Corollary 4: OS never appears on the GPU.
Corollary 5: part of OS always stays on the CPU; we call it OS_cpu.
Corollary 6: OS minus OS_cpu is OS_nvme.

Optimization:
After the ADAM computation, when FWD starts, asynchronously offload OS_nvme to NVMe. This offload overlaps with the FWD computation and reduces the fine-grained NVMe traffic in the FWD stage.
After FWD finishes and before BWD starts, asynchronously move OS_nvme from NVMe back to the CPU. This can overlap with the BWD computation and reduces the NVMe traffic in the ADAM stage.
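
A minimal sketch of the proposed overlap (illustrative; the /nvme path is made up and torch.save stands in for a real asynchronous NVMe I/O engine):

import threading
import torch

def async_offload_to_nvme(tensors, path_prefix):
    """Write the OS_nvme tensors out in a background thread, so the transfer overlaps
    with the FWD computation (the NVMe -> CPU prefetch before BWD works the same way)."""
    def _work():
        for i, t in enumerate(tensors):
            torch.save(t.cpu(), f"{path_prefix}_{i}.pt")
    worker = threading.Thread(target=_work, daemon=True)
    worker.start()
    return worker  # join() this handle before the tensors are needed again

# pseudo-usage:
# handle = async_offload_to_nvme(os_nvme_tensors, "/nvme/os_shard")  # right after ADAM
# run_forward()                                                      # overlaps with the writes
# handle.join()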

[perf] Align speed on small models

Currently, when running the BERT model in test_bert.py, there is a large speed gap between enabling and disabling PatrickStar. On a V100, the measured speed is 0.35 s/iter with PatrickStar enabled versus 0.13 s/iter with it disabled, and GPU utilization differs a lot as well. For a small model like this that fits entirely on the GPU, PatrickStar's speed should be close to native PyTorch.

It is possible that part of the gap comes from the many profilers currently enabled in test_bert.py.

Support partial chunk management

As we hope to support MoE in #187, and MoE is mainly of model parallel structure instead of data parallel, we need to support managing only part of the model with chunk.

There are several design choices that need to be made, including but not limited to:

  • Shall we use mixed precision training in the unmanaged part of the model?
  • How can we connect the backward of the unmanaged parts with the managed parts? E.g. if there are 3 parts in the model:
    class Net(nn.Module):
      def __init__(self, ...):
        super().__init__()
        self.A = SubNetA(...)  # managed by chunk
        self.B = SubNetB(...)  # not managed by chunk
        self.C = SubNetC(...)  # managed by chunk
    Then self.A and self.C need model.backward(loss), while self.B only needs loss.backward().

cc @feifeibear

Refactor PatrickStarEngine

The design inside the current engine is quite messy and needs refactoring.
The engine has the following members: the client (responsible for the access and release APIs), the chunk-tensor-mapper (ctm, a database that stores the tensor-chunk relationships), and the chunkmgr (responsible for managing chunks). These three modules are decoupled: every client access first resolves the address through the ctm, which may then trigger the mgr to return a tensor.

Having the client contain both the ctm and the chunkmgr, as it does now, is unreasonable.
Part of the ctm initialization happens during model initialization and part of it during client initialization.

Initialize the engine with the following logic:

chunk_tensor_mapper = ChunkTensorIndex()
chunkmgr = ChunkMgr()
# build the chunk-tensor mapping for param fp16 and param fp32,
# and initialize the memory of the corresponding chunks
with ps_model_init(ctm=chunk_tensor_mapper, chunkmgr=chunkmgr):
    model = model_func()
# from the param fp16 mapping, build the chunk-tensor mapping for momentum and variance,
# and initialize the corresponding memory
optimizer = PS_fp16_adam(chunk_tensor_mapper, chunkmgr=chunkmgr)
client = PSClient(chunk_tensor_mapper, chunkmgr)
model.register_hook(client)
return model, optimizer
