
gdgpt's People

Contributors

coincheung

gdgpt's Issues

Multi-node model training

Is this framework suitable for multi-node training of large models? And can a large model be partitioned into blocks that are assigned to different nodes for training? For example: training the ChatGLM3 model requires a single node with four 48 GB GPUs. Could I instead use multi-machine training to split the model across two nodes, each with four 24 GB GPUs?
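For reference, DeepSpeed's launcher can in principle place different pipeline stages on different nodes via a hostfile. A hypothetical setup for two 4-GPU nodes might look like this (the hostnames, script name, and config file are placeholders, not from this repo):

```shell
# hostfile: two nodes, four GPU slots each (hostnames are placeholders)
#   node1 slots=4
#   node2 slots=4

# Launch with the DeepSpeed runner; train.py and ds_config.json are
# hypothetical names for the training script and DeepSpeed config.
deepspeed --hostfile=hostfile --num_nodes=2 --num_gpus=4 \
    train.py --deepspeed_config ds_config.json
```

Note that each pipeline stage also needs memory for activations, gradients, and optimizer state, so two nodes of 4×24 GB are not automatically equivalent to one node of 4×48 GB.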

Questions about TiedLayerSpec

Do you know how to use TiedLayerSpec? I want to fine-tune Whisper large-v2 on multiple GPUs (single node). The embedding layer is used both before the transformer decoder and again after the final transformer layer. According to the documentation, the embedding layer should be wrapped in a TiedLayerSpec, but I don't understand how TiedLayerSpec works. After wrapping the embedding layer in a TiedLayerSpec, how does DeepSpeed reuse that layer at the end of the transformer decoder, and how should I implement things so that it does? There is very little documentation or explanation of TiedLayerSpec; I hope someone can help. Thank you!
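Not an answer from the repo author, but for context: in DeepSpeed's pipeline API, layer specs that share the same TiedLayerSpec key are built from the same spec, and the attribute named by `tied_weight_attr` (default `'weight'`) is kept synchronized across whichever stages own a copy. The sketch below shows the underlying weight-tying idea in plain PyTorch with made-up sizes; the class and attribute names are illustrative only:

```python
import torch
import torch.nn as nn

class TiedLMStub(nn.Module):
    """Minimal sketch of input/output embedding tying -- the effect
    TiedLayerSpec reproduces across pipeline stages."""
    def __init__(self, vocab_size=100, d_model=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.decoder = nn.Linear(d_model, d_model)  # stand-in for the decoder stack
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        self.lm_head.weight = self.embed.weight  # tie: one shared Parameter object

    def forward(self, tokens):
        h = self.decoder(self.embed(tokens))
        return self.lm_head(h)  # logits projected through the tied embedding

model = TiedLMStub()
assert model.lm_head.weight is model.embed.weight  # same tensor, not a copy
```

In a PipelineModule you would express the same thing as two TiedLayerSpec entries carrying the same key (e.g. `'embed'`), one at the start and one at the end of the layer list; DeepSpeed then keeps the tied weight's gradients consistent between the first and last stage.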

`ninja -v` failure leaves transformer_inference.so missing

Hi~
I hit the following error when running demo.py:

Traceback (most recent call last):
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1893, in _run_ninja_build
    subprocess.run(
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
    ......
ImportError: /mnt/petrelfs/klk/.cache/torch_extensions/py310_cu118/transformer_inference/transformer_inference.so: cannot open shared object file: No such file or directory

My initial read is that the `ninja -v` step is failing, so the shared object transformer_inference.so is never built.

I have already tried the various fixes suggested online for `Command '['ninja', '-v']' returned non-zero exit status 1`, such as installing or disabling the ninja package and downgrading PyTorch, but none of them resolved the problem.

My environment is as follows:

  • python==3.10.12
  • torch/cuda/deepspeed versions all match yours

Have you run into this problem before? If not, could you share your transformer_inference.so file? It should live at <user_path>/.cache/torch_extensions/pyXX_cuXX/transformer_inference.

Thanks!
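Not from the repo author, but a generic debugging sequence for this class of JIT-build failure might look like the following (the cache path matches the py310_cu118 directory in the traceback above):

```shell
EXT_DIR="$HOME/.cache/torch_extensions/py310_cu118/transformer_inference"

# Confirm ninja is on PATH and runnable
command -v ninja && ninja --version

# If the build dir exists, run ninja by hand there: it prints the actual
# compiler/linker error that the Python traceback swallows
[ -d "$EXT_DIR" ] && (cd "$EXT_DIR" && ninja -v)

# Clear the stale/partial build so PyTorch recompiles from scratch next run
rm -rf "$EXT_DIR"
```

The real cause is usually visible in ninja's own output (a missing CUDA header, a gcc/nvcc version mismatch, etc.), one level above the `CalledProcessError`.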

Can it do LoRA SFT?

Most people are limited by their GPU hardware, so LoRA is the only practical way to fine-tune. Does this project support LoRA SFT?
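For context, LoRA itself is easy to sketch independently of any framework: freeze the base weight W and learn a low-rank update BA, so only r·(d_in + d_out) parameters train per wrapped layer. A minimal PyTorch sketch (names and sizes are illustrative, not from this repo):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base Linear plus a trainable low-rank update:
    y = W x + (alpha / r) * B (A x).
    B is zero-initialized, so the wrapped layer starts out
    identical to the base layer."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # only A and B receive gradients
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.lora_A.t() @ self.lora_B.t())
```

In practice one would wrap the attention projection layers this way (or use a library such as peft) and train only the LoRA parameters; keeping the base weights frozen and optimizer-state-free is what makes fine-tuning fit on small GPUs.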

Thoughts after reading your ramblings

You are right: the programmer profession in China really is a mixed bag, and open source looks different from every angle; the details run deep. But at the root it always comes down to money. I love the open-source spirit too. As I see it, human progress depends on open source: only by sharing what we already know can those who come after us build further on that foundation.

One thing has to be admitted: programmers abroad really are stronger than those at home, if only because computing there got a head start of several decades. And in many projects, when I study the overall structure, the architects who designed the foundational layers strike me as geniuses.

But another thing must be admitted too: abroad the driver is technology; at home it is capital. Look back from today's AI boom to the cloud-computing and big-data craze of a few years ago. Big foreign companies open-sourced those technologies in the hope that they would be used to collect data properly, accumulating it for the AI era to come. Instead, big data and cloud computing became a money grab; nobody realized that collecting data was really preparation for future AI. In my own work I have seen datasets where even basic field design was missing: a record's timestamp, its source, its key fields, its category. None of that work was done.

As you said, everyone has to eat, and that is true. When it comes to staying alive there is no right or wrong. It is just that, for people who love the craft, it is not so romantic, not the kind of thing that makes your blood race, with all the joy packed into the single moment a problem finally yields.

Back to the topic of open source in China. I personally adapted the haystack framework to support Chinese, but I have never dared to publish it on GitHub, Zhihu, or CSDN. I have thought hard about this. From a big-picture view it would be a contribution to the whole domestic industry. But if I release it, the first to profit may well be the people who play with capital. And from my own point of view: would releasing it push my salary past 20K? At most I would collect some praise, and when the excitement fades I would still be the same migrant worker drifting in Beijing. I once hoped all technology would be open source, yet in the end I found I had become the one unwilling to open-source.

No agenda here; I am only stating my own view. Let us seek common ground while preserving differences; agreement and rebuttal are both welcome.

Written with feeling after reading the author's words.
