
visualglm-6b's People

Contributors

1049451037, bobo0810, dm-thu, duzx16, lykeven, mactavish91, sleepychord, tuteng0915

visualglm-6b's Issues

Feeding in a math exam paper outputs a pile of random numbers

Input image: https://picdl.sunbangyan.cn/2023/05/18/sue5dn.jpg
Input text: 描述这张图片。("Describe this image.")
Output: 20,36,48,57,69,76,87,95,104,113,122,131,140,149,158,167,176,185,194,203,212,221,230,239,248,257,266,275,284,293,302,311,320,329,338,347,356,365,374,383,392,401,410,419,428,437,446,455,464,473,482,491,500,509,518,527,536,545,554,563,572,581,590,599,608,617,626,635,644,653,662,671,680,689,698,707,716,725,734,743,752,761,770,779,788,797,806,815,824,833,842,851,860,869,878.

transformers version?

I have two conflicting problems, and each has a suggested fix: upgrade or downgrade transformers to either 4.26.1 or 4.27.1. The problem is that whichever version I choose, the other traceback shows up.

For this traceback, people say I should go for 4.27.1:

Traceback (most recent call last):
  File "C:\Users\xxx\AppData\Local\Programs\Python\Python38\lib\site-packages\gradio\routes.py", line 394, in run_predict
    output = await app.get_blocks().process_api(
  File "C:\Users\xxx\AppData\Local\Programs\Python\Python38\lib\site-packages\gradio\blocks.py", line 1075, in process_api
    result = await self.call_function(
  File "C:\Users\xxx\AppData\Local\Programs\Python\Python38\lib\site-packages\gradio\blocks.py", line 898, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "C:\Users\xxx\AppData\Local\Programs\Python\Python38\lib\site-packages\anyio\to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "C:\Users\xxx\AppData\Local\Programs\Python\Python38\lib\site-packages\anyio\_backends\_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "C:\Users\xxx\AppData\Local\Programs\Python\Python38\lib\site-packages\anyio\_backends\_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "C:\Users\xxx\AppData\Local\Programs\Python\Python38\lib\site-packages\gradio\utils.py", line 549, in async_iteration
    return next(iterator)
  File "web_demo_hf.py", line 63, in predict
    for response, history in model.stream_chat(tokenizer, image_path, input, history, max_length=max_length, top_p=top_p,
  File "C:\Users\xxx\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\autograd\grad_mode.py", line 43, in generator_context
    response = gen.send(None)
  File "C:\Users\xxx/.cache\huggingface\modules\transformers_modules\local\modeling_chatglm.py", line 1439, in stream_chat
    for outputs in self.stream_generate(**inputs, **gen_kwargs):
  File "C:\Users\xxx\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\autograd\grad_mode.py", line 43, in generator_context
    response = gen.send(None)
  File "C:\Users\xxx/.cache\huggingface\modules\transformers_modules\local\modeling_chatglm.py", line 1291, in stream_generate
    outputs = self(
  File "C:\Users\xxx\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\xxx/.cache\huggingface\modules\transformers_modules\local\modeling_chatglm.py", line 1469, in forward    return super().forward(
  File "C:\Users\xxx/.cache\huggingface\modules\transformers_modules\local\modeling_chatglm.py", line 1095, in forward    transformer_outputs = self.transformer(
  File "C:\Users\xxx\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\xxx/.cache\huggingface\modules\transformers_modules\local\modeling_chatglm.py", line 871, in forward
    logger.warning_once("Specify both input_ids and inputs_embeds at the same time, will use inputs_embeds")
AttributeError: 'Logger' object has no attribute 'warning_once'
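
A workaround some users apply (upgrading transformers to a release that defines warning_once is the cleaner fix): transformers itself attaches warning_once directly onto logging.Logger, so an equivalent shim near the top of web_demo_hf.py papers over the missing method. A hedged sketch, not an official fix:

import logging

# older transformers releases never patch warning_once onto logging.Logger,
# which is what the remote modeling_chatglm.py calls; alias it to plain warning
if not hasattr(logging.Logger, "warning_once"):
    logging.Logger.warning_once = logging.Logger.warning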

And for this one, people say I should go for 4.26.1:

Traceback (most recent call last):
  File "web_demo_hf.py", line 5, in <module>
    tokenizer = AutoTokenizer.from_pretrained("./vglm-6b", trust_remote_code=True)
  File "C:\Users\xxx\AppData\Local\Programs\Python\Python38\lib\site-packages\transformers\models\auto\tokenization_auto.py", line 663, in from_pretrained
    tokenizer_class = get_class_from_dynamic_module(
  File "C:\Users\xxx\AppData\Local\Programs\Python\Python38\lib\site-packages\transformers\dynamic_module_utils.py", line 399, in get_class_from_dynamic_module
    return get_class_in_module(class_name, final_module.replace(".py", ""))
  File "C:\Users\xxx\AppData\Local\Programs\Python\Python38\lib\site-packages\transformers\dynamic_module_utils.py", line 177, in get_class_in_module
    module = importlib.import_module(module_path)
  File "C:\Users\xxx\AppData\Local\Programs\Python\Python38\lib\importlib\__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 961, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 961, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 973, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'transformers_modules.'

Does the model use left-padding or right-padding?

Hello, may I ask whether VisualGLM-6B uses left-padding or right-padding during training? When I train a reward model with VisualGLM-6B, I hit the error "assert divergence_ind > 0"; printing shows divergence_ind is [0]. I looked this up on the deepspeed side, and it says the assertion is caused by the model's padding, so I'd like to confirm: does VisualGLM-6B use left-padding or right-padding?

Looking forward to your reply.
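
A quick way to check this empirically, assuming the released tokenizer exposes the standard Hugging Face padding_side attribute (a hedged sketch, not an authoritative answer):

from transformers import AutoTokenizer

# HF tokenizers carry the padding side they were configured with
tokenizer = AutoTokenizer.from_pretrained("THUDM/visualglm-6b", trust_remote_code=True)
print(tokenizer.padding_side)  # prints 'left' or 'right'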

The web app has no GPU-memory cleanup mechanism

After starting a web app with web_demo.py and uploading different images to ask questions, GPU memory usage only ever grows and never shrinks; eventually this can trigger a GPU out-of-memory error.
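
A possible mitigation, not a confirmed fix: explicitly release PyTorch's cached CUDA blocks after each request in web_demo.py. Note this only returns cached, unused memory; it will not help if tensors are genuinely kept alive (for example by an ever-growing chat history).

import gc
import torch

def free_gpu_memory():
    # drop dangling Python references first, then return cached CUDA blocks
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()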

Torch not compiled with CUDA enabled

Following the README, I tried two different environments, and both report this error. What is the problem, and how should I deal with it?
The error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\soft\python\lib\site-packages\torch\nn\modules\module.py", line 905, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "D:\soft\python\lib\site-packages\torch\nn\modules\module.py", line 797, in _apply
    module._apply(fn)
  File "D:\soft\python\lib\site-packages\torch\nn\modules\module.py", line 797, in _apply
    module._apply(fn)
  File "D:\soft\python\lib\site-packages\torch\nn\modules\module.py", line 820, in _apply
    param_applied = fn(param)
  File "D:\soft\python\lib\site-packages\torch\nn\modules\module.py", line 905, in <lambda>
    return self._apply(lambda t: t.cuda(device))
  File "D:\soft\python\lib\site-packages\torch\cuda\__init__.py", line 239, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled
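
This error means the installed torch wheel is a CPU-only build, not that VisualGLM itself is broken. A quick check and a hedged reinstall (the cu117 index below is only an example; pick the one matching your driver):

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
# e.g. "2.0.0+cpu False" confirms a CPU-only wheel; reinstall a CUDA build:
pip uninstall -y torch
pip install torch --index-url https://download.pytorch.org/whl/cu117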

Accepting multiple images as input

The InstructBLIP paper points out that even though they never trained or fine-tuned on video, splitting a video into frames and concatenating them directly as Q-Former input still gives some understanding ability on VideoQA test sets. Has VisualGLM run similar experiments?

Finetune error: RuntimeError: FIND was unable to find an engine to execute this computation

I downloaded the latest version of VisualGLM-6B and used the following commands to set up the development environment:

conda create -n glm python=3.9
conda activate glm
git clone https://github.com/THUDM/VisualGLM-6B.git
cd VisualGLM-6B
pip install -i https://mirrors.aliyun.com/pypi/simple/ -r requirements.txt
# edit finetune/finetune_visualglm.sh to set NUM_GPUS_PER_WORKER=2 which is the number of GPU in my server
unzip fewshot-data.zip
bash finetune/finetune_visualglm.sh

It reported the following errors:

Traceback (most recent call last):
  File "/media/zjkj/2t/yantao/VisualGLM-6B/finetune_visualglm.py", line 188, in <module>
    training_main(args, model_cls=model, forward_step_function=forward_step, create_dataset_function=create_dataset_function, collate_fn=data_collator)
  File "/media/zjkj/2t/yantao/software/anaconda3/envs/vglm/lib/python3.9/site-packages/sat/training/deepspeed_training.py", line 130, in training_main
    iteration, skipped = train(model, optimizer,
  File "/media/zjkj/2t/yantao/software/anaconda3/envs/vglm/lib/python3.9/site-packages/sat/training/deepspeed_training.py", line 274, in train
    lm_loss, skipped_iter, metrics = train_step(train_data_iterator,
  File "/media/zjkj/2t/yantao/software/anaconda3/envs/vglm/lib/python3.9/site-packages/sat/training/deepspeed_training.py", line 348, in train_step
    forward_ret = forward_step(data_iterator, model, args, timers, **kwargs)
  File "/media/zjkj/2t/yantao/VisualGLM-6B/finetune_visualglm.py", line 84, in forward_step
    logits = model(input_ids=tokens, image=image, pre_image=pre_image)[0]
  File "/media/zjkj/2t/yantao/software/anaconda3/envs/vglm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/media/zjkj/2t/yantao/software/anaconda3/envs/vglm/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/media/zjkj/2t/yantao/software/anaconda3/envs/vglm/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1724, in forward
    loss = self.module(*inputs, **kwargs)
  File "/media/zjkj/2t/yantao/software/anaconda3/envs/vglm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/media/zjkj/2t/yantao/software/anaconda3/envs/vglm/lib/python3.9/site-packages/sat/model/official/chatglm_model.py", line 192, in forward
    return super().forward(input_ids=input_ids, attention_mask=attention_mask, position_ids=position_ids, past_key_values=past_key_values, **kwargs)
  File "/media/zjkj/2t/yantao/software/anaconda3/envs/vglm/lib/python3.9/site-packages/sat/model/base_model.py", line 144, in forward
    return self.transformer(*args, **kwargs)
  File "/media/zjkj/2t/yantao/software/anaconda3/envs/vglm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/media/zjkj/2t/yantao/software/anaconda3/envs/vglm/lib/python3.9/site-packages/sat/model/transformer.py", line 451, in forward
    hidden_states = self.hooks['word_embedding_forward'](input_ids, output_cross_layer=output_cross_layer, **kw_args)
  File "/media/zjkj/2t/yantao/VisualGLM-6B/model/visualglm.py", line 20, in word_embedding_forward
    image_emb = self.model(**kw_args)
  File "/media/zjkj/2t/yantao/software/anaconda3/envs/vglm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/media/zjkj/2t/yantao/VisualGLM-6B/model/blip2.py", line 65, in forward
    enc = self.vit(image)[0]
  File "/media/zjkj/2t/yantao/software/anaconda3/envs/vglm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/media/zjkj/2t/yantao/VisualGLM-6B/model/blip2.py", line 29, in forward
    return super().forward(input_ids=input_ids, position_ids=None, attention_mask=attention_mask, image=image)
  File "/media/zjkj/2t/yantao/software/anaconda3/envs/vglm/lib/python3.9/site-packages/sat/model/base_model.py", line 144, in forward
    return self.transformer(*args, **kwargs)
  File "/media/zjkj/2t/yantao/software/anaconda3/envs/vglm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/media/zjkj/2t/yantao/software/anaconda3/envs/vglm/lib/python3.9/site-packages/sat/model/transformer.py", line 451, in forward
    hidden_states = self.hooks['word_embedding_forward'](input_ids, output_cross_layer=output_cross_layer, **kw_args)
  File "/media/zjkj/2t/yantao/software/anaconda3/envs/vglm/lib/python3.9/site-packages/sat/model/official/vit_model.py", line 55, in word_embedding_forward
    embeddings = self.proj(images)
  File "/media/zjkj/2t/yantao/software/anaconda3/envs/vglm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/media/zjkj/2t/yantao/software/anaconda3/envs/vglm/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 463, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/media/zjkj/2t/yantao/software/anaconda3/envs/vglm/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 459, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: FIND was unable to find an engine to execute this computation

Note that my PyTorch version is 2.0. Does VisualGLM-6B have a problem with PyTorch 2.0?
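
The stack ends in F.conv2d inside cuDNN's algorithm search, so this is more likely a CUDA/cuDNN/driver mismatch on the machine than PyTorch 2.0 itself. A hedged diagnostic, not a fix: print the versions torch was built against, and try the conv path with cuDNN disabled (slower, but it isolates the culprit):

import torch

print(torch.version.cuda, torch.backends.cudnn.version())
# diagnostic only: if finetuning runs with cuDNN disabled, the cuDNN
# install/version is the problem, not the model code
torch.backends.cudnn.enabled = False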

cuDNN error: CUDNN_STATUS_NOT_INITIALIZED

[2023-05-19 14:50:31,777] [INFO] [RANK 0] > successfully loaded /home/tony/.sat_models/visualglm-6b/1/mp_rank_00_model_states.pt
Welcome to the VisualGLM-6B model. Enter an image URL or local path to load an image, then keep typing to chat; type "clear" to start over and "stop" to exit.
Please enter an image path or URL (press Enter for text-only dialogue): https://img.caixin.com/2023-05-13/168394947268597_480_320.jpg
cuDNN error: CUDNN_STATUS_NOT_INITIALIZED

$ nvidia-smi
Fri May 19 14:52:34 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 4090 L... On | 00000000:01:00.0 Off | N/A |
| N/A 42C P8 7W / 150W| 1MiB / 16376MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0
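
CUDNN_STATUS_NOT_INITIALIZED usually points at a mismatch between the torch wheel's CUDA build and the local driver/toolkit, or at cuDNN failing its very first allocation. A hedged first step is to print what the wheel was built with and compare it against the nvidia-smi/nvcc output above:

python -c "import torch; print(torch.__version__, torch.version.cuda, torch.backends.cudnn.version(), torch.backends.cudnn.is_available())"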

Issue with API mode: unexpected keyword argument 'mems'

I'm trying to run the API mode. I copied the model data from Hugging Face and added the following to api.py:

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("data", trust_remote_code=True)
model = AutoModel.from_pretrained("data", trust_remote_code=True).half().cuda()
model = model.eval()

app = FastAPI()

(All HF model files are in the local ./data/ directory.)

After starting the server, I request it with curl:

curl -X POST -H "Content-Type: application/json" -d @temp.json http://127.0.0.1:8080

Here's the error I get when submitting a sample request:

INFO:     127.0.0.1:35234 - "POST / HTTP/1.1" 500 Internal Server Error
Internal Server Error
root@291f83eb6f53:/VisualGLM-6B# ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/h11_impl.py", line 428, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 78, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 276, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 79, in __call__
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/middleware/asyncexitstack.py", line 21, in __call__
    raise e
  File "/usr/local/lib/python3.10/dist-packages/fastapi/middleware/asyncexitstack.py", line 18, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 718, in __call__
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 66, in app
    response = await func(request)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 237, in app
    raw_response = await run_endpoint_function(
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 163, in run_endpoint_function
    return await dependant.call(**values)
  File "/VisualGLM-6B/api.py", line 36, in visual_glm
    answer, history, _ = chat(None, model, tokenizer, input_text, history=history, image=input_image, \
  File "/VisualGLM-6B/model/chat.py", line 141, in chat
    output = filling_sequence(
  File "/usr/local/lib/python3.10/dist-packages/sat/generation/autoregressive_sampling.py", line 108, in filling_sequence
    logits, *output_per_layers = model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
TypeError: ChatGLMForConditionalGenerationWithImage.forward() got an unexpected keyword argument 'mems'

What did I do wrong? How can I get the API up and running?

Thanks
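
One reading of the traceback: api.py drives the model through sat's filling_sequence, which passes sat-specific keyword arguments such as mems, so swapping the sat model for the Hugging Face AutoModel breaks its chat() helper. Either keep api.py's original sat loading, or build the endpoint on the HF streaming interface the way web_demo_hf.py does. A hedged sketch (the stream_chat signature is taken from the tracebacks elsewhere on this page):

def hf_chat(model, tokenizer, image_path, input_text, history=None):
    # drive the HF checkpoint via its own stream_chat instead of sat's chat(),
    # which injects kwargs like `mems` that the HF forward() does not accept
    history = history or []
    response = ""
    for response, history in model.stream_chat(tokenizer, image_path, input_text, history=history):
        pass  # keep only the final, fully generated response
    return response, history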

Three errors when running the stock finetune script

(visualGLM) root@iZbp1ewp3ew1qt4u8bdh0iZ:~/ai/VisualGLM-6B# bash finetune/finetune_visualglm.sh
finetune/finetune_visualglm.sh: line 5: $'\r': command not found
finetune/finetune_visualglm.sh: line 14: $'\r': command not found
finetune/finetune_visualglm.sh: line 19: $'\r': command not found
finetune/finetune_visualglm.sh: line 22: $'\r': command not found
finetune/finetune_visualglm.sh: line 23: $'\r': command not found
finetune/finetune_visualglm.sh: line 50: $'\r': command not found
finetune/finetune_visualglm.sh: line 51: $'\r': command not found
finetune/finetune_visualglm.sh: line 52: $'\r': command not found
--use_lorat \20 \ 8 \\s \ \dataset.json hostfile_single
[2023-05-23 17:22:18,395] [WARNING] [runner.py:191:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-05-23 17:22:18,412] [INFO] [runner.py:541:main] cmd = /usr/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --m --use_lorat 20 e 8 ns l /dataset.json--enable_each_rank_log=None finetune_visualglm.py
[2023-05-23 17:22:21,237] [INFO] [launch.py:222:main] 0 NCCL_IB_DISABLE=0
[2023-05-23 17:22:21,237] [INFO] [launch.py:222:main] 0 NCCL_DEBUG=info
[2023-05-23 17:22:21,237] [INFO] [launch.py:222:main] 0 NCCL_NET_GDR_LEVEL=2
[2023-05-23 17:22:21,237] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0]}
[2023-05-23 17:22:21,237] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-05-23 17:22:21,237] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-05-23 17:22:21,237] [INFO] [launch.py:247:main] dist_world_size=1
[2023-05-23 17:22:21,237] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0
usage: finetune_visualglm.py [-h] [--num-layers NUM_LAYERS] [--hidden-size HIDDEN_SIZE] [--num-attention-heads NUM_ATTENTION_HEADS]
[--vocab-size VOCAB_SIZE] [--max-sequence-length MAX_SEQUENCE_LENGTH] [--layernorm-order {post,pre,sandwich}]
[--inner-hidden-size INNER_HIDDEN_SIZE] [--hidden-size-per-attention-head HIDDEN_SIZE_PER_ATTENTION_HEAD]
[--model-parallel-size MODEL_PARALLEL_SIZE] [--skip-init] [--use-gpu-initialization]
[--layernorm-epsilon LAYERNORM_EPSILON] [--hidden-dropout HIDDEN_DROPOUT] [--attention-dropout ATTENTION_DROPOUT]
[--make-vocab-size-divisible-by MAKE_VOCAB_SIZE_DIVISIBLE_BY] [--experiment-name EXPERIMENT_NAME]
[--train-iters TRAIN_ITERS] [--batch-size BATCH_SIZE] [--lr LR] [--mode {pretrain,finetune,inference}] [--seed SEED]
[--zero-stage {0,1,2}] [--checkpoint-activations] [--checkpoint-num-layers CHECKPOINT_NUM_LAYERS] [--fp16] [--bf16]
[--gradient-accumulation-steps GRADIENT_ACCUMULATION_STEPS] [--epochs EPOCHS] [--log-interval LOG_INTERVAL]
[--summary-dir SUMMARY_DIR] [--save-args] [--lr-decay-iters LR_DECAY_ITERS]
[--lr-decay-style {constant,linear,cosine,exponential}] [--lr-decay-ratio LR_DECAY_RATIO] [--warmup WARMUP]
[--weight-decay WEIGHT_DECAY] [--save SAVE] [--load LOAD] [--save-interval SAVE_INTERVAL] [--no-save-rng]
[--no-load-rng] [--resume-dataloader] [--distributed-backend DISTRIBUTED_BACKEND] [--local_rank LOCAL_RANK]
[--exit-interval EXIT_INTERVAL] [--eval-batch-size EVAL_BATCH_SIZE] [--eval-iters EVAL_ITERS]
[--eval-interval EVAL_INTERVAL] [--strict-eval] [--train-data TRAIN_DATA [TRAIN_DATA ...]]
[--train-data-weights TRAIN_DATA_WEIGHTS [TRAIN_DATA_WEIGHTS ...]] [--iterable-dataset] [--valid-data [VALID_DATA ...]]
[--test-data [TEST_DATA ...]] [--split SPLIT] [--num-workers NUM_WORKERS] [--block-size BLOCK_SIZE]
[--tokenizer-type TOKENIZER_TYPE] [--temperature TEMPERATURE] [--top_p TOP_P] [--top_k TOP_K] [--num-beams NUM_BEAMS]
[--length-penalty LENGTH_PENALTY] [--no-repeat-ngram-size NO_REPEAT_NGRAM_SIZE] [--min-tgt-length MIN_TGT_LENGTH]
[--out-seq-length OUT_SEQ_LENGTH] [--input-source INPUT_SOURCE] [--output-path OUTPUT_PATH] [--with-id]
[--max-inference-batch-size MAX_INFERENCE_BATCH_SIZE] [--device DEVICE] [--deepspeed]
[--deepspeed_config DEEPSPEED_CONFIG] [--deepscale] [--deepscale_config DEEPSCALE_CONFIG] [--deepspeed_mpi]
--use_lorasualglm.py: error: unrecognized arguments:
[2023-05-23 17:22:26,242] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 128438
[2023-05-23 17:22:26,243] [ERROR] [launch.py:434:sigkill_handler] ['/usr/bin/python3', '-u', 'finetune_visualglm.py', '--local_rank=0', '\r', '--experiment-name', 'finetune-visualglm-6b\r', '\r', '--model-parallel-size', '1\r', '\r', '--mode', 'finetune', '\r', '--train-iters', '300', '\r', '--resume-dataloader', '\r', '--max_source_length', '64', '\r', '--max_target_length', '256', '\r', '--lora_rank', '10\r', '--pre_seq_len', '4\r', '\r', '--train-data', './fewshot-data/dataset.json\r', '\r', '--valid-data', './fewshot-data/dataset.json\r', '\r', '--distributed-backend', 'nccl', '\r', '--lr-decay-style', 'cosine', '\r', '--warmup', '.02', '\r', '--checkpoint-activations', '\r', '--save-interval', '300', '\r', '--eval-interval', '10000', '\r', '--save', './checkpoints', '\r', '--split', '1', '\r', '--eval-iters', '10', '\r', '--eval-batch-size', '8', '\r', '--zero-stage', '1', '\r', '--lr', '0.0001', '\r', '--batch-size', '20', '\r', '--skip-init', '\r', '--fp16', '\r', '--use_lora\r', '\r\r\r'] exits with return code = 2
finetune/finetune_visualglm.sh: line 56: $'\r': command not found
: invalid optione_visualglm.sh: line 57: set: +
set: usage: set [-abefhkmnptuvxBCHP] [-o option-name] [--] [arg ...]
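
The repeated $'\r': command not found messages, and the literal '\r' strings in the launcher's argument list above, indicate the shell script was saved with Windows CRLF line endings (for example after a checkout or edit on Windows). Converting it back to LF endings should clear all three errors:

dos2unix finetune/finetune_visualglm.sh
# or, without dos2unix installed:
sed -i 's/\r$//' finetune/finetune_visualglm.sh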

Linux environment is missing the sat package, and pip cannot install it

$ python web_demo_hf.py
Traceback (most recent call last):
  File "web_demo_hf.py", line 6, in <module>
    model = AutoModel.from_pretrained("THUDM/visualglm-6b", trust_remote_code=True).half().cuda()
  File "/home/good/anaconda3/envs/visualglm/lib/python3.8/site-packages/transformers/models/auto/auto_factory.py", line 459, in from_pretrained
    model_class = get_class_from_dynamic_module(
  File "/home/good/anaconda3/envs/visualglm/lib/python3.8/site-packages/transformers/dynamic_module_utils.py", line 425, in get_class_from_dynamic_module
    final_module = get_cached_module_file(
  File "/home/good/anaconda3/envs/visualglm/lib/python3.8/site-packages/transformers/dynamic_module_utils.py", line 305, in get_cached_module_file
    get_cached_module_file(
  File "/home/good/anaconda3/envs/visualglm/lib/python3.8/site-packages/transformers/dynamic_module_utils.py", line 267, in get_cached_module_file
    modules_needed = check_imports(resolved_module_file)
  File "/home/good/anaconda3/envs/visualglm/lib/python3.8/site-packages/transformers/dynamic_module_utils.py", line 150, in check_imports
    raise ImportError(
ImportError: This modeling file requires the following packages that were not found in your environment: sat. Run pip install sat
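
The hint in the error message is misleading: the import name is sat, but the PyPI package that provides it is SwissArmyTransformer, so pip install sat installs an unrelated package. Install it as:

pip install SwissArmyTransformer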

The typewriter-style web page is very slow at inference

web_demo_hf.py inference is very slow, much slower than web_demo.py.
I then ran the same Hugging Face model code directly in Jupyter; with the web page stripped away the inference speed is fine, but through the web page it is very slow. I really can't figure out why; I'm not very familiar with gradio.

Finetune error

Traceback (most recent call last):
  File "finetune_visualglm.py", line 170, in <module>
    args = get_args(args_list)
  File "/root/miniconda3/lib/python3.8/site-packages/sat/arguments.py", line 417, in get_args
    initialize_distributed(args)
  File "/root/miniconda3/lib/python3.8/site-packages/sat/arguments.py", line 500, in initialize_distributed
    deepspeed.init_distributed(
TypeError: init_distributed() got an unexpected keyword argument 'world_size'
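
A hedged guess from the signature in the trace: sat's initialize_distributed passes world_size to deepspeed.init_distributed, which older deepspeed releases do not accept. Upgrading both packages so their signatures line up is the usual suggestion:

pip install -U deepspeed SwissArmyTransformer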

Could you add support for Apple's MPS? Running on a Mac M2 currently fails with the error below

❯ python web_demo.py

[2023-05-21 21:29:01,122] [INFO] DeepSpeed/CUDA is not installed, fallback to Pytorch checkpointing.
[2023-05-21 21:29:01,599] [WARNING] Failed to load cpm_kernels:Unknown platform: darwin
[2023-05-21 21:29:01,601] [INFO] building VisualGLMModel model ...
59203
[2023-05-21 21:29:01,625] [INFO] [RANK 0] > initializing model parallel with size 1
[2023-05-21 21:29:01,627] [INFO] [RANK 0] You are using model-only mode.
For torch.distributed users or loading model parallel models, set environment variables RANK, WORLD_SIZE and LOCAL_RANK.


/Users/z/git/VisualGLM-6B/.direnv/python-3.10.11/lib/python3.10/site-packages/torch/nn/init.py:405: UserWarning: Initializing zero-element tensors is a no-op
  warnings.warn("Initializing zero-element tensors is a no-op")
[2023-05-21 21:29:13,787] [INFO] [RANK 0]  > number of parameters on model parallel rank 0: 7810582016
[2023-05-21 21:29:14,203] [INFO] [RANK 0] Torch not compiled with CUDA enabled
[2023-05-21 21:29:14,203] [INFO] [RANK 0] global rank 0 is loading checkpoint /Users/z/.sat_models/visualglm-6b/1/mp_rank_00_model_states.pt
[2023-05-21 21:29:28,809] [INFO] [RANK 0] > successfully loaded /Users/z/.sat_models/visualglm-6b/1/mp_rank_00_model_states.pt
Traceback (most recent call last):
  File "/Users/z/git/VisualGLM-6B/web_demo.py", line 128, in <module>
    main(args)
  File "/Users/z/git/VisualGLM-6B/web_demo.py", line 81, in main
    model, tokenizer = get_infer_setting(gpu_device=0, quant=args.quant)
  File "/Users/z/git/VisualGLM-6B/model/infer_util.py", line 27, in get_infer_setting
    model = model.cuda()
  File "/Users/z/git/VisualGLM-6B/.direnv/python-3.10.11/lib/python3.10/site-packages/torch/nn/modules/module.py", line 905, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/Users/z/git/VisualGLM-6B/.direnv/python-3.10.11/lib/python3.10/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/Users/z/git/VisualGLM-6B/.direnv/python-3.10.11/lib/python3.10/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/Users/z/git/VisualGLM-6B/.direnv/python-3.10.11/lib/python3.10/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/Users/z/git/VisualGLM-6B/.direnv/python-3.10.11/lib/python3.10/site-packages/torch/nn/modules/module.py", line 820, in _apply
    param_applied = fn(param)
  File "/Users/z/git/VisualGLM-6B/.direnv/python-3.10.11/lib/python3.10/site-packages/torch/nn/modules/module.py", line 905, in <lambda>
    return self._apply(lambda t: t.cuda(device))
  File "/Users/z/git/VisualGLM-6B/.direnv/python-3.10.11/lib/python3.10/site-packages/torch/cuda/__init__.py", line 239, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled
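
Per the trace, get_infer_setting in model/infer_util.py unconditionally calls model.cuda(). A hedged local patch (MPS support for this model is untested, and fp16 ops on MPS may still fail further along) is to pick the device instead of hard-coding CUDA:

import torch

# sketch of a device-selection patch in model/infer_util.py
if torch.cuda.is_available():
    model = model.cuda()
elif torch.backends.mps.is_available():
    model = model.to("mps")
else:
    model = model.float()  # CPU fallback; half precision on CPU is impractical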

Could you list the concrete hardware requirements?

I tried to run web_demo_hf.py on Colab; it exited right at "Loading checkpoint shards: 0% 0/5 [00:00<?, ?it/s]^C", and system memory showed a spike, so it presumably exceeded the default 12.7 GB. I'd like to know the minimum hardware requirements for running this model, described concretely like the fields below, so that I can find a suitable machine to deploy on. Thanks.

vCPU:
RAM:
GPU RAM:
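
A rough lower bound can be derived from the parameter count the startup log prints elsewhere on this page (7,810,582,016 parameters on model parallel rank 0): fp16 weights alone come to about 14.6 GiB, before activations, the KV cache, and the CUDA context, and loading sharded HF checkpoints can transiently need roughly as much again in CPU RAM:

python -c "print(7810582016 * 2 / 1024**3)"   # prints ~14.55, i.e. GiB of fp16 weights alone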

What exactly does self.dtype refer to in modeling_chatglm.py?

Hello, I've recently been using VisualGLM to train a reward model. While modifying and reading the code, I noticed this line in modeling_chatglm.py: torch_image = torch_image.to(self.dtype).to(self.device). What exactly does self.dtype refer to here? I couldn't find its definition anywhere in the code.
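
Most likely answer (hedged): self.dtype is not defined in modeling_chatglm.py at all; it comes from transformers' ModuleUtilsMixin, which every PreTrainedModel inherits, and it returns the dtype of the module's floating-point parameters (self.device comes from the same mixin). A minimal sketch of the behaviour:

import torch

class Demo(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = torch.nn.Linear(4, 4).half()

    @property
    def dtype(self):
        # roughly what ModuleUtilsMixin.dtype does: the dtype of the first
        # floating-point parameter
        return next(p.dtype for p in self.parameters() if p.is_floating_point())

print(Demo().dtype)  # torch.float16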

web_demo_hf.py error: RuntimeError: GET was unable to find an engine to execute this computation

After uploading an image, the run fails with the error below:
Traceback (most recent call last):
  File "/mnt/amj/conda/envs/lora/lib/python3.9/site-packages/gradio/routes.py", line 412, in run_predict
    output = await app.get_blocks().process_api(
  File "/mnt/amj/conda/envs/lora/lib/python3.9/site-packages/gradio/blocks.py", line 1299, in process_api
    result = await self.call_function(
  File "/mnt/amj/conda/envs/lora/lib/python3.9/site-packages/gradio/blocks.py", line 1035, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/mnt/amj/conda/envs/lora/lib/python3.9/site-packages/anyio/to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/mnt/amj/conda/envs/lora/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "/mnt/amj/conda/envs/lora/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "/mnt/amj/conda/envs/lora/lib/python3.9/site-packages/gradio/utils.py", line 491, in async_iteration
    return next(iterator)
  File "/mnt/amj/VisualGLM-6B/web_demo_hf.py", line 63, in predict
    for response, history in model.stream_chat(tokenizer, image_path, input, history, max_length=max_length, top_p=top_p,
  File "/mnt/amj/conda/envs/lora/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 35, in generator_context
    response = gen.send(None)
  File "/root/.cache/huggingface/modules/transformers_modules/visualglm-6b/modeling_chatglm.py", line 1439, in stream_chat
    for outputs in self.stream_generate(**inputs, **gen_kwargs):
  File "/mnt/amj/conda/envs/lora/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 35, in generator_context
    response = gen.send(None)
  File "/root/.cache/huggingface/modules/transformers_modules/visualglm-6b/modeling_chatglm.py", line 1291, in stream_generate
    outputs = self(
  File "/mnt/amj/conda/envs/lora/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/visualglm-6b/modeling_chatglm.py", line 1462, in forward
    image_embeds = self.image_encoder(images)
  File "/mnt/amj/conda/envs/lora/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/visualglm-6b/visual.py", line 69, in forward
    enc = self.vit(image)[0]
  File "/mnt/amj/conda/envs/lora/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/visualglm-6b/visual.py", line 28, in forward
    return super().forward(input_ids=input_ids, position_ids=None, attention_mask=attention_mask, image=image)
  File "/mnt/amj/conda/envs/lora/lib/python3.9/site-packages/sat/model/base_model.py", line 144, in forward
    return self.transformer(*args, **kwargs)
  File "/mnt/amj/conda/envs/lora/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/amj/conda/envs/lora/lib/python3.9/site-packages/sat/model/transformer.py", line 451, in forward
    hidden_states = self.hooks['word_embedding_forward'](input_ids, output_cross_layer=output_cross_layer, **kw_args)
  File "/mnt/amj/conda/envs/lora/lib/python3.9/site-packages/sat/model/official/vit_model.py", line 55, in word_embedding_forward
    embeddings = self.proj(images)
  File "/mnt/amj/conda/envs/lora/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/amj/conda/envs/lora/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 463, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/mnt/amj/conda/envs/lora/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 459, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: GET was unable to find an engine to execute this computation

Hit "RuntimeError: Error building extension 'fused_adam'" during tuning

Tuning with the default command produced the following error:

[2023-05-22 16:51:07,239] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Creating extension directory /root/.cache/torch_extensions/py39_cu117/fused_adam...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py39_cu117/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] /usr/bin/nvcc -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -I/root/anaconda3/envs/torch20/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -I/root/anaconda3/envs/torch20/lib/python3.9/site-packages/deepspeed/ops/csrc/adam -isystem /root/anaconda3/envs/torch20/lib/python3.9/site-packages/torch/include -isystem /root/anaconda3/envs/torch20/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/torch20/lib/python3.9/site-packages/torch/include/TH -isystem /root/anaconda3/envs/torch20/lib/python3.9/site-packages/torch/include/THC -isystem /root/anaconda3/envs/torch20/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_70,code=compute_70 -std=c++17 -c /root/anaconda3/envs/torch20/lib/python3.9/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o
FAILED: multi_tensor_adam.cuda.o
/usr/bin/nvcc -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -I/root/anaconda3/envs/torch20/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -I/root/anaconda3/envs/torch20/lib/python3.9/site-packages/deepspeed/ops/csrc/adam -isystem /root/anaconda3/envs/torch20/lib/python3.9/site-packages/torch/include -isystem /root/anaconda3/envs/torch20/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/torch20/lib/python3.9/site-packages/torch/include/TH -isystem /root/anaconda3/envs/torch20/lib/python3.9/site-packages/torch/include/THC -isystem /root/anaconda3/envs/torch20/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_70,code=compute_70 -std=c++17 -c /root/anaconda3/envs/torch20/lib/python3.9/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o
nvcc fatal : Value 'c++17' is not defined for option 'std'
[2/3] c++ -MMD -MF fused_adam_frontend.o.d -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -I/root/anaconda3/envs/torch20/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -I/root/anaconda3/envs/torch20/lib/python3.9/site-packages/deepspeed/ops/csrc/adam -isystem /root/anaconda3/envs/torch20/lib/python3.9/site-packages/torch/include -isystem /root/anaconda3/envs/torch20/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/torch20/lib/python3.9/site-packages/torch/include/TH -isystem /root/anaconda3/envs/torch20/lib/python3.9/site-packages/torch/include/THC -isystem /root/anaconda3/envs/torch20/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++14 -g -Wno-reorder -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -c /root/anaconda3/envs/torch20/lib/python3.9/site-packages/deepspeed/ops/csrc/adam/fused_adam_frontend.cpp -o fused_adam_frontend.o
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "/root/anaconda3/envs/torch20/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1893, in _run_ninja_build
    subprocess.run(
  File "/root/anaconda3/envs/torch20/lib/python3.9/subprocess.py", line 528, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/nfs_data/VisualGLM-6B-main/finetune_visualglm.py", line 188, in <module>
    training_main(args, model_cls=model, forward_step_function=forward_step, create_dataset_function=create_dataset_function, collate_fn=data_collator)
  File "/root/anaconda3/envs/torch20/lib/python3.9/site-packages/sat/training/deepspeed_training.py", line 98, in training_main
    model, optimizer = setup_model_untrainable_params_and_optimizer(args, model)
  File "/root/anaconda3/envs/torch20/lib/python3.9/site-packages/sat/training/deepspeed_training.py", line 161, in setup_model_untrainable_params_and_optimizer
    model, optimizer, _, _ = deepspeed.initialize(
  File "/root/anaconda3/envs/torch20/lib/python3.9/site-packages/deepspeed/__init__.py", line 165, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/root/anaconda3/envs/torch20/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 308, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/root/anaconda3/envs/torch20/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1162, in _configure_optimizer
    basic_optimizer = self._configure_basic_optimizer(model_parameters)
  File "/root/anaconda3/envs/torch20/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1224, in _configure_basic_optimizer
    optimizer = FusedAdam(
  File "/root/anaconda3/envs/torch20/lib/python3.9/site-packages/deepspeed/ops/adam/fused_adam.py", line 71, in __init__
    fused_adam_cuda = FusedAdamBuilder().load()
  File "/root/anaconda3/envs/torch20/lib/python3.9/site-packages/deepspeed/ops/op_builder/builder.py", line 445, in load
    return self.jit_load(verbose)
  File "/root/anaconda3/envs/torch20/lib/python3.9/site-packages/deepspeed/ops/op_builder/builder.py", line 480, in jit_load
    op_module = load(name=self.name,
  File "/root/anaconda3/envs/torch20/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1284, in load
    return _jit_compile(
  File "/root/anaconda3/envs/torch20/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1509, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/root/anaconda3/envs/torch20/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1624, in _write_ninja_file_and_build_library
    _run_ninja_build(
  File "/root/anaconda3/envs/torch20/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1909, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'fused_adam'
VM-3-158-ubuntu:1785083:1800122 [0] NCCL INFO [Service thread] Connection closed by localRank 0
VM-3-158-ubuntu:1785083:1785083 [0] NCCL INFO comm 0x8abbc410 rank 0 nranks 1 cudaDev 0 busId 80 - Abort COMPLETE
VM-3-158-ubuntu:1785083:1800126 [0] NCCL INFO [Service thread] Connection closed by localRank 0
VM-3-158-ubuntu:1785083:1785083 [0] NCCL INFO comm 0x8abc35b0 rank 0 nranks 1 cudaDev 0 busId 80 - Abort COMPLETE
[2023-05-22 16:51:50,540] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 1785083
[2023-05-22 16:51:50,540] [ERROR] [launch.py:434:sigkill_handler] ['/root/anaconda3/envs/torch20/bin/python', '-u', 'finetune_visualglm.py', '--local_rank=0', '--experiment-name', 'finetune-visualglm-6b', '--model-parallel-size', '1', '--mode', 'finetune', '--train-iters', '300', '--resume-dataloader', '--max_source_length', '64', '--max_target_length', '256', '--lora_rank', '10', '--pre_seq_len', '4', '--train-data', './fewshot-data/dataset.json', '--valid-data', './fewshot-data/dataset.json', '--distributed-backend', 'nccl', '--lr-decay-style', 'cosine', '--warmup', '.02', '--checkpoint-activations', '--save-interval', '300', '--eval-interval', '10000', '--save', './checkpoints', '--split', '1', '--eval-iters', '10', '--eval-batch-size', '8', '--zero-stage', '1', '--lr', '0.0001', '--batch-size', '20', '--skip-init', '--fp16', '--use_lora'] exits with return code = 1
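
The decisive line is near the top of the log: nvcc fatal : Value 'c++17' is not defined for option 'std'. The build is picking up /usr/bin/nvcc, an older system CUDA toolkit that predates C++17 support, while torch was built for cu117. A hedged fix is to point the build at a CUDA 11+ toolkit (the path below is an example; use wherever your toolkit actually lives):

nvcc --version                         # confirm which toolkit is being picked up
export CUDA_HOME=/usr/local/cuda-11.7  # example path, adjust to your install
export PATH=$CUDA_HOME/bin:$PATH
bash finetune/finetune_visualglm.sh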

Can VisualGLM-6B be used directly to train a reward model?

Hello, I'd like to use VisualGLM-6B to train a reward model; the input data is currently pure text. I adapted deepspeed_chat myself, but the computation keeps failing. The log is:

  File "/opt/conda/envs/rlhf_tw_test/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/rlhf_tw_test/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (1024x130344 and 4096x1)
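
A hedged reading of the shapes: 130344 looks like a vocabulary-sized logits dimension, while the reward head is a Linear(4096, 1), i.e. it expects hidden states of ChatGLM's hidden size 4096. The usual deepspeed_chat pattern applies the value head to the transformer's hidden states, not to the LM logits. A sketch under that assumption (the output_hidden_states usage is hypothetical for this wrapper):

import torch

hidden_size = 4096                         # ChatGLM-6B hidden size
v_head = torch.nn.Linear(hidden_size, 1, bias=False)

# hypothetical usage, assuming the backbone can return hidden states:
#   outputs = model(input_ids, output_hidden_states=True)
#   last_hidden = outputs.hidden_states[-1]    # (..., 4096)
#   rewards = v_head(last_hidden)              # fine: 4096 -> 1
# whereas v_head(outputs.logits) fails, because the logits' last
# dimension is vocabulary-sized (130344 in the error above)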

Request: a pre-quantized model

It's not only GPU memory that is insufficient; system RAM is too, so I cannot load the full model first and then quantize it.
Is there a tutorial for 8-bit or 4-bit quantization? Thanks.
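
Two existing quantization paths may help, though note that both still materialize the fp16 weights first, so they mainly reduce GPU memory rather than the CPU RAM needed at load time. The sat demos accept a --quant flag (cli_demo.py references args.quant, as the code elsewhere on this page shows), and the HF wrapper is assumed here to expose a ChatGLM-style quantize() method:

python cli_demo.py --quant 4

# or, with the Hugging Face checkpoint (hedged: assumes VisualGLM's remote
# code mirrors ChatGLM's quantize() API):
# model = AutoModel.from_pretrained("THUDM/visualglm-6b", trust_remote_code=True).half().quantize(4).cuda()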

Error when running finetune

NCCL_DEBUG=info NCCL_IB_DISABLE=0 NCCL_NET_GDR_LEVEL=2 deepspeed --master_port 16666 --hostfile hostfile_single finetune_visualglm.py --experiment-name finetune-visualglm-6b --model-parallel-size 1 --mode finetune --train-iters 300 --resume-dataloader --max_source_length 64 --max_target_length 256 --lora_rank 10 --pre_seq_len 4 --train-data ./fewshot-data/dataset.json --valid-data ./fewshot-data/dataset.json --distributed-backend nccl --lr-decay-style cosine --warmup .02 --checkpoint-activations --save-interval 300 --eval-interval 10000 --save ./checkpoints --split 1 --eval-iters 10 --eval-batch-size 8 --zero-stage 1 --lr 0.0001 --batch-size 20 --skip-init --fp16 --use_lora
finetune/finetune_visualglm.sh: line 56: deepspeed: command not found

I have already tried upgrading deepspeed, and it still fails.
Current deepspeed version: 0.9.2
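
"deepspeed: command not found" means the shell cannot see deepspeed's console script, typically because the script runs under a different environment from the one deepspeed was pip-installed into. A hedged check, plus a module-form fallback that bypasses PATH entirely (assuming your deepspeed version exposes the launcher this way):

which deepspeed && pip show deepspeed   # confirm which environment actually has it
# fallback: invoke the launcher as a module from the correct Python
python -m deepspeed.launcher.runner --master_port 16666 --hostfile hostfile_single finetune_visualglm.py ...  # remaining args as in the command above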

The following error occurs when loading the model for CPU inference

Traceback (most recent call last):
  File "C:\ProgramData\anaconda3\envs\chatglm\lib\site-packages\gradio\routes.py", line 401, in run_predict
    output = await app.get_blocks().process_api(
  File "C:\ProgramData\anaconda3\envs\chatglm\lib\site-packages\gradio\blocks.py", line 1302, in process_api
    result = await self.call_function(
  File "C:\ProgramData\anaconda3\envs\chatglm\lib\site-packages\gradio\blocks.py", line 1039, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "C:\ProgramData\anaconda3\envs\chatglm\lib\site-packages\anyio\to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "C:\ProgramData\anaconda3\envs\chatglm\lib\site-packages\anyio\_backends\_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "C:\ProgramData\anaconda3\envs\chatglm\lib\site-packages\anyio\_backends\_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "C:\ProgramData\anaconda3\envs\chatglm\lib\site-packages\gradio\utils.py", line 491, in async_iteration
    return next(iterator)
  File "D:\Code\VisualGLM-6B-main\web_demo_hf.py", line 56, in predict
    chatbot.append((parse_text(input), ""))
AttributeError: 'NoneType' object has no attribute 'append'

Hoping someone can help solve this; it feels like CPU deployment is still broken.

Loading the model locally complains that model_config.json cannot be found

As the title says: I'm using a virtual environment with the required dependencies installed. I downloaded the model from Hugging Face and placed it in the local folder /data/models/THUDM/visualglm-6b, then modified cli_demo.py:

def main():
    ...
    # load model
    model, model_args = VisualGLMModel.from_pretrained(
        "/data/models/THUDM/visualglm-6b",
        args=argparse.Namespace(
        fp16=True,
        skip_init=True,
        use_gpu_initialization=True if (torch.cuda.is_available() and args.quant is None) else False,
        device='cuda' if (torch.cuda.is_available() and args.quant is None) else 'cpu',
        local_files_only=1,
    ))
    ...
    tokenizer = AutoTokenizer.from_pretrained("/data/models/THUDM/visualglm-6b", local_files_only=1, trust_remote_code=True)
    ...

Running cli_demo.py then fails as follows:

Traceback (most recent call last):
  File "/data/VisualGLM-6B/cli_demo.py", line 100, in <module>
    main()
  File "/data/VisualGLM-6B/cli_demo.py", line 25, in main
    model, model_args = VisualGLMModel.from_pretrained(
  File "/data/VisualGLM-6B/.venv/lib/python3.9/site-packages/sat/model/base_model.py", line 212, in from_pretrained
    args = update_args_with_file(args, path=os.path.join(model_path, 'model_config.json'))
  File "/data/VisualGLM-6B/.venv/lib/python3.9/site-packages/sat/arguments.py", line 423, in update_args_with_file
    with open(path, 'r', encoding='utf-8') as f:
FileNotFoundError: [Errno 2] No such file or directory: '/data/models/THUDM/visualglm-6b/model_config.json'

The model folder is laid out as in the attached screenshot.
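
A likely explanation (hedged): the two loaders expect different artifacts. VisualGLMModel.from_pretrained is the SwissArmyTransformer loader and wants a sat checkpoint (model_config.json plus mp_rank_00_model_states.pt, as seen in the ~/.sat_models paths elsewhere on this page), while the Hugging Face repo ships transformers-format files that only AutoModel understands. So either load the local HF folder with transformers, or let sat fetch its own checkpoint by name:

from transformers import AutoTokenizer, AutoModel

# HF-format folder (config.json, sharded pytorch_model-*.bin): use transformers
tokenizer = AutoTokenizer.from_pretrained("/data/models/THUDM/visualglm-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("/data/models/THUDM/visualglm-6b", trust_remote_code=True).half().cuda()

# sat checkpoint: keep cli_demo.py's original call, which downloads it by name
# model, model_args = VisualGLMModel.from_pretrained("visualglm-6b", args=...)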

To train a multimodal language model based on chatglm-6b, how should GLM's token inputs be handled?

I recently want to imitate Microsoft's LLaMA structure and train a multimodal language model, that is, splice the image token vectors and the text embedding vectors into one sequence and feed it into chatglm. The fine-tuning code I've found online all feeds the model token-encoded input_ids. Skimming this repository's code, the project appears to feed images and text in as vectors. With chatglm, how do I feed embedding vectors into the model? As far as I can tell, prediction also yields a token id, which gets concatenated to the input to predict the next one. With vector input, do I take the embedding of the predicted id and concatenate it to the input vectors?
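
For reference, this repository does exactly that through sat's word_embedding_forward hook (visible in the tracebacks above): image features are spliced into the token-embedding sequence before the transformer runs. And yes, with embedding-level inputs, the decode loop embeds each predicted id and appends that vector. A hedged sketch of greedy decoding with inputs_embeds (modeling_chatglm.py's own warning about "input_ids and inputs_embeds" suggests its forward accepts embedding input; position and attention handling are elided here):

import torch

@torch.no_grad()
def greedy_decode(model, word_emb, image_emb, text_ids, steps=32):
    # splice image embeddings and text embeddings into one input sequence
    embeds = torch.cat([image_emb, word_emb(text_ids)], dim=1)  # (1, L, hidden)
    out_ids = []
    for _ in range(steps):
        logits = model(inputs_embeds=embeds).logits             # (1, L, vocab)
        next_id = logits[:, -1].argmax(dim=-1)                  # predicted token id
        out_ids.append(next_id.item())
        # embed the predicted id and append its vector to the input
        embeds = torch.cat([embeds, word_emb(next_id)[:, None]], dim=1)
    return out_ids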

transformers version compatible with requirements.txt

My last issue was ambiguous; sorry about that. Basically, I get this traceback with several transformers versions that satisfy transformers>=4.27.1. So, to avoid triggering this error, which transformers version did you actually use? Or are there other package-version conflicts, or something else entirely that triggers it?

Traceback (most recent call last):
  File "web_demo_hf.py", line 5, in <module>
    tokenizer = AutoTokenizer.from_pretrained("./vglm-6b", trust_remote_code=True)
  File "C:\Users\xxx\AppData\Local\Programs\Python\Python38\lib\site-packages\transformers\models\auto\tokenization_auto.py", line 663, in from_pretrained
    tokenizer_class = get_class_from_dynamic_module(
  File "C:\Users\xxx\AppData\Local\Programs\Python\Python38\lib\site-packages\transformers\dynamic_module_utils.py", line 399, in get_class_from_dynamic_module
    return get_class_in_module(class_name, final_module.replace(".py", ""))
  File "C:\Users\xxx\AppData\Local\Programs\Python\Python38\lib\site-packages\transformers\dynamic_module_utils.py", line 177, in get_class_in_module
    module = importlib.import_module(module_path)
  File "C:\Users\xxx\AppData\Local\Programs\Python\Python38\lib\importlib\__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 961, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 961, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 973, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'transformers_modules.'
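
No confirmed root cause here, but two things are worth trying. The trailing dot in 'transformers_modules.' means transformers derived an empty dynamic-module name from the checkpoint path, which has been reported with relative local paths and with stale cached module code; requirements.txt also asks for transformers>=4.27.1. Hedged workarounds:

rm -rf ~/.cache/huggingface/modules/transformers_modules   # clear the stale dynamic-module cache
pip install transformers==4.27.1

and, in Python, hand transformers an absolute path:

from pathlib import Path
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(str(Path("./vglm-6b").resolve()), trust_remote_code=True)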

Cannot install dependencies on Windows

ERROR: Command errored out with exit status 1:
 command: 'c:\program files\python\python.exe' -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\Users\Administrator\AppData\Local\Temp\pip-install-iozsj5rb\deepspeed_cd9c08b77eaf40568b910542cdc41a19\setup.py'"'"'; __file__='"'"'C:\Users\Administrator\AppData\Local\Temp\pip-install-iozsj5rb\deepspeed_cd9c08b77eaf40568b910542cdc41a19\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base 'C:\Users\Administrator\AppData\Local\Temp\pip-pip-egg-info-sdfey28f'
 cwd: C:\Users\Administrator\AppData\Local\Temp\pip-install-iozsj5rb\deepspeed_cd9c08b77eaf40568b910542cdc41a19\
 Complete output (14 lines):
 test.c
 LINK : fatal error LNK1181: cannot open input file 'aio.lib'
 Traceback (most recent call last):
   File "", line 1, in
   File "C:\Users\Administrator\AppData\Local\Temp\pip-install-iozsj5rb\deepspeed_cd9c08b77eaf40568b910542cdc41a19\setup.py", line 162, in
     abort(f"Unable to pre-compile {op_name}")
   File "C:\Users\Administrator\AppData\Local\Temp\pip-install-iozsj5rb\deepspeed_cd9c08b77eaf40568b910542cdc41a19\setup.py", line 51, in abort
     assert False, msg
 AssertionError: Unable to pre-compile async_io
 DS_BUILD_OPS=1
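
The failing piece is deepspeed's async_io op, which needs aio.lib and generally does not build on Windows. Hedged options: skip deepspeed entirely for inference (the HF demos don't require it), or install deepspeed with op pre-compilation disabled via its build environment variables:

set DS_BUILD_OPS=0
set DS_BUILD_AIO=0
pip install deepspeed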

Which ChatGLM version is used?

May I ask which ChatGLM version the visualGLM backend is based on? Is it the earliest released v0.1.0?

I see that ChatGLM released the v1.1 checkpoint only recently.

Also, is there a plan to release the full training data at some point?

ERROR: Unknown arg use_final_layernorm

This is my code:

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("THUDM/visualglm-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/visualglm-6b", trust_remote_code=True).half().cuda()

It fails to run with "ValueError: Unknown arg use_final_layernorm."

What's the problem?
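
A hedged guess: "Unknown arg use_final_layernorm" comes from the sat side rejecting a field in the checkpoint's model config, which usually indicates that the installed SwissArmyTransformer version doesn't match the remote modeling code. Upgrading sat and clearing the cached remote code is the usual first attempt:

pip install -U SwissArmyTransformer
rm -rf ~/.cache/huggingface/modules/transformers_modules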
