Comments (14)
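This is probably caused by using the wrong checkpoint. You need the sat checkpoint (by default it is downloaded from Tsinghua Cloud); the one you downloaded is probably the Hugging Face version, which has no model_config.json file.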
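A quick way to tell the two checkpoint layouts apart is to look for the file that the traceback in this thread fails to open. The sketch below is only an illustration: the helper name `looks_like_sat_checkpoint` is hypothetical, and it checks nothing beyond the presence of `model_config.json`.

```python
import os
import tempfile


def looks_like_sat_checkpoint(model_dir):
    """Return True if model_dir contains model_config.json.

    Hypothetical helper: it only tests for the one file named in the
    FileNotFoundError, not for the weight files themselves.
    """
    return os.path.isfile(os.path.join(model_dir, "model_config.json"))


# Demonstration on a throwaway directory instead of a real checkpoint.
with tempfile.TemporaryDirectory() as demo_dir:
    print(looks_like_sat_checkpoint(demo_dir))  # directory is still empty
    with open(os.path.join(demo_dir, "model_config.json"), "w") as f:
        f.write("{}")
    print(looks_like_sat_checkpoint(demo_dir))  # file now exists
```

Running the check against the directory passed to the fine-tuning script before launching may save a full startup cycle.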
from visualglm-6b.
You could try updating your deepspeed version.
from visualglm-6b.
Upgrading the version fixes that problem, but now a different one appears:
Traceback (most recent call last):
File "finetune_visualglm.py", line 175, in <module>
model, args = FineTuneVisualGLMModel.from_pretrained(model_type, args)
File "/root/miniconda3/lib/python3.8/site-packages/sat/model/base_model.py", line 212, in from_pretrained
args = update_args_with_file(args, path=os.path.join(model_path, 'model_config.json'))
File "/root/miniconda3/lib/python3.8/site-packages/sat/arguments.py", line 423, in update_args_with_file
with open(path, 'r', encoding='utf-8') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'visualglm-6b/model_config.json'
from visualglm-6b.
[2023-05-23 10:55:58,501] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 31621
[2023-05-23 10:55:59,931] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 31622
[2023-05-23 10:56:01,359] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 31623
[2023-05-23 10:56:01,359] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 31624
[2023-05-23 10:56:02,826] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 31626
[2023-05-23 10:56:04,334] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 31628
[2023-05-23 10:56:05,841] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 31630
[2023-05-23 10:56:07,428] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 31632
[2023-05-23 10:56:08,935] [ERROR] [launch.py:434:sigkill_handler] ['/root/miniconda3/bin/python', '-u', 'finetune_visualglm.py', '--local_rank=7', '--experiment-name', 'finetune-visualglm-6b', '--model-parallel-size', '1', '--mode', 'finetune', '--train-iters', '300', '--resume-dataloader', '--max_source_length', '64', '--max_target_length', '256', '--lora_rank', '10', '--pre_seq_len', '4', '--train-data', './fewshot-data/dataset.json', '--valid-data', './fewshot-data/dataset.json', '--distributed-backend', 'nccl', '--lr-decay-style', 'cosine', '--warmup', '.02', '--checkpoint-activations', '--save-interval', '300', '--eval-interval', '10000', '--save', './checkpoints', '--split', '1', '--eval-iters', '10', '--eval-batch-size', '8', '--zero-stage', '1', '--lr', '0.0001', '--batch-size', '20', '--skip-init', '--fp16', '--use_lora'] exits with return code = -9
from visualglm-6b.
Could you paste more of the error output above this? It may be that there is not enough GPU memory; try reducing the batch size.
from visualglm-6b.
$sh finetune/finetune_visualglm.sh
NCCL_DEBUG=info NCCL_IB_DISABLE=0 NCCL_NET_GDR_LEVEL=2 deepspeed --master_port 16666 --hostfile hostfile_single finetune_visualglm.py --experiment-name finetune-visualglm-6b --model-parallel-size 1 --mode finetune --train-iters 300 --resume-dataloader --max_source_length 64 --max_target_length 256 --lora_rank 10 --pre_seq_len 4 --train-data ./fewshot-data/dataset.json --valid-data ./fewshot-data/dataset.json --distributed-backend nccl --lr-decay-style cosine --warmup .02 --checkpoint-activations --save-interval 300 --eval-interval 10000 --save ./checkpoints --split 1 --eval-iters 10 --eval-batch-size 8 --zero-stage 1 --lr 0.0001 --batch-size 20 --skip-init --fp16 --use_lora
[2023-05-23 10:48:02,235] [WARNING] [runner.py:191:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-05-23 10:48:02,292] [INFO] [runner.py:541:main] cmd = /root/miniconda3/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=16666 --enable_each_rank_log=None finetune_visualglm.py --experiment-name finetune-visualglm-6b --model-parallel-size 1 --mode finetune --train-iters 300 --resume-dataloader --max_source_length 64 --max_target_length 256 --lora_rank 10 --pre_seq_len 4 --train-data ./fewshot-data/dataset.json --valid-data ./fewshot-data/dataset.json --distributed-backend nccl --lr-decay-style cosine --warmup .02 --checkpoint-activations --save-interval 300 --eval-interval 10000 --save ./checkpoints --split 1 --eval-iters 10 --eval-batch-size 8 --zero-stage 1 --lr 0.0001 --batch-size 20 --skip-init --fp16 --use_lora
[2023-05-23 10:48:08,322] [INFO] [launch.py:222:main] 0 NCCL_DEBUG=info
[2023-05-23 10:48:08,322] [INFO] [launch.py:222:main] 0 NCCL_NET_GDR_LEVEL=2
[2023-05-23 10:48:08,322] [INFO] [launch.py:222:main] 0 NCCL_IB_DISABLE=0
[2023-05-23 10:48:08,322] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2023-05-23 10:48:08,322] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=8, node_rank=0
[2023-05-23 10:48:08,322] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2023-05-23 10:48:08,322] [INFO] [launch.py:247:main] dist_world_size=8
[2023-05-23 10:48:08,322] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
[2023-05-23 10:48:16,792] [INFO] using world size: 8 and model-parallel size: 1
[2023-05-23 10:48:16,792] [INFO] > padded vocab (size: 100) with 28 dummy tokens (new size: 128)
16666
16666
16666
16666
16666
16666
16666
16666
[2023-05-23 10:48:17,565] [INFO] [RANK 0] > initializing model parallel with size 1
[2023-05-23 10:48:17,645] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter cpu_offload is deprecated use offload_optimizer instead
[2023-05-23 10:48:17,647] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter cpu_offload is deprecated use offload_optimizer instead
[2023-05-23 10:48:17,647] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter cpu_offload is deprecated use offload_optimizer instead
[2023-05-23 10:48:17,647] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter cpu_offload is deprecated use offload_optimizer instead
[2023-05-23 10:48:17,648] [INFO] [checkpointing.py:764:_configure_using_config_file] {'partition_activations': False, 'contiguous_memory_optimization': False, 'cpu_checkpointing': False, 'number_checkpoints': None, 'synchronize_checkpoint_boundary': False, 'profile': False}
[2023-05-23 10:48:17,648] [INFO] [checkpointing.py:226:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
[2023-05-23 10:48:17,650] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter cpu_offload is deprecated use offload_optimizer instead
[2023-05-23 10:48:17,651] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter cpu_offload is deprecated use offload_optimizer instead
[2023-05-23 10:48:17,653] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter cpu_offload is deprecated use offload_optimizer instead
[2023-05-23 10:48:17,654] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter cpu_offload is deprecated use offload_optimizer instead
[2023-05-23 10:48:17,668] [INFO] [RANK 0] building FineTuneVisualGLMModel model ...
/root/miniconda3/lib/python3.8/site-packages/torch/nn/init.py:405: UserWarning: Initializing zero-element tensors is a no-op
warnings.warn("Initializing zero-element tensors is a no-op")
/root/miniconda3/lib/python3.8/site-packages/torch/nn/init.py:405: UserWarning: Initializing zero-element tensors is a no-op
warnings.warn("Initializing zero-element tensors is a no-op")
/root/miniconda3/lib/python3.8/site-packages/torch/nn/init.py:405: UserWarning: Initializing zero-element tensors is a no-op
warnings.warn("Initializing zero-element tensors is a no-op")
/root/miniconda3/lib/python3.8/site-packages/torch/nn/init.py:405: UserWarning: Initializing zero-element tensors is a no-op
warnings.warn("Initializing zero-element tensors is a no-op")
/root/miniconda3/lib/python3.8/site-packages/torch/nn/init.py:405: UserWarning: Initializing zero-element tensors is a no-op
warnings.warn("Initializing zero-element tensors is a no-op")
/root/miniconda3/lib/python3.8/site-packages/torch/nn/init.py:405: UserWarning: Initializing zero-element tensors is a no-op
warnings.warn("Initializing zero-element tensors is a no-op")
/root/miniconda3/lib/python3.8/site-packages/torch/nn/init.py:405: UserWarning: Initializing zero-element tensors is a no-op
warnings.warn("Initializing zero-element tensors is a no-op")
/root/miniconda3/lib/python3.8/site-packages/torch/nn/init.py:405: UserWarning: Initializing zero-element tensors is a no-op
warnings.warn("Initializing zero-element tensors is a no-op")
[2023-05-23 10:48:33,416] [INFO] [RANK 0] > number of parameters on model parallel rank 0: 7811237376
[2023-05-23 10:48:47,042] [INFO] [RANK 0] global rank 0 is loading checkpoint /mnt/benteng.bt/visualglm-6b/1/mp_rank_00_model_states.pt
[2023-05-23 10:55:58,501] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 31621
[2023-05-23 10:55:59,931] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 31622
[2023-05-23 10:56:01,359] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 31623
[2023-05-23 10:56:01,359] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 31624
[2023-05-23 10:56:02,826] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 31626
[2023-05-23 10:56:04,334] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 31628
[2023-05-23 10:56:05,841] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 31630
[2023-05-23 10:56:07,428] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 31632
[2023-05-23 10:56:08,935] [ERROR] [launch.py:434:sigkill_handler] ['/root/miniconda3/bin/python', '-u', 'finetune_visualglm.py', '--local_rank=7', '--experiment-name', 'finetune-visualglm-6b', '--model-parallel-size', '1', '--mode', 'finetune', '--train-iters', '300', '--resume-dataloader', '--max_source_length', '64', '--max_target_length', '256', '--lora_rank', '10', '--pre_seq_len', '4', '--train-data', './fewshot-data/dataset.json', '--valid-data', './fewshot-data/dataset.json', '--distributed-backend', 'nccl', '--lr-decay-style', 'cosine', '--warmup', '.02', '--checkpoint-activations', '--save-interval', '300', '--eval-interval', '10000', '--save', './checkpoints', '--split', '1', '--eval-iters', '10', '--eval-batch-size', '8', '--zero-stage', '1', '--lr', '0.0001', '--batch-size', '20', '--skip-init', '--fp16', '--use_lora'] exits with return code = -9
from visualglm-6b.
$nvidia-smi
Tue May 23 11:28:54 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01 Driver Version: 470.82.01 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... Off | 00000000:4F:00.0 Off | 0 |
| N/A 36C P0 52W / 300W | 16092MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... Off | 00000000:50:00.0 Off | 0 |
| N/A 37C P0 54W / 300W | 16092MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... Off | 00000000:5F:00.0 Off | 0 |
| N/A 38C P0 52W / 300W | 16092MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... Off | 00000000:60:00.0 Off | 0 |
| N/A 36C P0 55W / 300W | 16092MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 4 Tesla V100-SXM2... Off | 00000000:B1:00.0 Off | 0 |
| N/A 36C P0 55W / 300W | 16092MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 5 Tesla V100-SXM2... Off | 00000000:B2:00.0 Off | 0 |
| N/A 35C P0 54W / 300W | 16092MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 6 Tesla V100-SXM2... Off | 00000000:DB:00.0 Off | 0 |
| N/A 35C P0 52W / 300W | 16092MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 7 Tesla V100-SXM2... Off | 00000000:DC:00.0 Off | 0 |
| N/A 35C P0 54W / 300W | 16092MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
from visualglm-6b.
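Could you paste more of the error output above this? It may be that there is not enough GPU memory; try reducing the batch size.
There is enough memory, and I don't see anything in the output above that says where the error is.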
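The log does carry one hint: the launcher reports `return code = -9`. Under the convention used by Python's `subprocess` module, a negative return code means the child was terminated by the signal with that number. The sketch below decodes such a code on a POSIX system; that the launcher follows the same convention is an assumption, and the helper name is hypothetical.

```python
import signal


def describe_return_code(code):
    """Translate a subprocess-style return code into a readable string."""
    if code < 0:
        # Negative values encode the number of the terminating signal.
        return "killed by " + signal.Signals(-code).name
    return "exited with status %d" % code


print(describe_return_code(-9))  # killed by SIGKILL
```

SIGKILL leaves no Python traceback, which would explain why no error location appears. One common source of SIGKILL is the Linux out-of-memory killer acting on host RAM rather than GPU memory, so checking the kernel log after a failed run may be worthwhile; the thread does not confirm that this is the cause here.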
from visualglm-6b.
On my side, peak GPU memory usage is 37 GB. Try reducing the batch size; with batch=1 it needs 18 GB of GPU memory.
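These figures can be compared with the nvidia-smi output posted earlier in the thread, where each V100 shows 16092MiB of 32510MiB in use. The sketch below does that arithmetic. It assumes that "18G" means 18 GiB and that the memory shown as used would still be occupied during training; neither is stated in the thread.

```python
# Values transcribed from the nvidia-smi output in this thread.
V100_TOTAL_MIB = 32510
V100_USED_MIB = 16092

# Assumption: the quoted "18G" for batch=1 is 18 GiB.
BATCH1_NEED_MIB = 18 * 1024


def free_mib(total_mib, used_mib):
    """Memory left on one GPU, in MiB."""
    return total_mib - used_mib


free = free_mib(V100_TOTAL_MIB, V100_USED_MIB)
print(free, free >= BATCH1_NEED_MIB)  # 16418 False
```

Under those assumptions even batch=1 would not fit in the remaining memory, so it may be worth finding out what holds the 16092MiB before tuning the batch size.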
from visualglm-6b.
Update the deepspeed version
How do I update deepspeed? Could you share the command?
from visualglm-6b.
@magicwang1111 pip install --upgrade deepspeed
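To confirm which version is installed before and after the upgrade, the standard-library `importlib.metadata` module can be queried. This is a minimal sketch; the helper name `installed_version` is hypothetical.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


print("deepspeed:", installed_version("deepspeed"))
```

The function returns `None` when the package is not installed in the active environment, which also helps catch the case where pip upgraded a different Python installation than the one running the training script.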
from visualglm-6b.
@magicwang1111
pip install --upgrade deepspeed
Thanks.
from visualglm-6b.
Is my run not using the GPU for training? What is causing this?
[2023-05-24 17:16:13,459] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4]}
[2023-05-24 17:16:13,459] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=5, node_rank=0
[2023-05-24 17:16:13,459] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4]})
[2023-05-24 17:16:13,459] [INFO] [launch.py:247:main] dist_world_size=5
[2023-05-24 17:16:13,459] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4
16666
16666
[2023-05-24 17:16:17,866] [INFO] using world size: 5 and model-parallel size: 1
[2023-05-24 17:16:17,866] [INFO] > padded vocab (size: 100) with 28 dummy tokens (new size: 128)
16666
16666
16666
[2023-05-24 17:16:18,841] [INFO] [RANK 0] > initializing model parallel with size 1
[2023-05-24 17:16:18,894] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter cpu_offload is deprecated use offload_optimizer instead
[2023-05-24 17:16:18,894] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter cpu_offload is deprecated use offload_optimizer instead
[2023-05-24 17:16:18,894] [INFO] [checkpointing.py:764:_configure_using_config_file] {'partition_activations': False, 'contiguous_memory_optimization': False, 'cpu_checkpointing': False, 'number_checkpoints': None, 'synchronize_checkpoint_boundary': False, 'profile': False}
[2023-05-24 17:16:18,894] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter cpu_offload is deprecated use offload_optimizer instead
[2023-05-24 17:16:18,895] [INFO] [checkpointing.py:226:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
[2023-05-24 17:16:18,895] [INFO] [RANK 0] building FineTuneVisualGLMModel model ...
[2023-05-24 17:16:18,899] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter cpu_offload is deprecated use offload_optimizer instead
[2023-05-24 17:16:18,899] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter cpu_offload is deprecated use offload_optimizer instead
/root/miniconda3/lib/python3.10/site-packages/torch/nn/init.py:405: UserWarning: Initializing zero-element tensors is a no-op
warnings.warn("Initializing zero-element tensors is a no-op")
/root/miniconda3/lib/python3.10/site-packages/torch/nn/init.py:405: UserWarning: Initializing zero-element tensors is a no-op
warnings.warn("Initializing zero-element tensors is a no-op")
/root/miniconda3/lib/python3.10/site-packages/torch/nn/init.py:405: UserWarning: Initializing zero-element tensors is a no-op
warnings.warn("Initializing zero-element tensors is a no-op")
/root/miniconda3/lib/python3.10/site-packages/torch/nn/init.py:405: UserWarning: Initializing zero-element tensors is a no-op
warnings.warn("Initializing zero-element tensors is a no-op")
/root/miniconda3/lib/python3.10/site-packages/torch/nn/init.py:405: UserWarning: Initializing zero-element tensors is a no-op
warnings.warn("Initializing zero-element tensors is a no-op")
[2023-05-24 17:16:29,009] [INFO] [RANK 0] > number of parameters on model parallel rank 0: 7811237376
[2023-05-24 17:16:39,527] [INFO] [RANK 4] CUDA out of memory. Tried to allocate 20.00 MiB (GPU 4; 3.82 GiB total capacity; 2.96 GiB already allocated; 18.62 MiB free; 3.08 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[2023-05-24 17:16:41,348] [INFO] [RANK 0] global rank 0 is loading checkpoint /root/.sat_models/visualglm-6b/1/mp_rank_00_model_states.pt
[2023-05-24 17:16:50,216] [INFO] [RANK 0] > successfully loaded /root/.sat_models/visualglm-6b/1/mp_rank_00_model_states.pt
[2023-05-24 17:16:50,688] [INFO] [RANK 0] Try to load tokenizer from Huggingface transformers...
[2023-05-24 17:16:51,354] [INFO] [RANK 0] > Set tokenizer as a THUDM/chatglm-6b tokenizer! Now you can get_tokenizer() everywhere.
from visualglm-6b.
Is my run not using the GPU for training? What is causing this?
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.86.01 Driver Version: 515.86.01 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-SXM... Off | 00000000:01:00.0 Off | 0 |
| N/A 64C P0 84W / 275W | 16003MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM... Off | 00000000:47:00.0 Off | 0 |
| N/A 63C P0 86W / 275W | 16003MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-SXM... Off | 00000000:81:00.0 Off | 0 |
| N/A 63C P0 81W / 275W | 16003MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA DGX Display Off | 00000000:C1:00.0 Off | N/A |
| 42% 57C P8 N/A / 50W | 3893MiB / 4096MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 4 NVIDIA A100-SXM... Off | 00000000:C2:00.0 Off | 0 |
| N/A 62C P0 80W / 275W | 16003MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
GPU usage
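The table shows about 16 GB in use on each A100, so the GPUs are being used. The out-of-memory message comes from a device with 3.82 GiB total capacity, which matches the 4096MiB NVIDIA DGX Display adapter: the launcher set CUDA_VISIBLE_DEVICES=0,1,2,3,4, so one rank landed on the display card. The sketch below picks out the devices large enough for training from values transcribed from that table; the threshold reuses the 18 GB figure quoted earlier in the thread and is an assumption, as is the helper name.

```python
# (nvidia-smi index, name, total memory in MiB), transcribed from the table.
GPUS = [
    (0, "NVIDIA A100-SXM...", 81920),
    (1, "NVIDIA A100-SXM...", 81920),
    (2, "NVIDIA A100-SXM...", 81920),
    (3, "NVIDIA DGX Display", 4096),
    (4, "NVIDIA A100-SXM...", 81920),
]


def training_gpus(gpus, min_total_mib):
    """Return the indices of GPUs with at least min_total_mib of memory."""
    return [idx for idx, _name, total in gpus if total >= min_total_mib]


selected = training_gpus(GPUS, min_total_mib=18 * 1024)
print(",".join(str(i) for i in selected))  # 0,1,2,4
```

Note that the error message calls the small device "GPU 4" while nvidia-smi lists it as index 3, so the two tools appear to number devices differently on this machine. Setting `CUDA_DEVICE_ORDER=PCI_BUS_ID` is the usual way to align CUDA's numbering with nvidia-smi; whichever mechanism is used to restrict the devices, it is worth verifying that the display adapter is the one actually excluded.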
from visualglm-6b.