Comments (14)

1049451037 commented on May 28, 2024

This is most likely caused by downloading the wrong checkpoint. You need the SAT checkpoint (by default it is downloaded from the Tsinghua cloud drive); what you downloaded is probably the Hugging Face version, which has no model_config.json file.
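A quick way to tell the two downloads apart (a minimal check; the local directory name visualglm-6b is an assumption): the SAT checkpoint ships a model_config.json at its top level, next to the 1/mp_rank_00_model_states.pt weights, while the Hugging Face download does not.

$ ls visualglm-6b/model_config.json || echo "not a SAT checkpoint"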

1049451037 commented on May 28, 2024

You could try updating your DeepSpeed version.

freelancerllm commented on May 28, 2024

Upgrading the version fixed this problem, but another one has appeared:
Traceback (most recent call last):
  File "finetune_visualglm.py", line 175, in <module>
    model, args = FineTuneVisualGLMModel.from_pretrained(model_type, args)
  File "/root/miniconda3/lib/python3.8/site-packages/sat/model/base_model.py", line 212, in from_pretrained
    args = update_args_with_file(args, path=os.path.join(model_path, 'model_config.json'))
  File "/root/miniconda3/lib/python3.8/site-packages/sat/arguments.py", line 423, in update_args_with_file
    with open(path, 'r', encoding='utf-8') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'visualglm-6b/model_config.json'
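A hedged workaround, assuming the Hugging Face weights sit in a local visualglm-6b directory that shadows the model name when SAT resolves the checkpoint path: move them out of the way and let SAT download its own checkpoint (it goes to ~/.sat_models by default, as a later log in this thread confirms).

$ mv visualglm-6b visualglm-6b-hf    # park the Hugging Face weights under another name
$ sh finetune/finetune_visualglm.sh  # SAT re-downloads its checkpoint to ~/.sat_models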

freelancerllm commented on May 28, 2024

[2023-05-23 10:55:58,501] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 31621
[2023-05-23 10:55:59,931] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 31622
[2023-05-23 10:56:01,359] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 31623
[2023-05-23 10:56:01,359] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 31624
[2023-05-23 10:56:02,826] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 31626
[2023-05-23 10:56:04,334] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 31628
[2023-05-23 10:56:05,841] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 31630
[2023-05-23 10:56:07,428] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 31632
[2023-05-23 10:56:08,935] [ERROR] [launch.py:434:sigkill_handler] ['/root/miniconda3/bin/python', '-u', 'finetune_visualglm.py', '--local_rank=7', '--experiment-name', 'finetune-visualglm-6b', '--model-parallel-size', '1', '--mode', 'finetune', '--train-iters', '300', '--resume-dataloader', '--max_source_length', '64', '--max_target_length', '256', '--lora_rank', '10', '--pre_seq_len', '4', '--train-data', './fewshot-data/dataset.json', '--valid-data', './fewshot-data/dataset.json', '--distributed-backend', 'nccl', '--lr-decay-style', 'cosine', '--warmup', '.02', '--checkpoint-activations', '--save-interval', '300', '--eval-interval', '10000', '--save', './checkpoints', '--split', '1', '--eval-iters', '10', '--eval-batch-size', '8', '--zero-stage', '1', '--lr', '0.0001', '--batch-size', '20', '--skip-init', '--fp16', '--use_lora'] exits with return code = -9

1049451037 commented on May 28, 2024

Could you paste more of the error output above? It might be running out of GPU memory; you could try reducing the batch size.

freelancerllm commented on May 28, 2024

$sh finetune/finetune_visualglm.sh
NCCL_DEBUG=info NCCL_IB_DISABLE=0 NCCL_NET_GDR_LEVEL=2 deepspeed --master_port 16666 --hostfile hostfile_single finetune_visualglm.py --experiment-name finetune-visualglm-6b --model-parallel-size 1 --mode finetune --train-iters 300 --resume-dataloader --max_source_length 64 --max_target_length 256 --lora_rank 10 --pre_seq_len 4 --train-data ./fewshot-data/dataset.json --valid-data ./fewshot-data/dataset.json --distributed-backend nccl --lr-decay-style cosine --warmup .02 --checkpoint-activations --save-interval 300 --eval-interval 10000 --save ./checkpoints --split 1 --eval-iters 10 --eval-batch-size 8 --zero-stage 1 --lr 0.0001 --batch-size 20 --skip-init --fp16 --use_lora
[2023-05-23 10:48:02,235] [WARNING] [runner.py:191:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-05-23 10:48:02,292] [INFO] [runner.py:541:main] cmd = /root/miniconda3/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=16666 --enable_each_rank_log=None finetune_visualglm.py --experiment-name finetune-visualglm-6b --model-parallel-size 1 --mode finetune --train-iters 300 --resume-dataloader --max_source_length 64 --max_target_length 256 --lora_rank 10 --pre_seq_len 4 --train-data ./fewshot-data/dataset.json --valid-data ./fewshot-data/dataset.json --distributed-backend nccl --lr-decay-style cosine --warmup .02 --checkpoint-activations --save-interval 300 --eval-interval 10000 --save ./checkpoints --split 1 --eval-iters 10 --eval-batch-size 8 --zero-stage 1 --lr 0.0001 --batch-size 20 --skip-init --fp16 --use_lora
[2023-05-23 10:48:08,322] [INFO] [launch.py:222:main] 0 NCCL_DEBUG=info
[2023-05-23 10:48:08,322] [INFO] [launch.py:222:main] 0 NCCL_NET_GDR_LEVEL=2
[2023-05-23 10:48:08,322] [INFO] [launch.py:222:main] 0 NCCL_IB_DISABLE=0
[2023-05-23 10:48:08,322] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2023-05-23 10:48:08,322] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=8, node_rank=0
[2023-05-23 10:48:08,322] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2023-05-23 10:48:08,322] [INFO] [launch.py:247:main] dist_world_size=8
[2023-05-23 10:48:08,322] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
[2023-05-23 10:48:16,792] [INFO] using world size: 8 and model-parallel size: 1
[2023-05-23 10:48:16,792] [INFO] > padded vocab (size: 100) with 28 dummy tokens (new size: 128)
16666
16666
16666
16666
16666
16666
16666
16666
[2023-05-23 10:48:17,565] [INFO] [RANK 0] > initializing model parallel with size 1
[2023-05-23 10:48:17,645] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter cpu_offload is deprecated use offload_optimizer instead
[2023-05-23 10:48:17,647] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter cpu_offload is deprecated use offload_optimizer instead
[2023-05-23 10:48:17,647] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter cpu_offload is deprecated use offload_optimizer instead
[2023-05-23 10:48:17,647] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter cpu_offload is deprecated use offload_optimizer instead
[2023-05-23 10:48:17,648] [INFO] [checkpointing.py:764:_configure_using_config_file] {'partition_activations': False, 'contiguous_memory_optimization': False, 'cpu_checkpointing': False, 'number_checkpoints': None, 'synchronize_checkpoint_boundary': False, 'profile': False}
[2023-05-23 10:48:17,648] [INFO] [checkpointing.py:226:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
[2023-05-23 10:48:17,650] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter cpu_offload is deprecated use offload_optimizer instead
[2023-05-23 10:48:17,651] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter cpu_offload is deprecated use offload_optimizer instead
[2023-05-23 10:48:17,653] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter cpu_offload is deprecated use offload_optimizer instead
[2023-05-23 10:48:17,654] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter cpu_offload is deprecated use offload_optimizer instead
[2023-05-23 10:48:17,668] [INFO] [RANK 0] building FineTuneVisualGLMModel model ...
/root/miniconda3/lib/python3.8/site-packages/torch/nn/init.py:405: UserWarning: Initializing zero-element tensors is a no-op
  warnings.warn("Initializing zero-element tensors is a no-op")
/root/miniconda3/lib/python3.8/site-packages/torch/nn/init.py:405: UserWarning: Initializing zero-element tensors is a no-op
  warnings.warn("Initializing zero-element tensors is a no-op")
/root/miniconda3/lib/python3.8/site-packages/torch/nn/init.py:405: UserWarning: Initializing zero-element tensors is a no-op
  warnings.warn("Initializing zero-element tensors is a no-op")
/root/miniconda3/lib/python3.8/site-packages/torch/nn/init.py:405: UserWarning: Initializing zero-element tensors is a no-op
  warnings.warn("Initializing zero-element tensors is a no-op")
/root/miniconda3/lib/python3.8/site-packages/torch/nn/init.py:405: UserWarning: Initializing zero-element tensors is a no-op
  warnings.warn("Initializing zero-element tensors is a no-op")
/root/miniconda3/lib/python3.8/site-packages/torch/nn/init.py:405: UserWarning: Initializing zero-element tensors is a no-op
  warnings.warn("Initializing zero-element tensors is a no-op")
/root/miniconda3/lib/python3.8/site-packages/torch/nn/init.py:405: UserWarning: Initializing zero-element tensors is a no-op
  warnings.warn("Initializing zero-element tensors is a no-op")
/root/miniconda3/lib/python3.8/site-packages/torch/nn/init.py:405: UserWarning: Initializing zero-element tensors is a no-op
  warnings.warn("Initializing zero-element tensors is a no-op")
[2023-05-23 10:48:33,416] [INFO] [RANK 0] > number of parameters on model parallel rank 0: 7811237376
[2023-05-23 10:48:47,042] [INFO] [RANK 0] global rank 0 is loading checkpoint /mnt/benteng.bt/visualglm-6b/1/mp_rank_00_model_states.pt
[2023-05-23 10:55:58,501] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 31621
[2023-05-23 10:55:59,931] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 31622
[2023-05-23 10:56:01,359] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 31623
[2023-05-23 10:56:01,359] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 31624
[2023-05-23 10:56:02,826] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 31626
[2023-05-23 10:56:04,334] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 31628
[2023-05-23 10:56:05,841] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 31630
[2023-05-23 10:56:07,428] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 31632
[2023-05-23 10:56:08,935] [ERROR] [launch.py:434:sigkill_handler] ['/root/miniconda3/bin/python', '-u', 'finetune_visualglm.py', '--local_rank=7', '--experiment-name', 'finetune-visualglm-6b', '--model-parallel-size', '1', '--mode', 'finetune', '--train-iters', '300', '--resume-dataloader', '--max_source_length', '64', '--max_target_length', '256', '--lora_rank', '10', '--pre_seq_len', '4', '--train-data', './fewshot-data/dataset.json', '--valid-data', './fewshot-data/dataset.json', '--distributed-backend', 'nccl', '--lr-decay-style', 'cosine', '--warmup', '.02', '--checkpoint-activations', '--save-interval', '300', '--eval-interval', '10000', '--save', './checkpoints', '--split', '1', '--eval-iters', '10', '--eval-batch-size', '8', '--zero-stage', '1', '--lr', '0.0001', '--batch-size', '20', '--skip-init', '--fp16', '--use_lora'] exits with return code = -9
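A note on the exit code: return code = -9 means the subprocesses received SIGKILL, which on Linux usually comes from the host's out-of-memory killer rather than from a CUDA error. That is plausible here, since all 8 ranks load the full checkpoint of a ~7.8B-parameter model into CPU memory at roughly the same time. A quick check, assuming you can read the kernel log on the host:

$ dmesg | grep -i -E "killed process|out of memory"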

freelancerllm commented on May 28, 2024

$nvidia-smi
Tue May 23 11:28:54 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:4F:00.0 Off |                    0 |
| N/A   36C    P0    52W / 300W |  16092MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:50:00.0 Off |                    0 |
| N/A   37C    P0    54W / 300W |  16092MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  Off  | 00000000:5F:00.0 Off |                    0 |
| N/A   38C    P0    52W / 300W |  16092MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  Off  | 00000000:60:00.0 Off |                    0 |
| N/A   36C    P0    55W / 300W |  16092MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2...  Off  | 00000000:B1:00.0 Off |                    0 |
| N/A   36C    P0    55W / 300W |  16092MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  Off  | 00000000:B2:00.0 Off |                    0 |
| N/A   35C    P0    54W / 300W |  16092MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  Off  | 00000000:DB:00.0 Off |                    0 |
| N/A   35C    P0    52W / 300W |  16092MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  Off  | 00000000:DC:00.0 Off |                    0 |
| N/A   35C    P0    54W / 300W |  16092MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

freelancerllm commented on May 28, 2024

> Could you paste more of the error output above? It might be running out of GPU memory; you could try reducing the batch size.

Memory is sufficient, and the output above doesn't indicate where the error occurred.

1049451037 commented on May 28, 2024

On my side peak GPU memory usage is 37 GB. You can try reducing the batch size; with batch size 1 it needs 18 GB of GPU memory.
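The batch size is set in finetune/finetune_visualglm.sh via the --batch-size flag visible in the launch command above; a minimal sketch of the change (the value 4 is only an example, since peak memory scales between the reported ~18 GB at batch size 1 and the ~37 GB figure):

$ sed -i 's/--batch-size 20/--batch-size 4/' finetune/finetune_visualglm.sh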

magicwang1111 commented on May 28, 2024

> Update your DeepSpeed version.

How do I update the DeepSpeed version? Could you share the command?

1049451037 commented on May 28, 2024

@magicwang1111 pip install --upgrade deepspeed
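To confirm which version the upgrade actually installed (assuming deepspeed now imports cleanly):

$ python -c "import deepspeed; print(deepspeed.__version__)"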

magicwang1111 commented on May 28, 2024

> @magicwang1111 pip install --upgrade deepspeed

Thank you.

WangRongsheng commented on May 28, 2024

Is my run failing to train on the GPU? What could be causing this?

[2023-05-24 17:16:13,459] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4]}
[2023-05-24 17:16:13,459] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=5, node_rank=0
[2023-05-24 17:16:13,459] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4]})
[2023-05-24 17:16:13,459] [INFO] [launch.py:247:main] dist_world_size=5
[2023-05-24 17:16:13,459] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4
16666
16666
[2023-05-24 17:16:17,866] [INFO] using world size: 5 and model-parallel size: 1
[2023-05-24 17:16:17,866] [INFO] > padded vocab (size: 100) with 28 dummy tokens (new size: 128)
16666
16666
16666
[2023-05-24 17:16:18,841] [INFO] [RANK 0] > initializing model parallel with size 1
[2023-05-24 17:16:18,894] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter cpu_offload is deprecated use offload_optimizer instead
[2023-05-24 17:16:18,894] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter cpu_offload is deprecated use offload_optimizer instead
[2023-05-24 17:16:18,894] [INFO] [checkpointing.py:764:_configure_using_config_file] {'partition_activations': False, 'contiguous_memory_optimization': False, 'cpu_checkpointing': False, 'number_checkpoints': None, 'synchronize_checkpoint_boundary': False, 'profile': False}
[2023-05-24 17:16:18,894] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter cpu_offload is deprecated use offload_optimizer instead
[2023-05-24 17:16:18,895] [INFO] [checkpointing.py:226:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
[2023-05-24 17:16:18,895] [INFO] [RANK 0] building FineTuneVisualGLMModel model ...
[2023-05-24 17:16:18,899] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter cpu_offload is deprecated use offload_optimizer instead
[2023-05-24 17:16:18,899] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter cpu_offload is deprecated use offload_optimizer instead
/root/miniconda3/lib/python3.10/site-packages/torch/nn/init.py:405: UserWarning: Initializing zero-element tensors is a no-op
  warnings.warn("Initializing zero-element tensors is a no-op")
/root/miniconda3/lib/python3.10/site-packages/torch/nn/init.py:405: UserWarning: Initializing zero-element tensors is a no-op
  warnings.warn("Initializing zero-element tensors is a no-op")
/root/miniconda3/lib/python3.10/site-packages/torch/nn/init.py:405: UserWarning: Initializing zero-element tensors is a no-op
  warnings.warn("Initializing zero-element tensors is a no-op")
/root/miniconda3/lib/python3.10/site-packages/torch/nn/init.py:405: UserWarning: Initializing zero-element tensors is a no-op
  warnings.warn("Initializing zero-element tensors is a no-op")
/root/miniconda3/lib/python3.10/site-packages/torch/nn/init.py:405: UserWarning: Initializing zero-element tensors is a no-op
  warnings.warn("Initializing zero-element tensors is a no-op")
[2023-05-24 17:16:29,009] [INFO] [RANK 0]  > number of parameters on model parallel rank 0: 7811237376
[2023-05-24 17:16:39,527] [INFO] [RANK 4] CUDA out of memory. Tried to allocate 20.00 MiB (GPU 4; 3.82 GiB total capacity; 2.96 GiB already allocated; 18.62 MiB free; 3.08 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[2023-05-24 17:16:41,348] [INFO] [RANK 0] global rank 0 is loading checkpoint /root/.sat_models/visualglm-6b/1/mp_rank_00_model_states.pt
[2023-05-24 17:16:50,216] [INFO] [RANK 0] > successfully loaded /root/.sat_models/visualglm-6b/1/mp_rank_00_model_states.pt
[2023-05-24 17:16:50,688] [INFO] [RANK 0] Try to load tokenizer from Huggingface transformers...
[2023-05-24 17:16:51,354] [INFO] [RANK 0] > Set tokenizer as a THUDM/chatglm-6b tokenizer! Now you can get_tokenizer() everywhere.

WangRongsheng commented on May 28, 2024

Is my run failing to train on the GPU? What could be causing this?

(Same log as in the previous comment; the GPU status follows.)
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.86.01    Driver Version: 515.86.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  Off  | 00000000:01:00.0 Off |                    0 |
| N/A   64C    P0    84W / 275W |  16003MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  Off  | 00000000:47:00.0 Off |                    0 |
| N/A   63C    P0    86W / 275W |  16003MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  Off  | 00000000:81:00.0 Off |                    0 |
| N/A   63C    P0    81W / 275W |  16003MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA DGX Display  Off  | 00000000:C1:00.0 Off |                  N/A |
| 42%   57C    P8    N/A /  50W |   3893MiB /  4096MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM...  Off  | 00000000:C2:00.0 Off |                    0 |
| N/A   62C    P0    80W / 275W |  16003MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

GPU usage as shown above.
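One observation from this output, offered as a hypothesis rather than a confirmed diagnosis: GPU 3 is an NVIDIA DGX Display card with only 4 GiB of memory, yet the launcher was handed five devices (CUDA_VISIBLE_DEVICES=0,1,2,3,4), and the rank that hits CUDA out of memory reports a 3.82 GiB total capacity, matching the display card rather than an 80 GB A100. A sketch of a fix, assuming the display card should be excluded (verify the index mapping first, since CUDA may enumerate devices in a different order than nvidia-smi):

$ export CUDA_DEVICE_ORDER=PCI_BUS_ID   # make CUDA device indices follow nvidia-smi ordering
$ deepspeed --include localhost:0,1,2,4 --master_port 16666 finetune_visualglm.py ...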
