
Comments (6)

zhijianma commented on May 18, 2024

We suggest switching the DeepSpeed config file to ds_config_stage3_offload-opt.json or ds_config_stage3_offload-opt_offload-para.json. Training time also differs across configurations, so choose one according to your own GPU resources.
We tested on a T4 and an A100; resource consumption for reference:

| ds_config | T4 (16G, fp16) GPU memory | A100 (40G, bf16) GPU memory |
| --- | --- | --- |
| ds_config_stage3_offload-opt.json | ~14428 MiB | ~8588 MiB |
| ds_config_stage3_offload-opt_offload-para.json | ~8204 MiB | ~8682 MiB |
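
Judging by the file names, the two configs differ in which states ZeRO stage 3 offloads to CPU. A minimal sketch of the relevant zero_optimization section, with illustrative values (the actual files ship with the competition kit and may differ):

{
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        }
    }
}

The offload-opt file would contain only the offload_optimizer block, while the _offload-para variant adds offload_param, trading extra CPU memory and PCIe traffic for lower GPU memory, consistent with the T4 numbers in the table above.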


xnuohz commented on May 18, 2024

@zhijianma Thanks for the reply. After switching the DeepSpeed config file to ds_config_stage3_offload-opt.json or ds_config_stage3_offload-opt_offload-para.json, the error below occurs. My environment is PyTorch 2.1.0 / CUDA 11.6
(the CUDA version torch was built against may not match the locally installed CUDA version).

ORI NUMBER: 23237, AFTER FILETER: 22564, DROP NUMBER: 673
Total 22564 samples [ 6.48M tokens] in training!
 [WARNING]  cpu_adam cuda is missing or is incompatible with installed torch, only cpu ops can be compiled!
Using /home/ubuntu/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Emitting ninja build file /home/ubuntu/.cache/torch_extensions/py310_cu121/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.2267723083496094 seconds
Parameter Offload: Total persistent parameters: 643072 in 194 params
[2023-10-20 12:49:31,065] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 179742
[2023-10-20 12:49:31,067] [ERROR] [launch.py:321:sigkill_handler] ['/home/ubuntu/Softwares/anaconda3/envs/dj_comp/bin/python', '-u', 'train.py', '--local_rank=0', '--model_name_or_path', '../data/models/falcon-rw-1b', '--tokenizer', '../data/models/falcon-rw-1b', '--data_path', '../data/1b_data/v1/train_data.jsonl', '--output_dir', '../data/finetune/v1', '--per_device_train_batch_size', '1', '--gradient_accumulation_steps', '32', '--lang', 'en', '--bf16', 'True', '--gradient_checkpointing_enable', 'True', '--num_train_epochs', '3', '--model_max_length', '1024', '--learning_rate', '2.5e-5', '--weight_decay', '0', '--warmup_ratio', '0.03', '--evaluation_strategy', 'no', '--save_strategy', 'no', '--save_steps', '-1', '--save_total_limit', '999', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--tf32', 'True', '--deepspeed', '/home/ubuntu/Projects/ft_data_ranker_1b/competition_kit/lm-training/train_scripts/deepspeed_configs/ds_config_stage3_offload-opt_offload-para.json'] exits with return code = -9


zhijianma commented on May 18, 2024

Similar issues have been reported in DeepSpeed (#3463, #3824, #2788); the consensus there is that enabling offload triggers CPU OOM for some models.
In this case, I think you can try setting pin_memory to false in the ds_config:
"offload_param": {
    "device": "cpu",
    "pin_memory": false

If the above configuration still fails, you can go further and offload the parameters to disk:

"offload_param": {
    "device": "nvme",
    "nvme_path": "/your_nvme_path",


xnuohz commented on May 18, 2024

I switched from my local environment to a Docker instance and tried offloading to both CPU and disk; both fail :(

  • Offloading to CPU
ORI NUMBER: 23237, AFTER FILETER: 22564, DROP NUMBER: 673
Total 22564 samples [ 6.48M tokens] in training!
Using /root/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Creating extension directory /root/.cache/torch_extensions/py310_cu117/cpu_adam...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py310_cu117/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/4] /usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/opt/conda/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /opt/conda/lib/python3.10/site-packages/torch/include -isystem /opt/conda/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/lib/python3.10/site-packages/torch/include/TH -isystem /opt/conda/lib/python3.10/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_86,code=compute_86 -DBF16_AVAILABLE -c /opt/conda/lib/python3.10/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o 
[2/4] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/opt/conda/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /opt/conda/lib/python3.10/site-packages/torch/include -isystem /opt/conda/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/lib/python3.10/site-packages/torch/include/TH -isystem /opt/conda/lib/python3.10/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -L/usr/local/cuda/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX256__ -D__ENABLE_CUDA__ -DBF16_AVAILABLE -c /opt/conda/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o 
[3/4] c++ -MMD -MF cpu_adam_impl.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/opt/conda/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /opt/conda/lib/python3.10/site-packages/torch/include -isystem /opt/conda/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/lib/python3.10/site-packages/torch/include/TH -isystem /opt/conda/lib/python3.10/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -L/usr/local/cuda/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX256__ -D__ENABLE_CUDA__ -DBF16_AVAILABLE -c /opt/conda/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/cpu_adam_impl.cpp -o cpu_adam_impl.o 
[4/4] c++ cpu_adam.o cpu_adam_impl.o custom_cuda_kernel.cuda.o -shared -lcurand -L/opt/conda/lib/python3.10/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o cpu_adam.so
Loading extension module cpu_adam...
Time to load cpu_adam op: 17.70175266265869 seconds
Parameter Offload: Total persistent parameters: 643072 in 194 params
[2023-10-20 15:00:32,785] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 59
[2023-10-20 15:00:32,790] [ERROR] [launch.py:321:sigkill_handler] ['/opt/conda/bin/python', '-u', 'train.py', '--local_rank=0', '--model_name_or_path', '../data/models/falcon-rw-1b', '--tokenizer', '../data/models/falcon-rw-1b', '--data_path', '../data/1b_data/v1/train_data.jsonl', '--output_dir', '../data/finetune/v1', '--per_device_train_batch_size', '1', '--gradient_accumulation_steps', '32', '--lang', 'en', '--bf16', 'True', '--gradient_checkpointing_enable', 'True', '--num_train_epochs', '3', '--model_max_length', '1024', '--learning_rate', '2.5e-5', '--weight_decay', '0', '--warmup_ratio', '0.03', '--evaluation_strategy', 'no', '--save_strategy', 'no', '--save_steps', '-1', '--save_total_limit', '999', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--tf32', 'True', '--deepspeed', '/workspace/competition_kit/lm-training/train_scripts/deepspeed_configs/ds_config_stage3_offload-opt_offload-para.json'] exits with return code = -9
  • Offloading to disk (NVMe)
Loading model from ../data/models/falcon-rw-1b
Traceback (most recent call last):
  File "/workspace/competition_kit/lm-training/train.py", line 465, in <module>
    train()
  File "/workspace/competition_kit/lm-training/train.py", line 360, in train
    model = transformers.AutoModelForCausalLM.from_pretrained(
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 558, in from_pretrained
    return model_class.from_pretrained(
  File "/opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2961, in from_pretrained
    init_contexts = [deepspeed.zero.Init(config_dict_or_path=deepspeed_config())] + init_contexts
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 924, in __init__
    self.param_swapper = param_swapper or AsyncPartitionedParameterSwapper(_ds_config, self.dtype)
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/swap_tensor/partitioned_param_swapper.py", line 40, in __init__
    aio_op = AsyncIOBuilder().load(verbose=False)
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 446, in load
    return self.jit_load(verbose)
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 450, in jit_load
    raise RuntimeError(
RuntimeError: Unable to JIT load the async_io op due to it not being compatible due to hardware/software issue. None
[2023-10-20 15:04:40,690] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 398
[2023-10-20 15:04:40,691] [ERROR] [launch.py:321:sigkill_handler] ['/opt/conda/bin/python', '-u', 'train.py', '--local_rank=0', '--model_name_or_path', '../data/models/falcon-rw-1b', '--tokenizer', '../data/models/falcon-rw-1b', '--data_path', '../data/1b_data/v1/train_data.jsonl', '--output_dir', '../data/finetune/v1', '--per_device_train_batch_size', '1', '--gradient_accumulation_steps', '32', '--lang', 'en', '--bf16', 'True', '--gradient_checkpointing_enable', 'True', '--num_train_epochs', '3', '--model_max_length', '1024', '--learning_rate', '2.5e-5', '--weight_decay', '0', '--warmup_ratio', '0.03', '--evaluation_strategy', 'no', '--save_strategy', 'no', '--save_steps', '-1', '--save_total_limit', '999', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--tf32', 'True', '--deepspeed', '/workspace/competition_kit/lm-training/train_scripts/deepspeed_configs/ds_config_stage3_offload-opt_offload-para.json'] exits with return code = 1


github-actions commented on May 18, 2024

This issue is marked as stale because there has been no activity for 21 days. Remove the stale label or add new comments, or this issue will be closed in 3 days.


github-actions commented on May 18, 2024

Close this stale issue.

