
Comments (6)

zhijianma commented on May 18, 2024

We suggest switching the DeepSpeed config file to ds_config_stage3_offload-opt.json or ds_config_stage3_offload-opt_offload-para.json. Training time also differs across configurations, so choose one according to your own GPU resources.
We tested on a T4 and an A100; resource consumption for reference:

| ds_config | T4 (16G, fp16) GPU memory | A100 (40G, bf16) GPU memory |
| --- | --- | --- |
| ds_config_stage3_offload-opt.json | ~14428 MiB | ~8588 MiB |
| ds_config_stage3_offload-opt_offload-para.json | ~8204 MiB | ~8682 MiB |
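
Judging by the file names, the two configs differ in which states ZeRO stage 3 offloads to CPU. A minimal sketch of the relevant zero_optimization section, with illustrative values (the actual files ship with the competition kit and may differ):

{
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        }
    }
}

The offload-opt file would contain only the offload_optimizer block, while the _offload-para variant adds offload_param, trading extra CPU memory and PCIe traffic for lower GPU memory, consistent with the T4 numbers in the table above.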


xnuohz commented on May 18, 2024

@zhijianma Thanks for the reply. After switching the DeepSpeed config file to ds_config_stage3_offload-opt.json or ds_config_stage3_offload-opt_offload-para.json, the error below occurs. My environment is PyTorch 2.1.0 / CUDA 11.6
(the CUDA version torch was built against may not match the locally installed CUDA version).

ORI NUMBER: 23237, AFTER FILETER: 22564, DROP NUMBER: 673
Total 22564 samples [ 6.48M tokens] in training!
 [WARNING]  cpu_adam cuda is missing or is incompatible with installed torch, only cpu ops can be compiled!
Using /home/ubuntu/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Emitting ninja build file /home/ubuntu/.cache/torch_extensions/py310_cu121/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.2267723083496094 seconds
Parameter Offload: Total persistent parameters: 643072 in 194 params
[2023-10-20 12:49:31,065] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 179742
[2023-10-20 12:49:31,067] [ERROR] [launch.py:321:sigkill_handler] ['/home/ubuntu/Softwares/anaconda3/envs/dj_comp/bin/python', '-u', 'train.py', '--local_rank=0', '--model_name_or_path', '../data/models/falcon-rw-1b', '--tokenizer', '../data/models/falcon-rw-1b', '--data_path', '../data/1b_data/v1/train_data.jsonl', '--output_dir', '../data/finetune/v1', '--per_device_train_batch_size', '1', '--gradient_accumulation_steps', '32', '--lang', 'en', '--bf16', 'True', '--gradient_checkpointing_enable', 'True', '--num_train_epochs', '3', '--model_max_length', '1024', '--learning_rate', '2.5e-5', '--weight_decay', '0', '--warmup_ratio', '0.03', '--evaluation_strategy', 'no', '--save_strategy', 'no', '--save_steps', '-1', '--save_total_limit', '999', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--tf32', 'True', '--deepspeed', '/home/ubuntu/Projects/ft_data_ranker_1b/competition_kit/lm-training/train_scripts/deepspeed_configs/ds_config_stage3_offload-opt_offload-para.json'] exits with return code = -9


zhijianma commented on May 18, 2024

Similar issues have been reported in DeepSpeed (#3463, #3824, #2788); the consensus there is that enabling offload triggers CPU OOM for some models.
In this case, I think you can try setting pin_memory to false in the ds_config:
"offload_param": {
    "device": "cpu",
    "pin_memory": false

If the above configuration still fails, you can go further and offload the parameters to disk:

"offload_param": {
    "device": "nvme",
    "nvme_path": "/your_nvme_path",


xnuohz commented on May 18, 2024

I switched from my local environment to a Docker instance and tried offloading to both CPU and disk; both fail :(

  • Offloading to CPU
ORI NUMBER: 23237, AFTER FILETER: 22564, DROP NUMBER: 673
Total 22564 samples [ 6.48M tokens] in training!
Using /root/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Creating extension directory /root/.cache/torch_extensions/py310_cu117/cpu_adam...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py310_cu117/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/4] /usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/opt/conda/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /opt/conda/lib/python3.10/site-packages/torch/include -isystem /opt/conda/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/lib/python3.10/site-packages/torch/include/TH -isystem /opt/conda/lib/python3.10/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_86,code=compute_86 -DBF16_AVAILABLE -c /opt/conda/lib/python3.10/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o 
[2/4] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/opt/conda/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /opt/conda/lib/python3.10/site-packages/torch/include -isystem /opt/conda/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/lib/python3.10/site-packages/torch/include/TH -isystem /opt/conda/lib/python3.10/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -L/usr/local/cuda/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX256__ -D__ENABLE_CUDA__ -DBF16_AVAILABLE -c /opt/conda/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o 
[3/4] c++ -MMD -MF cpu_adam_impl.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/opt/conda/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /opt/conda/lib/python3.10/site-packages/torch/include -isystem /opt/conda/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/lib/python3.10/site-packages/torch/include/TH -isystem /opt/conda/lib/python3.10/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -L/usr/local/cuda/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX256__ -D__ENABLE_CUDA__ -DBF16_AVAILABLE -c /opt/conda/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/cpu_adam_impl.cpp -o cpu_adam_impl.o 
[4/4] c++ cpu_adam.o cpu_adam_impl.o custom_cuda_kernel.cuda.o -shared -lcurand -L/opt/conda/lib/python3.10/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o cpu_adam.so
Loading extension module cpu_adam...
Time to load cpu_adam op: 17.70175266265869 seconds
Parameter Offload: Total persistent parameters: 643072 in 194 params
[2023-10-20 15:00:32,785] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 59
[2023-10-20 15:00:32,790] [ERROR] [launch.py:321:sigkill_handler] ['/opt/conda/bin/python', '-u', 'train.py', '--local_rank=0', '--model_name_or_path', '../data/models/falcon-rw-1b', '--tokenizer', '../data/models/falcon-rw-1b', '--data_path', '../data/1b_data/v1/train_data.jsonl', '--output_dir', '../data/finetune/v1', '--per_device_train_batch_size', '1', '--gradient_accumulation_steps', '32', '--lang', 'en', '--bf16', 'True', '--gradient_checkpointing_enable', 'True', '--num_train_epochs', '3', '--model_max_length', '1024', '--learning_rate', '2.5e-5', '--weight_decay', '0', '--warmup_ratio', '0.03', '--evaluation_strategy', 'no', '--save_strategy', 'no', '--save_steps', '-1', '--save_total_limit', '999', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--tf32', 'True', '--deepspeed', '/workspace/competition_kit/lm-training/train_scripts/deepspeed_configs/ds_config_stage3_offload-opt_offload-para.json'] exits with return code = -9
  • Offloading to disk (NVMe)
Loading model from ../data/models/falcon-rw-1b
Traceback (most recent call last):
  File "/workspace/competition_kit/lm-training/train.py", line 465, in <module>
    train()
  File "/workspace/competition_kit/lm-training/train.py", line 360, in train
    model = transformers.AutoModelForCausalLM.from_pretrained(
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 558, in from_pretrained
    return model_class.from_pretrained(
  File "/opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2961, in from_pretrained
    init_contexts = [deepspeed.zero.Init(config_dict_or_path=deepspeed_config())] + init_contexts
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 924, in __init__
    self.param_swapper = param_swapper or AsyncPartitionedParameterSwapper(_ds_config, self.dtype)
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/swap_tensor/partitioned_param_swapper.py", line 40, in __init__
    aio_op = AsyncIOBuilder().load(verbose=False)
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 446, in load
    return self.jit_load(verbose)
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 450, in jit_load
    raise RuntimeError(
RuntimeError: Unable to JIT load the async_io op due to it not being compatible due to hardware/software issue. None
[2023-10-20 15:04:40,690] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 398
[2023-10-20 15:04:40,691] [ERROR] [launch.py:321:sigkill_handler] ['/opt/conda/bin/python', '-u', 'train.py', '--local_rank=0', '--model_name_or_path', '../data/models/falcon-rw-1b', '--tokenizer', '../data/models/falcon-rw-1b', '--data_path', '../data/1b_data/v1/train_data.jsonl', '--output_dir', '../data/finetune/v1', '--per_device_train_batch_size', '1', '--gradient_accumulation_steps', '32', '--lang', 'en', '--bf16', 'True', '--gradient_checkpointing_enable', 'True', '--num_train_epochs', '3', '--model_max_length', '1024', '--learning_rate', '2.5e-5', '--weight_decay', '0', '--warmup_ratio', '0.03', '--evaluation_strategy', 'no', '--save_strategy', 'no', '--save_steps', '-1', '--save_total_limit', '999', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--tf32', 'True', '--deepspeed', '/workspace/competition_kit/lm-training/train_scripts/deepspeed_configs/ds_config_stage3_offload-opt_offload-para.json'] exits with return code = 1


github-actions commented on May 18, 2024

This issue is marked as stale because there has been no activity for 21 days. Remove the stale label or add new comments, or this issue will be closed in 3 days.


github-actions commented on May 18, 2024

Close this stale issue.

