Comments (6)
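We recommend switching the DeepSpeed config file to ds_config_stage3_offload-opt.json
or ds_config_stage3_offload-opt_offload-para.json.
Training time also differs between these configurations, so choose according to your own GPU resources.
We tested on a T4 and an A100; the resource consumption, for reference, was: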
| ds_config | T4 (16G, fp16) GPU memory | A100 (40G, bf16) GPU memory |
|---|---|---|
| ds_config_stage3_offload-opt.json | ~14428 MiB | ~8588 MiB |
| ds_config_stage3_offload-opt_offload-para.json | ~8204 MiB | ~8682 MiB |
from data-juicer.
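@zhijianma Thanks for the reply. After switching the DeepSpeed config file to ds_config_stage3_offload-opt.json
or ds_config_stage3_offload-opt_offload-para.json,
I get the error below. My environment is PyTorch 2.1.0/CUDA 11.6
(the CUDA version that torch was built with may not match the locally installed CUDA version).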
ORI NUMBER: 23237, AFTER FILETER: 22564, DROP NUMBER: 673
Total 22564 samples [ 6.48M tokens] in training!
[WARNING] cpu_adam cuda is missing or is incompatible with installed torch, only cpu ops can be compiled!
Using /home/ubuntu/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Emitting ninja build file /home/ubuntu/.cache/torch_extensions/py310_cu121/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.2267723083496094 seconds
Parameter Offload: Total persistent parameters: 643072 in 194 params
[2023-10-20 12:49:31,065] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 179742
[2023-10-20 12:49:31,067] [ERROR] [launch.py:321:sigkill_handler] ['/home/ubuntu/Softwares/anaconda3/envs/dj_comp/bin/python', '-u', 'train.py', '--local_rank=0', '--model_name_or_path', '../data/models/falcon-rw-1b', '--tokenizer', '../data/models/falcon-rw-1b', '--data_path', '../data/1b_data/v1/train_data.jsonl', '--output_dir', '../data/finetune/v1', '--per_device_train_batch_size', '1', '--gradient_accumulation_steps', '32', '--lang', 'en', '--bf16', 'True', '--gradient_checkpointing_enable', 'True', '--num_train_epochs', '3', '--model_max_length', '1024', '--learning_rate', '2.5e-5', '--weight_decay', '0', '--warmup_ratio', '0.03', '--evaluation_strategy', 'no', '--save_strategy', 'no', '--save_steps', '-1', '--save_total_limit', '999', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--tf32', 'True', '--deepspeed', '/home/ubuntu/Projects/ft_data_ranker_1b/competition_kit/lm-training/train_scripts/deepspeed_configs/ds_config_stage3_offload-opt_offload-para.json'] exits with return code = -9
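A note on the last log line: on POSIX systems, Python's subprocess module reports a return code of -N when the child process was terminated by signal N, so `-9` most likely means SIGKILL. With offload enabled, one common cause is the kernel OOM killer reacting to exhausted host RAM, although the log alone does not prove that. A minimal sketch for decoding such codes (the function name `describe_return_code` is made up for illustration):

```python
import signal


def describe_return_code(code: int) -> str:
    """Turn a subprocess-style return code into a readable description."""
    if code == 0:
        return "exited normally"
    if code < 0:
        # Negative codes mean "terminated by signal -code" on POSIX.
        try:
            name = signal.Signals(-code).name
        except ValueError:
            # Signal number unknown on this platform.
            name = f"signal {-code}"
        return f"killed by {name}"
    return f"exited with error status {code}"


print(describe_return_code(-9))  # on Linux: killed by SIGKILL
print(describe_return_code(1))   # exited with error status 1
```

If the OOM killer was responsible, the kernel log (for example via `dmesg`) usually records which process it killed.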
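Similar problems have been reported in DeepSpeed issues #3463, #3824, and #2788, where people found that enabling offload triggers CPU OOM on some models.
In this case, I think you can try setting pin_memory to false in the ds_config:
"offload_param": {
    "device": "cpu",
    "pin_memory": false
}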
If the above configuration still fails, you can go on to try offloading to disk (NVMe):
"offload_param": {
    "device": "nvme",
    "nvme_path": "/your_nvme_path"
}
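The two suggestions above can also be applied programmatically instead of editing the JSON by hand. The sketch below is only an illustration: the helper `set_param_offload` is a hypothetical name, and the `example` dict is a small stand-in, not the actual contents of ds_config_stage3_offload-opt_offload-para.json. The keys `zero_optimization`, `offload_param`, `device`, `pin_memory`, and `nvme_path` are the DeepSpeed config keys quoted in this thread.

```python
import copy
import json


def set_param_offload(config, device="cpu", pin_memory=False, nvme_path=None):
    """Return a copy of a DeepSpeed config dict with offload_param adjusted."""
    patched = copy.deepcopy(config)  # leave the caller's dict untouched
    zero = patched.setdefault("zero_optimization", {})
    offload = zero.setdefault("offload_param", {})
    offload["device"] = device
    offload["pin_memory"] = pin_memory
    if device == "nvme":
        if nvme_path is None:
            raise ValueError("nvme_path is required when device is 'nvme'")
        offload["nvme_path"] = nvme_path
    else:
        # A stale nvme_path makes no sense for CPU offload.
        offload.pop("nvme_path", None)
    return patched


# Stand-in config for illustration only (assumed shape, not the real file).
example = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
    }
}

cpu_cfg = set_param_offload(example, device="cpu", pin_memory=False)
nvme_cfg = set_param_offload(example, device="nvme", nvme_path="/your_nvme_path")

print(json.dumps(cpu_cfg["zero_optimization"]["offload_param"], indent=2))
print(json.dumps(nvme_cfg["zero_optimization"]["offload_param"], indent=2))
```

To use it on a real file, load the file with `json.load`, pass the resulting dict through the helper, and write the result to a new path that is then given to `--deepspeed`.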
I switched my environment from the local machine to a Docker instance and tried offloading to both CPU and disk; both fail :(
- Offload to CPU
ORI NUMBER: 23237, AFTER FILETER: 22564, DROP NUMBER: 673
Total 22564 samples [ 6.48M tokens] in training!
Using /root/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Creating extension directory /root/.cache/torch_extensions/py310_cu117/cpu_adam...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py310_cu117/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/4] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/opt/conda/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /opt/conda/lib/python3.10/site-packages/torch/include -isystem /opt/conda/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/lib/python3.10/site-packages/torch/include/TH -isystem /opt/conda/lib/python3.10/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_86,code=compute_86 -DBF16_AVAILABLE -c /opt/conda/lib/python3.10/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o
[2/4] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/opt/conda/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /opt/conda/lib/python3.10/site-packages/torch/include -isystem /opt/conda/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/lib/python3.10/site-packages/torch/include/TH -isystem /opt/conda/lib/python3.10/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -L/usr/local/cuda/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX256__ -D__ENABLE_CUDA__ -DBF16_AVAILABLE -c /opt/conda/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o
[3/4] c++ -MMD -MF cpu_adam_impl.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/opt/conda/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /opt/conda/lib/python3.10/site-packages/torch/include -isystem /opt/conda/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/lib/python3.10/site-packages/torch/include/TH -isystem /opt/conda/lib/python3.10/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -L/usr/local/cuda/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX256__ -D__ENABLE_CUDA__ -DBF16_AVAILABLE -c /opt/conda/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/cpu_adam_impl.cpp -o cpu_adam_impl.o
[4/4] c++ cpu_adam.o cpu_adam_impl.o custom_cuda_kernel.cuda.o -shared -lcurand -L/opt/conda/lib/python3.10/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o cpu_adam.so
Loading extension module cpu_adam...
Time to load cpu_adam op: 17.70175266265869 seconds
Parameter Offload: Total persistent parameters: 643072 in 194 params
[2023-10-20 15:00:32,785] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 59
[2023-10-20 15:00:32,790] [ERROR] [launch.py:321:sigkill_handler] ['/opt/conda/bin/python', '-u', 'train.py', '--local_rank=0', '--model_name_or_path', '../data/models/falcon-rw-1b', '--tokenizer', '../data/models/falcon-rw-1b', '--data_path', '../data/1b_data/v1/train_data.jsonl', '--output_dir', '../data/finetune/v1', '--per_device_train_batch_size', '1', '--gradient_accumulation_steps', '32', '--lang', 'en', '--bf16', 'True', '--gradient_checkpointing_enable', 'True', '--num_train_epochs', '3', '--model_max_length', '1024', '--learning_rate', '2.5e-5', '--weight_decay', '0', '--warmup_ratio', '0.03', '--evaluation_strategy', 'no', '--save_strategy', 'no', '--save_steps', '-1', '--save_total_limit', '999', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--tf32', 'True', '--deepspeed', '/workspace/competition_kit/lm-training/train_scripts/deepspeed_configs/ds_config_stage3_offload-opt_offload-para.json'] exits with return code = -9
- Offload to disk
Loading model from ../data/models/falcon-rw-1b
Traceback (most recent call last):
File "/workspace/competition_kit/lm-training/train.py", line 465, in <module>
train()
File "/workspace/competition_kit/lm-training/train.py", line 360, in train
model = transformers.AutoModelForCausalLM.from_pretrained(
File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 558, in from_pretrained
return model_class.from_pretrained(
File "/opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2961, in from_pretrained
init_contexts = [deepspeed.zero.Init(config_dict_or_path=deepspeed_config())] + init_contexts
File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 924, in __init__
self.param_swapper = param_swapper or AsyncPartitionedParameterSwapper(_ds_config, self.dtype)
File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/swap_tensor/partitioned_param_swapper.py", line 40, in __init__
aio_op = AsyncIOBuilder().load(verbose=False)
File "/opt/conda/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 446, in load
return self.jit_load(verbose)
File "/opt/conda/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 450, in jit_load
raise RuntimeError(
RuntimeError: Unable to JIT load the async_io op due to it not being compatible due to hardware/software issue. None
[2023-10-20 15:04:40,690] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 398
[2023-10-20 15:04:40,691] [ERROR] [launch.py:321:sigkill_handler] ['/opt/conda/bin/python', '-u', 'train.py', '--local_rank=0', '--model_name_or_path', '../data/models/falcon-rw-1b', '--tokenizer', '../data/models/falcon-rw-1b', '--data_path', '../data/1b_data/v1/train_data.jsonl', '--output_dir', '../data/finetune/v1', '--per_device_train_batch_size', '1', '--gradient_accumulation_steps', '32', '--lang', 'en', '--bf16', 'True', '--gradient_checkpointing_enable', 'True', '--num_train_epochs', '3', '--model_max_length', '1024', '--learning_rate', '2.5e-5', '--weight_decay', '0', '--warmup_ratio', '0.03', '--evaluation_strategy', 'no', '--save_strategy', 'no', '--save_steps', '-1', '--save_total_limit', '999', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--tf32', 'True', '--deepspeed', '/workspace/competition_kit/lm-training/train_scripts/deepspeed_configs/ds_config_stage3_offload-opt_offload-para.json'] exits with return code = 1
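The RuntimeError in this traceback is raised while `AsyncIOBuilder().load()` tries to JIT-build DeepSpeed's async_io op, which NVMe offload depends on. Before launching training, it can help to ask the builder whether it considers the environment compatible. The sketch below assumes that `AsyncIOBuilder` (the class named in the traceback) exposes `is_compatible()`, as DeepSpeed op builders do; the wrapper `async_io_status` is a hypothetical helper. A frequent cause of incompatibility is a missing libaio development package, but that is an assumption about this particular Docker image, not something the log confirms.

```python
def async_io_status():
    """Return (usable, detail) for DeepSpeed's async_io op in this environment."""
    try:
        from deepspeed.ops.op_builder import AsyncIOBuilder
    except Exception as exc:  # deepspeed or one of its dependencies is missing/broken
        return False, f"could not import deepspeed: {exc!r}"

    try:
        compatible = bool(AsyncIOBuilder().is_compatible())
    except Exception as exc:  # the compatibility probe itself failed
        return False, f"compatibility check raised: {exc!r}"

    if compatible:
        return True, "async_io reports compatible; NVMe offload can try to load it"
    return False, "async_io reports incompatible; NVMe offload is expected to fail"


usable, detail = async_io_status()
print(usable, detail)
```

DeepSpeed's `ds_report` command prints a similar compatibility table for all ops, which may be quicker for a one-off check.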
This issue is marked as stale because there has been no activity for 21 days. Remove the stale label or add a new comment, or this issue will be closed in 3 days.
Close this stale issue.