Comments (3)
Hi there, thanks for offering this interesting project! I ran into trouble while running the Self-Supervised Instruction Tuning stage. Specifically, the error is as follows:
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [29074,0,0], thread: [31,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
terminate called after throwing an instance of 'c10::Error'
  what(): CUDA error: device-side assert triggered
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:31 (most recent call first):
Traceback (most recent call last):
  File "graphgpt/train/train_mem.py", line 15, in <module>
    train()
  File "/home/k/lgm/graphGPT-main/graphgpt/train/train_graph.py", line 943, in train
    trainer.train()
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/transformers/trainer.py", line 1555, in train
    return inner_training_loop(
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/transformers/trainer.py", line 1860, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/transformers/trainer.py", line 2725, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/transformers/trainer.py", line 2748, in compute_loss
    outputs = model(**inputs)
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1040, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1000, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/accelerate/utils/operations.py", line 581, in forward
    return model_forward(*args, **kwargs)
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/accelerate/utils/operations.py", line 569, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/amp/autocast_mode.py", line 14, in decorate_autocast
    return func(*args, **kwargs)
  File "/home/k/lgm/graphGPT-main/graphgpt/model/GraphLlama.py", line 332, in forward
    outputs = self.model(
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/k/lgm/graphGPT-main/graphgpt/model/GraphLlama.py", line 277, in forward
    return super(GraphLlamaModel, self).forward(
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 912, in forward
    layer_outputs = self._gradient_checkpointing_func(
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 107, in forward
    outputs = run_function(*args)
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 672, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/k/lgm/graphGPT-main/graphgpt/train/llama_flash_attn_monkey_patch.py", line 88, in forward
    output_unpad = flash_attn_unpadded_qkvpacked_func(
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/flash_attn/flash_attn_interface.py", line 256, in flash_attn_unpadded_qkvpacked_func
    return FlashAttnQKVPackedFunc.apply(qkv, cu_seqlens, max_seqlen, dropout_p, softmax_scale,
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/flash_attn/flash_attn_interface.py", line 59, in forward
    qkv[:, 0], qkv[:, 1], qkv[:, 2], torch.empty_like(qkv[:, 0]), cu_seqlens, cu_seqlens,
RuntimeError: CUDA error: device-side assert triggered
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 510117 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 510118 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 510120 closing signal SIGTERM
I use the suggested configuration (environment, scripts) and run the tuning on a Linux server equipped with 4 A100 GPUs in a distributed manner. I have also tried running the tuning on a single GPU; to avoid a CUDA OOM error, I reduced the train/eval batch size to 1. However, I then encountered another error, as follows:
Token indices sequence length is longer than the specified maximum sequence length for this model (3338 > 2048). Running this sequence through the model will result in indexing errors
Traceback (most recent call last):
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/k/lgm/graphGPT-main/graphgpt/train/train_mem.py", line 15, in <module>
    train()
  File "/home/k/lgm/graphGPT-main/graphgpt/train/train_graph.py", line 943, in train
    trainer.train()
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/transformers/trainer.py", line 1555, in train
    return inner_training_loop(
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/transformers/trainer.py", line 1860, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/transformers/trainer.py", line 2725, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/transformers/trainer.py", line 2748, in compute_loss
    outputs = model(**inputs)
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 171, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 181, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 89, in parallel_apply
    output.reraise()
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/_utils.py", line 543, in reraise
    raise exception
StopIteration: Caught StopIteration in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 64, in _worker
    output = module(*input, **kwargs)
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/k/lgm/graphGPT-main/graphgpt/model/GraphLlama.py", line 332, in forward
    outputs = self.model(
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/k/lgm/graphGPT-main/graphgpt/model/GraphLlama.py", line 209, in forward
    node_forward_out = graph_tower(g)
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/k/lgm/graphGPT-main/graphgpt/model/graph_layers/graph_transformer.py", line 64, in forward
    device = self.parameters().__next__().device
StopIteration
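For what it's worth, the final frame suggests why this second run fails: inside an nn.DataParallel replica (PyTorch >= 1.5), a replicated module no longer exposes its parameters, so self.parameters().__next__() raises StopIteration. A defensive variant of that device lookup, my own sketch rather than the repository's code, would look like:

import torch
import torch.nn as nn

def module_device(module: nn.Module,
                  fallback: torch.device = torch.device("cpu")) -> torch.device:
    """Device of the module's first parameter, with a fallback.

    next(module.parameters()) raises StopIteration when the module exposes
    no parameters -- e.g. inside an nn.DataParallel replica -- which is
    exactly the failure at graph_transformer.py line 64 above.
    """
    try:
        return next(module.parameters()).device
    except StopIteration:
        return fallback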
Therefore, the tuning process cannot be reproduced on either a single GPU or multiple GPUs. Any suggestions for troubleshooting would be appreciated. Looking forward to your kind reply!
Excuse me, have you solved this problem?
Thank you for your interest in our GraphGPT. I apologize for the delayed response due to the academic workload at the end of the semester.
You may be able to fix this error by commenting out the call to replace_llama_attn_with_flash_attn() on line 8 of https://github.com/HKUDS/GraphGPT/blob/main/graphgpt/train/train_mem.py, so that the file becomes:
# Make it more memory efficient by monkey patching the LLaMA model with FlashAttn.
# Need to call this before importing transformers.
from graphgpt.train.llama_flash_attn_monkey_patch import (
    replace_llama_attn_with_flash_attn,
)

# replace_llama_attn_with_flash_attn()

from graphgpt.train.train_graph import train

if __name__ == "__main__":
    train()
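For context, one plausible reading of the two logs (my assumption, not something verified against every setup): the tokenizer warning shows a 3338-token sample exceeding the 2048-token LLaMA context window, and indexing position embeddings past that limit is a classic trigger for the scatter/gather "index out of bounds" assert; disabling the FlashAttention patch mainly turns the opaque device-side assert into a readable Python error. Independently of the patch, clipping over-long samples avoids the overflow. A minimal sketch using standard Hugging Face tokenizer arguments (the model path is a placeholder, not the repo's actual config):

from transformers import AutoTokenizer

# Placeholder path: substitute the base LLM your training script loads.
tokenizer = AutoTokenizer.from_pretrained("path/to/base-llm")
tokenizer.model_max_length = 2048  # LLaMA-1 context window

sample = "..."  # e.g. the 3338-token instruction sample from the log
enc = tokenizer(
    sample,
    truncation=True,                        # drop tokens beyond max_length
    max_length=tokenizer.model_max_length,
    return_tensors="pt",
)
assert enc["input_ids"].shape[1] <= 2048  # sequence now fits the model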
If that doesn't work, feel free to ask me further questions.
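One more general debugging tip for the device-side assert: CUDA kernel launches are asynchronous, so the Python frame in the traceback is often not the op that actually failed. Forcing synchronous launches pins the error to the real call site. This is a standard PyTorch debugging technique, sketched below; it slows training, so remove it once the faulty index is found.

import os

# Must be set before CUDA is initialized, i.e. before the first CUDA call.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # import torch (and start training) only after setting the flag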
Wishing you an early Merry Christmas!
Thank you for your attention. Please refer to my reply above.
Related Issues (20)
- About the OOM error HOT 3
- Could this model be used for fine-tuning on custom graph data? HOT 9
- A question about training weights of embedding HOT 1
- Some questions about the contrastive learning HOT 8
- How to calculate the Accuracy and Macro-f1 from the evaluation output? HOT 2
- Extract the Trained Projector HOT 6
- AttributeError: 'str' object has no attribute 'requires_grad_' HOT 7
- Hugging Face is inaccessible; could you add an alternative download link? HOT 2
- About baseline codes HOT 2
- It is a very nice work and inspires me a lot. How do you evaluate the predictions generated from the LLM? The paper claims evaluation with Acc and F1, but it can be hard to evaluate the text sometimes. HOT 6
- Ask for the construction of datasets HOT 4
- bug in docs HOT 1
- `graphgpt_stage1_lightning` hits an NCCL timeout error after the first training epoch. HOT 1
- The error occurred while installing the packages listed in requirements.txt. HOT 2
- A question about the graph encoder HOT 1
- ModuleNotFoundError: No module named 'graphgpt' HOT 2
- Can this pretrained model be used for downstream prediction tasks on graph structures? Are there related examples? HOT 1
- CUDA out of memory.
- A problem occurred when I tried to run stage1.sh HOT 4
- I can not find this file: ./arxiv_ti_ab.json HOT 1