✨Welcome to the Data Intelligence Lab @ HKU!✨
🚀 Our Lab is Passionately Dedicated to Exploring the Forefront of the Data Science & AI 👨💻
[SIGIR'2024] "GraphGPT: Graph Instruction Tuning for Large Language Models"
Home Page: https://arxiv.org/abs/2310.13023
License: Apache License 2.0
Hello, author, I try to run training stage one's code and I found the original config vicuna-7b-v1.5 lack of some config item needed in the code, such as, pretrain_graph_model_path
. Would you provide the modified version?
graph_matching.json
The above page opens with a 404 error. Please provide a correct url. Thansk.
environment error
This is a very interesting project. Could you please provide some pretraining code for GNN-based baselines? Is the training process similar to common procedures? For example, is the last layer of the GNN the same as the number of categories, or does the GNN generate representations that are then used to train a separate logistic regression classifier?
Regarding the zero-shot process for the baseline, could you specify the exact configurations? For instance, ArXiv has 40 categories, and for Cora and PubMed, the number of categories is different. How should this discrepancy be handled? If possible, could you provide some example code?
Thank you for your response!
Hi,
Thank you very much for your work. Could please provide the pre-training code for structure-text grounding on large graphs such as Arxiv, as we feel only providing the checkpoint is hard to reproduce the experimental results on large graphs.
Sorry to bother you...
The url of graph transformer seems like the same as GraphGPT.
You mean [https://github.com/seongjunyun/Graph_Transformer_Networks] ?
memory 40G*2
hey bro!!Thank you very much for your work. I was greatly inspired after reading it. But I have a small question about the baseline result that I would like to ask you.
I noticed you cited this article:https://arxiv.org/pdf/2305.19523.pdf. But the result of Arxiv is different from you. I see that you follow the same public division, but the GCN in this article reached 0.7182 and the SAGE reached 0.7171, while in your paper they were only 0.5267 and 0.5480 respectively. Why?
It's a wonderful research, but I have a question: all the training data seems to be collected from the research paper domain. Therefore, if it lacks the capability to zero-shot infer graphs from other domains, such as medicine or communication ?
ModuleNotFoundError: No module named 'graphgpt' when runing sh ./scripts/tune_script/graphgpt_stage1.sh
Traceback (most recent call last):
File "/afs/crc.nd.edu/user/k/kle3/DIAL-Lab/GraphGPT/graphgpt/train/train_mem.py", line 4, in <module>
Traceback (most recent call last):
File "/afs/crc.nd.edu/user/k/kle3/DIAL-Lab/GraphGPT/graphgpt/train/train_mem.py", line 4, in <module>
from graphgpt.train.llama_flash_attn_monkey_patch import (
ModuleNotFoundError: No module named 'graphgpt'from graphgpt.train.llama_flash_attn_monkey_patch import (
ModuleNotFoundError: No module named 'graphgpt'
Traceback (most recent call last):
File "/afs/crc.nd.edu/user/k/kle3/DIAL-Lab/GraphGPT/graphgpt/train/train_mem.py", line 4, in <module>
from graphgpt.train.llama_flash_attn_monkey_patch import (
ModuleNotFoundError: No module named 'graphgpt'
Traceback (most recent call last):
File "/afs/crc.nd.edu/user/k/kle3/DIAL-Lab/GraphGPT/graphgpt/train/train_mem.py", line 4, in <module>
from graphgpt.train.llama_flash_attn_monkey_patch import (
ModuleNotFoundError: No module named 'graphgpt'
[2024-02-13 15:35:41,869] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 1123051) of binary: /afs/crc.nd.edu/user/k/kle3/.conda/envs/GraphGPT/bin/python
Traceback (most recent call last):
File "/afs/crc.nd.edu/user/k/kle3/.conda/envs/GraphGPT/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/afs/crc.nd.edu/user/k/kle3/.conda/envs/GraphGPT/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/afs/crc.nd.edu/user/k/kle3/.conda/envs/GraphGPT/lib/python3.10/site-packages/torch/distributed/run.py", line 810, in <module>
main()
File "/afs/crc.nd.edu/user/k/kle3/.conda/envs/GraphGPT/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/afs/crc.nd.edu/user/k/kle3/.conda/envs/GraphGPT/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
run(args)
File "/afs/crc.nd.edu/user/k/kle3/.conda/envs/GraphGPT/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/afs/crc.nd.edu/user/k/kle3/.conda/envs/GraphGPT/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/afs/crc.nd.edu/user/k/kle3/.conda/envs/GraphGPT/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
graphgpt/train/train_mem.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2024-02-13_15:35:41
host : qa-a100-002.crc.nd.edu
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 1123052)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2024-02-13_15:35:41
host : qa-a100-002.crc.nd.edu
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 1123053)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
time : 2024-02-13_15:35:41
host : qa-a100-002.crc.nd.edu
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 1123054)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-02-13_15:35:41
host : qa-a100-002.crc.nd.edu
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 1123051)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Hi,
Thank you for your inspiring work.
Would you share your code for generating those provided prompting files? I am trying to reproduce your work on different datasets, and I believe it will save me much trouble if you provided so.
Your work is very impressive. Aligning GNN with LM can help it fit with LLaMA. But I wonder why align with a transformer training from scratch? Is this better than align with LLaMA?
Is there a script provided to calculate the acc and f1 score as is the paper in the evaluation module?
After running the evaluation script, only can we generate the output json file.
There seems to be a lack of an ACC calculating process.
2023-11-09 16:41:33,050 INFO worker.py:1673 -- Started a local Ray instance.
(eval_model pid=11217) Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
(eval_model pid=11217) start loading
Traceback (most recent call last):
File "./run_graphgpt.py", line 240, in
run_eval(args, args.num_gpus)
File "./run_graphgpt.py", line 94, in run_eval
ans_jsons.extend(ray.get(ans_handle))
File "/home/fry/.conda/envs/graphgpt/lib/python3.8/site-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
return fn(*args, **kwargs)
File "/home/fry/.conda/envs/graphgpt/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
File "/home/fry/.conda/envs/graphgpt/lib/python3.8/site-packages/ray/_private/worker.py", line 2563, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(AssertionError): ray::eval_model() (pid=11217, ip=172.27.37.124)
File "/home/fry/.conda/envs/graphgpt/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "./run_graphgpt.py", line 116, in eval_model
model = GraphLlamaForCausalLM.from_pretrained(args.model_name, torch_dtype=torch.float16, use_cache=True, low_cpu_mem_usage=True).cuda()
File "/home/fry/.conda/envs/graphgpt/lib/python3.8/site-packages/transformers/modeling_utils.py", line 3085, in from_pretrained
model = cls(config, *model_args, **model_kwargs)
File "/home/fry/桌面/GraphGPT-main/graphgpt/model/GraphLlama.py", line 284, in init
self.model = GraphLlamaModel(config)
File "/home/fry/桌面/GraphGPT-main/graphgpt/model/GraphLlama.py", line 104, in init
clip_graph, args= load_model_pretrained(CLIP, config.pretrain_graph_model_path)
File "/home/fry/桌面/GraphGPT-main/graphgpt/model/GraphLlama.py", line 55, in load_model_pretrained
assert osp.exists(osp.join(pretrain_model_path, 'config.json')), 'config.json missing'
AssertionError: config.json missing
(eval_model pid=11217) Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
(eval_model pid=11217) finish loading
(eval_model pid=11217) start loading
我下载的是您的checkpoints
在运行graphgpt_stage1.sh 时,我出现了这个错误。在您的最新更新中,我看到您提到可以使用两张3090显卡进行复现。我的配置是RTX 4090 * 4卡,但仍然出现了cuda内存不够的情况,希望您能帮助我解答这个问题,谢谢。
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB.
GPU 0 has a total capacty of 23.65 GiB of which 54.06 MiB is free. Process 835399 has 23.59 GiB memory in use.
Of the allocated memory 23.21 GiB is allocated by PyTorch, and 4.64 MiB is reserved by PyTorch but unallocated.
If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.
See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
I have a question about the training weights of embedding. I used my own datasets to process stage 1 (which includes tuning the embedding weights of new graph tokens, e.g. DEFAULT_GRAPH_TOKEN = ""), but the weights became Nan instantly, I don't know why. Thanks for your patience.
Hi, I'm interested in this work but I'm still not sure about the dataset construction so I would like to ask some questions about the dataset for the experimental phase.
I would like to understand how the data forms of id, conversations and graph are constructed when constructing the dataset for stage1, stage2 and evaluation. I found that there is a difference in the data form used for stage1, stage2 and evaluation, is it possible to interpret the data form for the data samples of these three stages?
For example the following three samples.
I encountered an error No module named 'fastchat' while running the train_mem.py
some details:
(from fastchat.conversation) in model_adapter.py
Is the fastchat the folder name of a project ?
It is really a very original and amazing work! I have a few questions to consult. How to you transoform the 768 / 1024 dimension BERT embedding to 128 dimension embedding for original node embedding? Thanks for your replying. Happy Christmas!
Hi there, thanks for offering this interesting project! I have trouble when conducting the Self-Supervised Instruction Tuning. Specifically, the error goes as follows:
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [29074,0,0], thread: [31,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed. terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: device-side assert triggered
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:31 (most recent call first):
Traceback (most recent call last):
File "graphgpt/train/train_mem.py", line 15, in <module>
train()
File "/home/k/lgm/graphGPT-main/graphgpt/train/train_graph.py", line 943, in train
trainer.train()
File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/transformers/trainer.py", line 1555, in train
return inner_training_loop(
File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/transformers/trainer.py", line 1860, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/transformers/trainer.py", line 2725, in training_step
loss = self.compute_loss(model, inputs)
File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/transformers/trainer.py", line 2748, in compute_loss
outputs = model(**inputs)
File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1040, in forward
output = self._run_ddp_forward(*inputs, **kwargs)
File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1000, in _run_ddp_forward
return module_to_run(*inputs[0], **kwargs[0])
File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/accelerate/utils/operations.py", line 581, in forward
return model_forward(*args, **kwargs)
File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/accelerate/utils/operations.py", line 569, in __call__
return convert_to_fp32(self.model_forward(*args, **kwargs))
File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/amp/autocast_mode.py", line 14, in decorate_autocast
return func(*args, **kwargs)
File "/home/k/lgm/graphGPT-main/graphgpt/model/GraphLlama.py", line 332, in forward
outputs = self.model(
File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/home/k/lgm/graphGPT-main/graphgpt/model/GraphLlama.py", line 277, in forward
return super(GraphLlamaModel, self).forward(
File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 912, in forward
layer_outputs = self._gradient_checkpointing_func(
File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint
return CheckpointFunction.apply(function, preserve, *args)
File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 107, in forward
outputs = run_function(*args)
File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 672, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/home/k/lgm/graphGPT-main/graphgpt/train/llama_flash_attn_monkey_patch.py", line 88, in forward
output_unpad = flash_attn_unpadded_qkvpacked_func(
File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/flash_attn/flash_attn_interface.py", line 256, in flash_attn_unpadded_qkvpacked_func
return FlashAttnQKVPackedFunc.apply(qkv, cu_seqlens, max_seqlen, dropout_p, softmax_scale,
File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/flash_attn/flash_attn_interface.py", line 59, in forward
qkv[:, 0], qkv[:, 1], qkv[:, 2], torch.empty_like(qkv[:, 0]), cu_seqlens, cu_seqlens,
RuntimeError: CUDA error: device-side assert triggered
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 510117 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 510118 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 510120 closing signal SIGTERM
I use the suggested configurations (environments, scripts) and conduct the tuning on a Linux server equipped with 4 A100 in a distributed manner. Still, I have also tried to conduct the tuning on one GPU merely. To avoid CUDA OOM error, I have modified the train/eval batch size to 1. However, I have encountered another error as follows:
Token indices sequence length is longer than the specified maximum sequence length for this model (3338 > 2048). Running this sequence through the model will result in indexing errors
Traceback (most recent call last):
File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/k/lgm/graphGPT-main/graphgpt/train/train_mem.py", line 15, in <module>
train()
File "/home/k/lgm/graphGPT-main/graphgpt/train/train_graph.py", line 943, in train
trainer.train()
File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/transformers/trainer.py", line 1555, in train
return inner_training_loop(
File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/transformers/trainer.py", line 1860, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/transformers/trainer.py", line 2725, in training_step
loss = self.compute_loss(model, inputs)
File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/transformers/trainer.py", line 2748, in compute_loss
outputs = model(**inputs)
File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 171, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 181, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 89, in parallel_apply
output.reraise()
File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/_utils.py", line 543, in reraise
raise exception
StopIteration: Caught StopIteration in replica 0 on device 0.
Original Traceback (most recent call last):
File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 64, in _worker
output = module(*input, **kwargs)
File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/home/k/lgm/graphGPT-main/graphgpt/model/GraphLlama.py", line 332, in forward
outputs = self.model(
File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/home/k/lgm/graphGPT-main/graphgpt/model/GraphLlama.py", line 209, in forward
node_forward_out = graph_tower(g)
File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/home/k/lgm/graphGPT-main/graphgpt/model/graph_layers/graph_transformer.py", line 64, in forward
device = self.parameters().__next__().device
StopIteration
Therefore, the tuning process can not be reproduced on either single or multiple GPUs. Any suggestions for troubleshooting would be appreciated. Looking forward to your kind reply!
作者您好!
有幸了解到您的工作,对比论文和代码产生的一些疑惑:
1)cal_cl_loss部分
论文中有transformation functions, 但是实际上logit_scale对应exp(𝜏),那么g_i^(1)和g_i^(2)对应什么呢?
此处我的理解是:
在定义encode_text的时候,有一行是x = x @ self.text_projection,这里的text_projection对应的是text的transformation
functions么?但是为什么graph_text没有呢?
2)对齐的疑问
graph encoder的空间是一个128维的空间,encode_text确实一个512的空间,通过text_projection投影的128的空间。那么最终对齐的含义是什么?是说让graph encoder的空间和text encoder的一个投影空间对齐么?
感谢您抽空回答,谢谢!
how to solve this error ,do you have any experience of this ,thanks
在GraphGPT论文中,关于图编码器用的是graph transformer,给出的引用是[62] Seongjun Yun, Minbyul Jeong, Raehyun Kim, Jaewoo Kang, and Hyunwoo J Kim.
2019. Graph transformer networks. In NeurIPS, Vol. 32(第[61]号引用也是这个)。 这个图网络叫GTN,不是一个典型的transformer架构的网络。然而,本仓库的开源代码中却使用了一个带有位置编码、图结构和MHA的GNN(类似于github.com/HKUDS/GFormer),引用是不是写错了。
This is a very interesting job! One thing I am curious about is whether to use the [CLS] token embedding obtained through BERT processing as a feature or to use the last hidden states as features.
import torch
import os
import sys
sys.path.append(u'/home/fry/桌面/GraphGPT/graphgpt')
from conversation import conv_templates, SeparatorStyle
from transformers import CLIPVisionModel, CLIPImageProcessor, StoppingCriteria
from model import *
为了找到上级目录下的模块,我用了sys.path,在修改完引用之后,运行之后又出现了其他一些找不到模块的问题,因为文件较多有点复杂不知道该怎么搞,之前把graphgpt_eval.sh以及run_graphgpt.py放在根目录里是可以运行的,但是运行一段时间后还是会有找不到模块的问题,这些文件有点复杂位置都很分散,现在不知道该如何解决了,恳请您的指点!
It is a very nice work and inspires me a lot. Do you have plan to release the code for the structure-text grounding part ?
这个预训练模型可以用来做图结构的下游预测任务么?有相关示例么?
在text-graph grounding代码中,有如下函数:
def cal_cl_loss(s_features, t_features, labels):
logit_scale = nn.Parameter(torch.ones([]) * np.log(1 / 0.07)).exp()
logits = logit_scale * s_features @ t_features.t()
loss_i = F.cross_entropy(logits, labels)
loss_t = F.cross_entropy(logits.T, labels)
ret_loss = (loss_i + loss_t) / 2
return ret_loss
然而,在反向传播时仅仅优化了model中的参数,这个logit_scale似乎不会被训练,这与G2P2或CLIP开源实现中的对比学习方式有一些区别。请问此处的参数会被训练嘛?
请问这个模型是否具备以下能力:
1、根据论文标题或摘要中的部分文本查询到对应论文的详细信息,比如完整标题、摘要、分类
2、在不给定上下文的情况下,查询一篇论文引用的其他论文
Traceback (most recent call last):
File "/home/shaozhihui/szh/GraphGPT/graphgpt/train/train_mem.py", line 15, in <module>
train()
File "/home/shaozhihui/szh/GraphGPT/graphgpt/train/train_graph.py", line 863, in train
model_graph_dict = model.get_model().initialize_graph_modules(
File "/home/shaozhihui/szh/GraphGPT/graphgpt/model/GraphLlama.py", line 148, in initialize_graph_modules
graph_tower.requires_grad_(False)
AttributeError: 'str' object has no attribute 'requires_grad_'
我认为可能是vicuna的config.json中的问题:
"graph_hidden_size": 128,
"pretrain_graph_model_path": "/home/shaozhihui/szh/GraphGPT/graph_transformer/"
这是我的运行命令
model_path=./vicuna-7b-v1.5-16k
instruct_ds=./data/stage_1/graph_matching.json
graph_data_path=./graph_data/all_graph_data.pt
pretra_gnn=./graph_transformer/clip_gt_arxiv_pub.pkl
output_model=./checkpoints/stage_1
wandb offline
CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 --master_port=20001 \
graphgpt/train/train_mem.py \
--model_name_or_path ${model_path} \
--version v1 \
--data_path ${instruct_ds} \
--graph_content ./arxiv_ti_ab.json \
--graph_data_path ${graph_data_path} \
--graph_tower ${pretra_gnn} \
--tune_graph_mlp_adapter True \
--graph_select_layer -2 \
--use_graph_start_end \
--bf16 False \
--output_dir ${output_model} \
--num_train_epochs 3 \
--per_device_train_batch_size 2 \
--per_device_eval_batch_size 2 \
--gradient_accumulation_steps 1 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 2400 \
--save_total_limit 1 \
--learning_rate 2e-3 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 False \
--model_max_length 2048 \
--gradient_checkpointing True \
--lazy_preprocess True \
--report_to wandb
有人能帮帮我吗,谢谢
It is a very nice work and inspires me a lot. How do you evaluate the predictions generated from LLM? The paper claims that evaluating with Acc and F1. However it can be hard to evaluate the text sometimes.
where is this file ??QAQ
conda env create -n graphgpt python=3.8
I believe should be:
conda create -n graphgpt python=3.8
> It is a very nice work and inspires me a lot. How do you evaluate the predictions generated from LLM? The paper claims that evaluating with Acc and F1. However it can be hard to evaluate the text sometimes.
Thank you for your interest in our GraphGPT. I apologize for the delayed response due to the academic workload at the end of the semester.
The evaluation code will release by the end of this week!
Wishing you an early Merry Christmas!
Originally posted by @tjb-tech in #19 (comment)
An error occurs on the line 801 of train_graph.py, that is 'pretrain_graph_model_path' is not defined.
This part of the code "Pretra_gnn=./clip_gt_arxiv " is not included in your project .When I was running the code, I found that this part of the error should be related to clip_gt_arxiv_pub.pkl, and I also found that there is a corresponding reading method in the graphgpt-main/graphgpt/model/graphllama.py file. Do I need to call this code to reproduce the paper? And how I can run it?
If so, how should I organize the graph and place the data file where
Reading the paper, it seems that the parameters of the LLM are always frozen
I found out from the number of tuned parameters in the paper and from the code that the tuned parameters also contain the input embeddings of the language model, which confused me.
Also, could input embedding changes affect llama's own abilities and cause catastrophic forgetting? Thanks to the author for the great work, please help with the answer.
bin/python: can't open file 'graphgpt/train/train_mem.py': [Errno 2] No such file or directory
can you give me some advice ,thanks
我在使用您提供的轻量化模型第一阶段训练的时候,在第一个epoch训练结束后,发生NCCL超时的错误,想请问一下,有什么办法解决。我是在八张4090下进行训练的,训练和测试的batchsize_per_device均为2.
[E ProcessGroupNCCL.cpp:488] Some NCCL operations have failed or timed out.
Due to the asynchronous nature of CUDA kernels, subsequent GPU operations
might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:494] To avoid data inconsistency, we are taking the
entire process down.
[E ProcessGroupNCCL.cpp:915] [Rank 1] NCCL watchdog thread terminated with
exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(
SeqNum=778778, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000
) ran for 1800413 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1
] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=778778, OpT
ype=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800413
milliseconds before timing out.
My mirror channels is as follows.
channels:
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.