
graphgpt's Issues

About the baseline code

This is a very interesting project. Could you please provide the pretraining code for the GNN-based baselines? Is the training process the standard one? For example, does the last GNN layer output logits whose dimension equals the number of categories, or does the GNN produce representations that are then used to train a separate logistic regression classifier?

Regarding the zero-shot process for the baseline, could you specify the exact configurations? For instance, ArXiv has 40 categories, and for Cora and PubMed, the number of categories is different. How should this discrepancy be handled? If possible, could you provide some example code?
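
To make the two setups I am asking about concrete, here is a minimal sketch (my own illustration, not the authors' code, assuming PyTorch Geometric and a toy random graph):

import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv
from sklearn.linear_model import LogisticRegression

# Toy graph: 100 nodes, 16 features, 4 classes, random edges; first 60 nodes are "train".
x = torch.randn(100, 16)
edge_index = torch.randint(0, 100, (2, 400))
y = torch.randint(0, 4, (100,))

class GCN(torch.nn.Module):
    def __init__(self, in_dim, hid_dim, out_dim):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hid_dim)
        self.conv2 = GCNConv(hid_dim, out_dim)
    def forward(self, x, edge_index):
        return self.conv2(F.relu(self.conv1(x, edge_index)), edge_index)

# (a) End-to-end: the last layer's output dimension equals the number of classes.
clf_gnn = GCN(16, 64, 4)
loss = F.cross_entropy(clf_gnn(x, edge_index)[:60], y[:60])

# (b) Linear probing: the GNN outputs representations, and a separate
# logistic-regression classifier is fit on the (frozen) embeddings.
enc = GCN(16, 64, 32)
with torch.no_grad():
    emb = enc(x, edge_index).numpy()
probe = LogisticRegression(max_iter=500).fit(emb[:60], y[:60].numpy())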

Thank you for your response!

Requesting the code for evaluation

This is very nice work and it inspires me a lot. How do you evaluate the predictions generated by the LLM? The paper states that the evaluation uses Acc and F1, but free-form generated text can be hard to score sometimes.
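
For reference, this is the kind of evaluation I would naively write (a sketch with made-up class names and outputs, not necessarily your procedure): map each generated text to a label, e.g. by matching class names, then compute accuracy and macro-F1 with scikit-learn.

from sklearn.metrics import accuracy_score, f1_score

# Hypothetical class names, model outputs, and gold labels for illustration.
classes = ["cs.AI", "cs.CL", "cs.LG"]
outputs = ["This paper belongs to cs.LG.", "Category: cs.AI", "unclear"]
gold = ["cs.LG", "cs.AI", "cs.CL"]

def to_label(text):
    # Take the first class name mentioned in the generated text;
    # fall back to a dummy label if none is found.
    for c in classes:
        if c.lower() in text.lower():
            return c
    return "unknown"

pred = [to_label(t) for t in outputs]
print(accuracy_score(gold, pred), f1_score(gold, pred, average="macro"))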

bug in docs

conda env create -n graphgpt python=3.8
I believe this should be:
conda create -n graphgpt python=3.8

The error occurred while installing the packages listed in requirements.txt.

Question about the construction of the datasets

Hi, I'm interested in this work, but I'm still not sure about the dataset construction, so I would like to ask some questions about the datasets used in the experimental phase.
I would like to understand how the id, conversations, and graph fields are constructed when building the datasets for stage 1, stage 2, and evaluation. I noticed that the data format differs across these three stages; could you explain the data format of the samples for each of them?

For example, consider the following three samples.

  1. a sample from Jiabin99/GraphGPT-eval-instruction (evaluation) [screenshot]

  2. a sample from Jiabin99/Arxiv-PubMed-mix-NC-LP (stage 2) [screenshot]

  3. a sample from Jiabin99/graph-matching (stage 1) [screenshot]

Why are the baseline results so low?

Hey! Thank you very much for your work; I was greatly inspired after reading it. I have a small question about the baseline results that I would like to ask you.
I noticed you cited this paper: https://arxiv.org/pdf/2305.19523.pdf, but its ArXiv results differ from yours. Both papers seem to follow the same public split, yet that paper reports 0.7182 for GCN and 0.7171 for GraphSAGE, while your paper reports only 0.5267 and 0.5480 respectively. Why is that?

Missing the code for structure-text grounding on large graphs

Hi,

Thank you very much for your work. Could you please provide the pre-training code for structure-text grounding on large graphs such as Arxiv? With only the checkpoint provided, it is hard to reproduce the experimental results on large graphs.

A detail question about the contrastive learning

In the text-graph grounding code, there is the following function:

def cal_cl_loss(s_features, t_features, labels):
    logit_scale = nn.Parameter(torch.ones([]) * np.log(1 / 0.07)).exp()
    logits = logit_scale * s_features @ t_features.t()
    loss_i = F.cross_entropy(logits, labels)
    loss_t = F.cross_entropy(logits.T, labels)
    ret_loss = (loss_i + loss_t) / 2
    return ret_loss

However, during backpropagation only the parameters inside the model are optimized; this logit_scale does not appear to be trained, which differs from the contrastive learning in the open-source implementations of G2P2 and CLIP. Is this parameter actually trained here?
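
For comparison, here is a sketch of the CLIP-style arrangement I had in mind (my paraphrase of those open-source implementations, not code from this repository), where the temperature is registered on the module so the optimizer updates it:

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroundingHead(nn.Module):
    def __init__(self):
        super().__init__()
        # Registered as a module parameter, so it is returned by
        # model.parameters() and receives gradient updates.
        self.logit_scale = nn.Parameter(torch.ones([]) * np.log(1 / 0.07))

    def cl_loss(self, s_features, t_features, labels):
        logits = self.logit_scale.exp() * s_features @ t_features.t()
        return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2

By contrast, an nn.Parameter created inside the loss function is rebuilt on every call and is never handed to the optimizer.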

Extract the Trained Projector

Is pytorch_model.bin.index.json supposed to be part of the output after stage-1 training? When running extract_projector.sh for the "Extract the Trained Projector" step, I get the error No such file or directory: './checkpoints/stage_1/pytorch_model.bin.index.json'.
My stage-1 training output is:
[screenshot of the stage-1 output directory]
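
In case it is useful, this is the workaround I am considering (a sketch under the assumption that stage 1 saved a single, non-sharded pytorch_model.bin and that the projector weights live under a key containing 'graph_projector'; please correct me if the key name is different):

import torch

# Load the monolithic checkpoint and keep only the projector weights.
state = torch.load("./checkpoints/stage_1/pytorch_model.bin", map_location="cpu")
projector = {k: v for k, v in state.items() if "graph_projector" in k}
print(list(projector.keys()))  # confirm the actual key names here
torch.save(projector, "./checkpoints/stage_1/graph_projector.bin")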

`graphgpt_stage1_lightning` hits an NCCL timeout after the first training epoch.

When running stage-1 training with the lightweight model you provided, an NCCL timeout occurs after the first epoch finishes. Is there any way to solve this? I am training on eight RTX 4090s, with both the training and evaluation batch_size_per_device set to 2.

[E ProcessGroupNCCL.cpp:488] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:494] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:915] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=778778, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800413 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=778778, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800413 milliseconds before timing out.
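
One thing I have been looking into: the watchdog fires when a collective waits longer than the default 30-minute NCCL timeout, which can happen when one rank does slow end-of-epoch work (checkpointing, evaluation) while the others wait. If that wait is expected, the process group can be created with a larger timeout; a general sketch (not specific to the lightning script, which sets up distributed training itself):

from datetime import timedelta
import torch.distributed as dist

# Raise the collective timeout from the default 30 minutes to 2 hours.
dist.init_process_group(backend="nccl", timeout=timedelta(hours=2))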

No module named 'fastchat'

I encountered the error No module named 'fastchat' while running train_mem.py.
Some details:
the failing import is from fastchat.conversation, in model_adapter.py

Is fastchat the folder name of another project?

About the missing config.json file

2023-11-09 16:41:33,050 INFO worker.py:1673 -- Started a local Ray instance.
(eval_model pid=11217) Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
(eval_model pid=11217) start loading
Traceback (most recent call last):
File "./run_graphgpt.py", line 240, in
run_eval(args, args.num_gpus)
File "./run_graphgpt.py", line 94, in run_eval
ans_jsons.extend(ray.get(ans_handle))
File "/home/fry/.conda/envs/graphgpt/lib/python3.8/site-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
return fn(*args, **kwargs)
File "/home/fry/.conda/envs/graphgpt/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
File "/home/fry/.conda/envs/graphgpt/lib/python3.8/site-packages/ray/_private/worker.py", line 2563, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(AssertionError): ray::eval_model() (pid=11217, ip=172.27.37.124)
File "/home/fry/.conda/envs/graphgpt/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "./run_graphgpt.py", line 116, in eval_model
model = GraphLlamaForCausalLM.from_pretrained(args.model_name, torch_dtype=torch.float16, use_cache=True, low_cpu_mem_usage=True).cuda()
File "/home/fry/.conda/envs/graphgpt/lib/python3.8/site-packages/transformers/modeling_utils.py", line 3085, in from_pretrained
model = cls(config, *model_args, **model_kwargs)
File "/home/fry/桌面/GraphGPT-main/graphgpt/model/GraphLlama.py", line 284, in init
self.model = GraphLlamaModel(config)
File "/home/fry/桌面/GraphGPT-main/graphgpt/model/GraphLlama.py", line 104, in init
clip_graph, args= load_model_pretrained(CLIP, config.pretrain_graph_model_path)
File "/home/fry/桌面/GraphGPT-main/graphgpt/model/GraphLlama.py", line 55, in load_model_pretrained
assert osp.exists(osp.join(pretrain_model_path, 'config.json')), 'config.json missing'
AssertionError: config.json missing
(eval_model pid=11217) Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
(eval_model pid=11217) finish loading
(eval_model pid=11217) start loading
I downloaded your checkpoints.

About text feature

This is very interesting work! One thing I am curious about is whether you use the [CLS] token embedding obtained from BERT as the feature, or the last hidden states.
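
To clarify what I mean by the two options (a small sketch of my own, not your code):

import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

inputs = tok("Graph instruction tuning for large language models.", return_tensors="pt")
with torch.no_grad():
    out = bert(**inputs)

cls_feat = out.last_hidden_state[:, 0]         # option 1: the [CLS] token embedding
mean_feat = out.last_hidden_state.mean(dim=1)  # option 2: pooled last hidden states
print(cls_feat.shape, mean_feat.shape)         # both (1, 768)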

A question about the graph encoder

The GraphGPT paper says the graph encoder is a graph transformer, and the citation given is [62] Seongjun Yun, Minbyul Jeong, Raehyun Kim, Jaewoo Kang, and Hyunwoo J Kim. 2019. Graph transformer networks. In NeurIPS, Vol. 32 (citation [61] is the same). That network is GTN, which is not a typical transformer-architecture network. However, the open-source code in this repository uses a GNN with positional encodings, graph structure, and MHA (similar to github.com/HKUDS/GFormer). Is the citation written incorrectly?

Performance of LLM baselines too low

Excuse me, I would like to know the experimental settings of the LLM baseline models. Does the task template also include the title and abstract of the paper?
I used vicuna-7B-v1.5 for prediction on the PubMed data, and the ACC I obtained is 0.86, which is much higher than the result in the paper.
[screenshot]

A question about training the embedding weights

I have a question about training the embedding weights. I used my own datasets for stage 1 (which includes tuning the embedding weights of the new graph tokens, e.g. DEFAULT_GRAPH_TOKEN = ""), but the weights became NaN almost instantly, and I don't know why. Thanks for your patience.
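
For reference, one common recipe I have seen (a sketch, not this repository's code) is to start the newly added embedding rows at the mean of the existing ones; the NaNs may of course come from something else, such as a too-large learning rate or fp16 overflow:

import torch

def init_new_token_embeddings(model, tokenizer, num_new_tokens):
    # Resize, then set the newly added rows of the input/output embeddings
    # to the mean of the pre-existing rows instead of arbitrary values.
    model.resize_token_embeddings(len(tokenizer))
    emb_in = model.get_input_embeddings().weight.data
    emb_out = model.get_output_embeddings().weight.data
    emb_in[-num_new_tokens:] = emb_in[:-num_new_tokens].mean(dim=0, keepdim=True)
    emb_out[-num_new_tokens:] = emb_out[:-num_new_tokens].mean(dim=0, keepdim=True)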

ModuleNotFoundError: No module named 'graphgpt'

ModuleNotFoundError: No module named 'graphgpt' when running sh ./scripts/tune_script/graphgpt_stage1.sh

Traceback (most recent call last):
  File "/afs/crc.nd.edu/user/k/kle3/DIAL-Lab/GraphGPT/graphgpt/train/train_mem.py", line 4, in <module>
    from graphgpt.train.llama_flash_attn_monkey_patch import (
ModuleNotFoundError: No module named 'graphgpt'
(the same traceback is printed by each of the four ranks)
[2024-02-13 15:35:41,869] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 1123051) of binary: /afs/crc.nd.edu/user/k/kle3/.conda/envs/GraphGPT/bin/python
Traceback (most recent call last):
  File "/afs/crc.nd.edu/user/k/kle3/.conda/envs/GraphGPT/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/afs/crc.nd.edu/user/k/kle3/.conda/envs/GraphGPT/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/afs/crc.nd.edu/user/k/kle3/.conda/envs/GraphGPT/lib/python3.10/site-packages/torch/distributed/run.py", line 810, in <module>
    main()
  File "/afs/crc.nd.edu/user/k/kle3/.conda/envs/GraphGPT/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/afs/crc.nd.edu/user/k/kle3/.conda/envs/GraphGPT/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/afs/crc.nd.edu/user/k/kle3/.conda/envs/GraphGPT/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/afs/crc.nd.edu/user/k/kle3/.conda/envs/GraphGPT/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/afs/crc.nd.edu/user/k/kle3/.conda/envs/GraphGPT/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
graphgpt/train/train_mem.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-02-13_15:35:41
  host      : qa-a100-002.crc.nd.edu
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 1123052)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-02-13_15:35:41
  host      : qa-a100-002.crc.nd.edu
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 1123053)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-02-13_15:35:41
  host      : qa-a100-002.crc.nd.edu
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 1123054)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-02-13_15:35:41
  host      : qa-a100-002.crc.nd.edu
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 1123051)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Some questions about the contrastive learning

Hello!
I'm glad to have come across your work; comparing the paper with the code raised a few questions:
1) The cal_cl_loss part
[screenshot]
The paper mentions transformation functions, and logit_scale apparently corresponds to exp(𝜏), but what do g_i^(1) and g_i^(2) correspond to?
My understanding is: in encode_text there is a line x = x @ self.text_projection, so does this text_projection correspond to the transformation function for the text? But then why is there no counterpart on the graph side?
2) A question about the alignment
The graph encoder's space is 128-dimensional, while encode_text produces a 512-dimensional space that is projected down to 128 dimensions through text_projection. What does the final alignment mean, then? Is it that the graph encoder's space is aligned with a projection of the text encoder's space?
Thank you for taking the time to answer!
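
To state question 1) precisely, the objective I have in mind looks like the following (my own paraphrase of a generic CLIP-style loss; the paper's exact notation may differ):

% s_i / t_j: graph / text features; g^{(1)}, g^{(2)}: transformation
% (projection) functions; \tau: learnable temperature; y: matching labels.
\begin{aligned}
\Lambda_{ij} &= \exp(\tau)\,\big\langle g^{(1)}(s_i),\, g^{(2)}(t_j) \big\rangle,\\
\mathcal{L} &= \tfrac{1}{2}\Big(\mathrm{CE}(\Lambda,\, y) + \mathrm{CE}(\Lambda^{\top},\, y)\Big).
\end{aligned}

My question is which components in the code play the roles of g^{(1)} and g^{(2)}.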

A problem occurred when I tried to run stage1.sh

The file referenced by "Pretra_gnn=./clip_gt_arxiv" is not included in your project. When I ran the code, I found that this error seems to be related to clip_gt_arxiv_pub.pkl, and there is a corresponding loading method in graphgpt-main/graphgpt/model/graphllama.py. Do I need this file to reproduce the paper, and how can I run it?

> This is very nice work and it inspires me a lot. How do you evaluate the predictions generated by the LLM? The paper states that the evaluation uses Acc and F1, but free-form generated text can be hard to score sometimes.

Thank you for your interest in our GraphGPT. I apologize for the delayed response due to the academic workload at the end of the semester.
The evaluation code will be released by the end of this week!
Wishing you an early Merry Christmas!

Originally posted by @tjb-tech in #19 (comment)

About the text features

It is a really original and amazing piece of work! I have a few questions to ask. How do you transform the 768/1024-dimensional BERT embeddings into the 128-dimensional embeddings used as the original node embeddings? Thanks for your reply. Happy Christmas!
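
For instance, I could imagine either of the following (a sketch of plausible options on random data, not necessarily what you did):

import torch
import torch.nn as nn
from sklearn.decomposition import PCA

bert_emb = torch.randn(1000, 768)  # placeholder node text embeddings

# (a) a learnable linear projection trained together with the graph encoder
proj = nn.Linear(768, 128)
node_feat = proj(bert_emb)

# (b) an unsupervised reduction such as PCA, applied once offline
node_feat_pca = torch.from_numpy(PCA(n_components=128).fit_transform(bert_emb.numpy()))
print(node_feat.shape, node_feat_pca.shape)  # both (1000, 128)

Which of these (or some other scheme) is used for the released node features?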

Error in Self-Supervised Instruction Tuning

Hi there, thanks for offering this interesting project! I ran into trouble while running the Self-Supervised Instruction Tuning. Specifically, the error is as follows:

../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [29074,0,0], thread: [31,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed. terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: device-side assert triggered
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:31 (most recent call first):
Traceback (most recent call last):
  File "graphgpt/train/train_mem.py", line 15, in <module>
    train()
  File "/home/k/lgm/graphGPT-main/graphgpt/train/train_graph.py", line 943, in train
    trainer.train()
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/transformers/trainer.py", line 1555, in train
    return inner_training_loop(
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/transformers/trainer.py", line 1860, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/transformers/trainer.py", line 2725, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/transformers/trainer.py", line 2748, in compute_loss
    outputs = model(**inputs)
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1040, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1000, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/accelerate/utils/operations.py", line 581, in forward
    return model_forward(*args, **kwargs)
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/accelerate/utils/operations.py", line 569, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/amp/autocast_mode.py", line 14, in decorate_autocast
    return func(*args, **kwargs)
  File "/home/k/lgm/graphGPT-main/graphgpt/model/GraphLlama.py", line 332, in forward
    outputs = self.model(
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/k/lgm/graphGPT-main/graphgpt/model/GraphLlama.py", line 277, in forward
    return super(GraphLlamaModel, self).forward(
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 912, in forward
    layer_outputs = self._gradient_checkpointing_func(
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 107, in forward
    outputs = run_function(*args)
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 672, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/k/lgm/graphGPT-main/graphgpt/train/llama_flash_attn_monkey_patch.py", line 88, in forward
    output_unpad = flash_attn_unpadded_qkvpacked_func(
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/flash_attn/flash_attn_interface.py", line 256, in flash_attn_unpadded_qkvpacked_func
    return FlashAttnQKVPackedFunc.apply(qkv, cu_seqlens, max_seqlen, dropout_p, softmax_scale,
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/flash_attn/flash_attn_interface.py", line 59, in forward
    qkv[:, 0], qkv[:, 1], qkv[:, 2], torch.empty_like(qkv[:, 0]), cu_seqlens, cu_seqlens,
RuntimeError: CUDA error: device-side assert triggered
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 510117 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 510118 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 510120 closing signal SIGTERM

I use the suggested configurations (environment, scripts) and run the tuning on a Linux server equipped with 4 A100 GPUs in a distributed manner. I have also tried running the tuning on a single GPU. To avoid a CUDA OOM error, I modified the train/eval batch size to 1. However, I then encountered another error:

Token indices sequence length is longer than the specified maximum sequence length for this model (3338 > 2048). Running this sequence through the model will result in indexing errors
Traceback (most recent call last):
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/k/lgm/graphGPT-main/graphgpt/train/train_mem.py", line 15, in <module>
    train()
  File "/home/k/lgm/graphGPT-main/graphgpt/train/train_graph.py", line 943, in train
    trainer.train()
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/transformers/trainer.py", line 1555, in train
    return inner_training_loop(
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/transformers/trainer.py", line 1860, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/transformers/trainer.py", line 2725, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/transformers/trainer.py", line 2748, in compute_loss
    outputs = model(**inputs)
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 171, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 181, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 89, in parallel_apply
    output.reraise()
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/_utils.py", line 543, in reraise
    raise exception
StopIteration: Caught StopIteration in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 64, in _worker
    output = module(*input, **kwargs)
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/k/lgm/graphGPT-main/graphgpt/model/GraphLlama.py", line 332, in forward
    outputs = self.model(
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/k/lgm/graphGPT-main/graphgpt/model/GraphLlama.py", line 209, in forward
    node_forward_out = graph_tower(g)
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/k/lgm/graphGPT-main/graphgpt/model/graph_layers/graph_transformer.py", line 64, in forward
    device = self.parameters().__next__().device
StopIteration

Therefore, the tuning process cannot be reproduced on either a single GPU or multiple GPUs. Any suggestions for troubleshooting would be appreciated. Looking forward to your kind reply!

Why does the paper say that only the parameters of the alignment projector are optimized, when the tuned parameters actually also include the input embeddings of the language model?

From the number of tuned parameters reported in the paper and from the code, I found that the tuned parameters also include the input embeddings of the language model, which confuses me.
Also, could changing the input embeddings affect LLaMA's own abilities and cause catastrophic forgetting? Thanks to the authors for the great work; please help with an answer.
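
For reference, this is the quick check I used to see which parameter groups are trainable (a sketch assuming a Hugging Face-style model object):

def trainable_summary(model):
    # Group trainable parameters by their top-level module name and count them.
    groups = {}
    for name, p in model.named_parameters():
        if p.requires_grad:
            key = name.split(".")[0]
            groups[key] = groups.get(key, 0) + p.numel()
    for key, n in groups.items():
        print(f"{key}: {n / 1e6:.2f}M trainable parameters")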

CUDA out of memory

When running graphgpt_stage1.sh I got this error. In your latest update you mention that the results can be reproduced with two 3090 GPUs. My setup is 4 × RTX 4090, yet I still run out of CUDA memory. I hope you can help me with this problem, thank you.
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB.
GPU 0 has a total capacty of 23.65 GiB of which 54.06 MiB is free. Process 835399 has 23.59 GiB memory in use.
Of the allocated memory 23.21 GiB is allocated by PyTorch, and 4.64 MiB is reserved by PyTorch but unallocated.
If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.
See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

solved now

bin/python: can't open file 'graphgpt/train/train_mem.py': [Errno 2] No such file or directory

Node identification and relational reasoning capabilities of the model

Does this model have the following capabilities:
1. Given part of the text of a paper's title or abstract, retrieve the detailed information of the corresponding paper, such as the full title, abstract, and category;
2. Without any given context, retrieve the other papers cited by a given paper.

Would you share your code for generating prompting files?

Hi,
Thank you for your inspiring work.
Would you share your code for generating the provided prompting files? I am trying to reproduce your work on different datasets, and I believe it would save me a lot of trouble if you did.

AttributeError: 'str' object has no attribute 'requires_grad_'

Traceback (most recent call last):
  File "/home/shaozhihui/szh/GraphGPT/graphgpt/train/train_mem.py", line 15, in <module>
    train()
  File "/home/shaozhihui/szh/GraphGPT/graphgpt/train/train_graph.py", line 863, in train
    model_graph_dict = model.get_model().initialize_graph_modules(
  File "/home/shaozhihui/szh/GraphGPT/graphgpt/model/GraphLlama.py", line 148, in initialize_graph_modules
    graph_tower.requires_grad_(False)
AttributeError: 'str' object has no attribute 'requires_grad_'

I think the problem may be in vicuna's config.json:

"graph_hidden_size": 128, 
"pretrain_graph_model_path": "/home/shaozhihui/szh/GraphGPT/graph_transformer/"

My directory structure is as follows:
[screenshot of the directory structure]

This is my run command:

model_path=./vicuna-7b-v1.5-16k
instruct_ds=./data/stage_1/graph_matching.json
graph_data_path=./graph_data/all_graph_data.pt
pretra_gnn=./graph_transformer/clip_gt_arxiv_pub.pkl
output_model=./checkpoints/stage_1

wandb offline
CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 --master_port=20001 \
    graphgpt/train/train_mem.py \
    --model_name_or_path ${model_path} \
    --version v1 \
    --data_path ${instruct_ds} \
    --graph_content ./arxiv_ti_ab.json \
    --graph_data_path ${graph_data_path} \
    --graph_tower ${pretra_gnn} \
    --tune_graph_mlp_adapter True \
    --graph_select_layer -2 \
    --use_graph_start_end \
    --bf16 False \
    --output_dir ${output_model} \
    --num_train_epochs 3 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2400 \
    --save_total_limit 1 \
    --learning_rate 2e-3 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 False \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --report_to wandb

Can anyone help me? Thanks.

About the OOM error

Hello, does graphgpt_stage1.sh correspond to Stage-1-freeze or Stage-1-tune in the experiment shown in the figure below from the paper?
[screenshot of the figure from the paper]
I trained with 4 A100 GPUs: with batch_size == 2 I hit an OOM error, while with batch_size == 1 training works fine. What could be the reason?

Module-not-found problems when running graphgpt_eval.sh

import torch
import os
import sys
sys.path.append(u'/home/fry/桌面/GraphGPT/graphgpt')
from conversation import conv_templates, SeparatorStyle
from transformers import CLIPVisionModel, CLIPImageProcessor, StoppingCriteria
from model import *
To find the modules in the parent directory I used sys.path as above, but after modifying the imports and running again, other module-not-found errors appeared. There are many files involved and it is getting complicated, and I'm not sure how to handle it. Previously, putting graphgpt_eval.sh and run_graphgpt.py in the root directory allowed things to run, but after a while the module-not-found problems still appeared. The files are scattered across many locations and I don't know how to solve this now; I would appreciate your guidance!
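
One variant I am considering (a sketch, assuming the repository root is the directory that contains the graphgpt package and that the package layout matches the imports above) is to put the repository root, rather than the package directory, on sys.path and use package-qualified imports:

import sys
sys.path.insert(0, "/home/fry/桌面/GraphGPT")  # repo root, not .../GraphGPT/graphgpt
from graphgpt.conversation import conv_templates, SeparatorStyle
from graphgpt.model import *

Would that be the intended way to run these scripts, or is there a recommended working directory?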
