
gcc's Introduction





GCC: Graph Contrastive Coding for Graph Neural Network Pre-Training

The original implementation of the paper GCC: Graph Contrastive Coding for Graph Neural Network Pre-Training.

GCC is a contrastive learning framework for unsupervised, structural pre-training of graph representations. It achieves state-of-the-art results on 10 datasets across 3 graph mining tasks.
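
At its core, GCC treats two subgraph instances sampled from the same ego network as a positive pair and contrasts them against other subgraphs with an InfoNCE objective. A minimal, illustrative sketch of such a loss (not the repo's exact implementation):

```python
import torch
import torch.nn.functional as F

def info_nce(q, k, neg, t=0.07):
    """Minimal InfoNCE sketch: q and k are (B, D) embeddings of positive
    subgraph pairs, neg is a (K, D) bank of negative embeddings, and t is
    the temperature (0.07 is the repo's default nce_t)."""
    q, k, neg = (F.normalize(x, dim=1) for x in (q, k, neg))
    l_pos = (q * k).sum(dim=1, keepdim=True)   # (B, 1) positive logits
    l_neg = q @ neg.t()                        # (B, K) negative logits
    logits = torch.cat([l_pos, l_neg], dim=1) / t
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)     # positives sit at index 0
```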

Installation

Requirements

Quick Start

Pretraining

Pre-training datasets

python scripts/download.py --url https://drive.google.com/open?id=1JCHm39rf7HAJSp-1755wa32ToHCn2Twz --path data --fname small.bin
# For regions where Google is not accessible, use
# python scripts/download.py --url https://cloud.tsinghua.edu.cn/f/b37eed70207c468ba367/?dl=1 --path data --fname small.bin

E2E

Pretrain E2E with K = 255 (in E2E mode the negatives come from the other samples in the same batch, so K = batch size - 1):

bash scripts/pretrain.sh <gpu> --batch-size 256

MoCo

Pretrain MoCo with K = 16384; m = 0.999:

bash scripts/pretrain.sh <gpu> --moco --nce-k 16384
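
In MoCo mode, negatives come from a queue of size K, and the key encoder is a slowly moving average of the query encoder. A schematic of the momentum update with m = 0.999 (illustrative; the parameter pairing is assumed, not the repo's exact code):

```python
import torch

@torch.no_grad()
def momentum_update(encoder_q, encoder_k, m=0.999):
    # The key encoder tracks the query encoder via an exponential moving
    # average; with m = 0.999 it drifts slowly, keeping new keys consistent
    # with the negatives already sitting in the queue.
    for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1 - m)
```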

Download Pretrained Models

Instead of pretraining from scratch, you can download our pretrained models.

python scripts/download.py --url https://drive.google.com/open?id=1lYW_idy9PwSdPEC7j9IH5I5Hc7Qv-22- --path saved --fname pretrained.tar.gz
# For regions where Google is not accessible, use
# python scripts/download.py --url https://cloud.tsinghua.edu.cn/f/cabec37002a9446d9b20/?dl=1 --path saved --fname pretrained.tar.gz

Downstream Tasks

Downstream datasets

python scripts/download.py --url https://drive.google.com/open?id=12kmPV3XjVufxbIVNx5BQr-CFM9SmaFvM --path data --fname downstream.tar.gz
# For regions where Google is not accessible, use
# python scripts/download.py --url https://cloud.tsinghua.edu.cn/f/2535437e896c4b73b6bb/?dl=1 --path data --fname downstream.tar.gz

Generate embeddings on multiple datasets with

bash scripts/generate.sh <gpu> <load_path> <dataset_1> <dataset_2> ...

For example:

bash scripts/generate.sh 0 saved/Pretrain_moco_True_dgl_gin_layer_5_lr_0.005_decay_1e-05_bsz_32_hid_64_samples_2000_nce_t_0.07_nce_k_16384_rw_hops_256_restart_prob_0.8_aug_1st_ft_False_deg_16_pos_32_momentum_0.999/current.pth usa_airport kdd imdb-binary

Node Classification

Unsupervised (Table 2 freeze)

Run baselines on multiple datasets with bash scripts/node_classification/baseline.sh <hidden_size> <baseline:prone/graphwave> usa_airport h-index.

Evaluate GCC on multiple datasets:

bash scripts/generate.sh <gpu> <load_path> usa_airport h-index
bash scripts/node_classification/ours.sh <load_path> <hidden_size> usa_airport h-index
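
Conceptually, the freeze protocol fits a light classifier on top of the fixed GCC embeddings. A self-contained sketch with synthetic stand-ins for the generated embeddings (shapes and classifier choice are illustrative, not the repo's exact script):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-ins for frozen GCC node embeddings and labels; in the
# real pipeline these come from generate.sh and the downstream dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(1190, 64))    # e.g., hidden_size-64 embeddings
y = rng.integers(0, 4, size=1190)  # e.g., 4 node classes

clf = LogisticRegression(max_iter=1000)
print(cross_val_score(clf, X, y, cv=10).mean())
```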

Supervised (Table 2 full)

Finetune GCC on multiple datasets:

bash scripts/finetune.sh <load_path> <gpu> usa_airport

Note this finetunes the whole network and will take much longer than the frozen experiments above.

Graph Classification

Unsupervised (Table 3 freeze)

bash scripts/generate.sh <gpu> <load_path> imdb-binary imdb-multi collab rdt-b rdt-5k
bash scripts/graph_classification/ours.sh <load_path> <hidden_size> imdb-binary imdb-multi collab rdt-b rdt-5k

Supervised (Table 3 full)

bash scripts/finetune.sh <load_path> <gpu> imdb-binary

Similarity Search (Table 4)

Run baseline (graphwave) on multiple datasets with bash scripts/similarity_search/baseline.sh <hidden_size> graphwave kdd_icdm sigir_cikm sigmod_icde.

Run GCC:

bash scripts/generate.sh <gpu> <load_path> kdd icdm sigir cikm sigmod icde
bash scripts/similarity_search/ours.sh <load_path> <hidden_size> kdd_icdm sigir_cikm sigmod_icde
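
Conceptually, similarity search ranks cross-graph node pairs by embedding similarity. A self-contained sketch with synthetic stand-ins for the two paired graphs' embeddings (the real inputs would come from generate.sh):

```python
import numpy as np

# Synthetic stand-ins for the node embeddings of two paired graphs
# (e.g., the KDD and ICDM co-author graphs).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 64))  # query-graph node embeddings
Y = rng.normal(size=(100, 64))  # candidate-graph node embeddings

Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
ranks = np.argsort(-(Xn @ Yn.T), axis=1)  # per query node, best match first
```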

❗ Common Issues

"XXX file not found" when running pretraining/downstream tasks.
Please make sure you've downloaded the pretraining dataset or downstream task datasets according to GETTING_STARTED.md.
Server crashes/hangs after launching pretraining experiments.
In addition to the GPU, our pretraining stage requires substantial CPU and RAM. If this happens, it usually means CPU/RAM is exhausted on your machine. You can decrease `--num-workers` (the number of dataloader workers using CPU) and `--num-copies` (the number of dataset copies residing in RAM). For the lowest profile, try `--num-workers 1 --num-copies 1`.

If this still fails, please upgrade your machine :). In the meantime, you can still download our pretrained model and evaluate it on downstream tasks.

Having difficulty installing RDKit.
See the P.S. section in this post (a comment under issue #12).

Citing GCC

If you use GCC in your research or wish to refer to the baseline results, please use the following BibTeX.

@article{qiu2020gcc,
  title={GCC: Graph Contrastive Coding for Graph Neural Network Pre-Training},
  author={Qiu, Jiezhong and Chen, Qibin and Dong, Yuxiao and Zhang, Jing and Yang, Hongxia and Ding, Ming and Wang, Kuansan and Tang, Jie},
  journal={arXiv preprint arXiv:2006.09963},
  year={2020}
}

Acknowledgements

Part of this code is inspired by Yonglong Tian et al.'s CMC: Contrastive Multiview Coding.

gcc's People

Contributors

ericdongyx, hobbitlong, meyerjo, qibinc, xptree


gcc's Issues

it seems dgl-0.4.1 cannot work properly

When running the script, you will get the following error:

munmap_chunk(): invalid pointer
scripts/pretrain.sh: line 10: 17039 Aborted (core dumped) python train.py --exp Pretrain --model-path saved --tb-path tensorboard --gpu $gpu $ARGS

Updating dgl to 0.4.3 fixes it.

Please help resolve a problem when running the code on CUDA 11.1 with DGL 0.7

dgl._ffi.base.DGLError: [15:08:17] /opt/dgl/include/dgl/packed_func_ext.h:117: Check failed: ObjectTypeChecker::Check(sptr.get()): Expected type graph.Graph but get graph.HeteroGraph
Stack trace:

How can I update this code to DGL 0.7.x on CUDA 11?

dgl.contrib.sampling.random_walk_with_restart and dgl.contrib.sampling.random_walk cannot work in dgl-cu11; they need to be replaced by dgl.sampling.random_walk, but the parameters are different. How can I update this code to DGL 0.7.x on CUDA 11?

my email: [email protected]
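
For reference, a hedged sketch of the newer API (note the semantics differ: in dgl.sampling.random_walk, restart_prob terminates a trace early and pads it with -1 rather than jumping back to the seed as the old contrib sampler did):

```python
import dgl
import torch

# Toy graph; in DGL >= 0.5 the dgl.contrib samplers are gone and
# dgl.sampling.random_walk is the closest replacement.
g = dgl.graph((torch.tensor([0, 1, 2]), torch.tensor([1, 2, 0])))
traces, _ = dgl.sampling.random_walk(
    g, nodes=torch.tensor([0, 0]), length=8, restart_prob=0.8
)
print(traces)  # positions after an early stop are padded with -1
```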

Questions about pretraining subgraphs

Hi,

May I ask some questions about the pretraining subgraphs?

  1. Why do you apply a (** 0.75 ) operation to the individual node degrees? What is the benefit of this?

    degrees = torch.cat([g.in_degrees().double() ** 0.75 for g in self.graphs])

  2. Here the "replace" option is set to True.

    self.length, size=self.num_samples, replace=True, p=prob.numpy()

    I believe it is likely that some nodes would be sampled twice or more, which might harm the contrastive training process. For example, if node v is sampled twice, then it would have two query-key pairs, (g_1, g_2) and (g_3, g_4), for v. In contrastive training, (g_1, g_2) is regarded as a positive pair, while (g_1, g_3) is considered negative, even though all four subgraphs, i.e., g_1 to g_4, are sampled from the ego-graph of node v. Would it be better to set this option to False? Or did I misunderstand something about the contrastive training process?

  3. Why is there a max(self.rw_hops, ....) operation? What is the disadvantage of just using the preset self.rw_hops for each node? Moreover, why is there also a (** 0.75) operation?

    max_nodes_per_seed = max(

Thank you very much!
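
For context on question 1: raising degrees to the power 0.75 is the same smoothing that word2vec applies to its unigram negative-sampling distribution; it damps the dominance of high-degree nodes while still sampling them more often than low-degree ones. An illustrative sketch:

```python
import torch

# Degree-proportional sampling, smoothed with the 0.75 exponent:
# 100x the degree no longer means 100x the sampling probability.
degrees = torch.tensor([1.0, 2.0, 10.0, 100.0])
prob = degrees ** 0.75
prob = prob / prob.sum()  # smoothed distribution over nodes
idx = torch.multinomial(prob, num_samples=8, replacement=True)
print(prob, idx)
```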

x2dgl.py

Could you provide the five pre-training datasets mentioned in x2dgl.py (e.g., kdd17)? Also, is it possible to use other datasets and build the .bin files with the x2dgl.py script?

Running experiments completely on CPU

Hi, thanks for the inspiring work of GCC.

I wonder if there is a way to run GCC pretraining on CPU only. As far as I can tell, there is no easy way (such as specifying an option in the arguments) to do this.

P.S. Installation of RDKit is really painful for servers inside mainland China. Going through the code, I found that there is no need to install RDKit: all one has to do is copy DGL's GAT and GCN layer code into models/gat.py and models/gcn.py respectively. Please kindly correct me if I am wrong.

Any help is highly appreciated : )

torch.utils.data.IterableDataset

Hi:
In graph_dataset.py, class LoadBalanceGraphDataset(torch.utils.data.IterableDataset) has self.num_samples defaulting to 2000. I want to understand the relation between this variable and the DataLoader's batch_size. In my own experiment, my Dataset class is an IterableDataset, and when self.num_samples is not a multiple of train_loader.batch_size, things go wrong.
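
For illustration, a minimal IterableDataset showing the usual failure mode (hypothetical, not the repo's class): with the default single-process loader, a non-multiple num_samples only yields a smaller final batch; with num_workers > 0, each worker replays the full stream unless __iter__ shards it explicitly, which changes the effective sample count.

```python
import torch
from torch.utils.data import DataLoader, IterableDataset

class Stream(IterableDataset):
    """Minimal stream of num_samples items (illustrative)."""
    def __init__(self, num_samples):
        self.num_samples = num_samples

    def __iter__(self):
        # Without per-worker sharding, every worker yields all items.
        return (torch.tensor(i) for i in range(self.num_samples))

loader = DataLoader(Stream(2000), batch_size=32)
print(sum(1 for _ in loader))  # 63 batches; the last has 2000 % 32 = 16 items
```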

Possibly redundant BatchNorm layer?

@xptree Thank you for the code. It seems that there are two cascaded BatchNorm layers in each GIN layer. I am wondering whether one of them is redundant.

Specifically, in the UnsupervisedGIN class, BNs are instantiated (use_selayer is False, as in the code) and called during forward; meanwhile, in the ApplyNodeFunc class, a BN is again instantiated and called during forward (use_selayer is False, as in the code). (Screenshots of the relevant code were attached to the original issue.)

So in each layer, there are two cascaded BNs, between which there is only a ReLU activation.

As a novice in GNNs, I have not seen such an implementation (cascaded BNs) elsewhere. Could you please explain why you did this? Does this implementation lead to better performance than keeping only one BN in each layer?

Thank you!
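
For reference, a schematic of the pattern the issue describes (illustrative modules, not the repo's exact classes, and with aggregation omitted): the apply-node function ends in BatchNorm + ReLU, and the outer GIN loop applies another BatchNorm immediately after, so two BNs are separated only by a ReLU.

```python
import torch
import torch.nn as nn

hidden = 64
apply_node_func = nn.Sequential(   # stands in for ApplyNodeFunc
    nn.Linear(hidden, hidden),     # stands in for the MLP
    nn.BatchNorm1d(hidden),        # first BN
    nn.ReLU(),
)
layer = nn.Sequential(             # stands in for one outer GIN step
    apply_node_func,
    nn.BatchNorm1d(hidden),        # second, cascaded BN
    nn.ReLU(),
)
out = layer(torch.randn(8, hidden))
```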

How to make data?

Hi:
I want to learn how to make experimental data like the downloaded 'dgl.bin' file.

finetune

Hi:
When I use the pre-trained model to finetune on the downstream graph classification datasets RDT-B and RDT-M, I run into a problem: during finetuning, the train accuracy is very high, sometimes close to 1, while the test accuracy is very low, which looks like overfitting. Have you encountered this before, and if so, how did you deal with it?
Thanks.

Can't do node_classification tasks on panther datasets

When I use the cikm dataset in the panther directory for the downstream node_classification task, data_utils' class SSSingleDataset sets self.data = Data(x=None, edge_index=edge_index, y=None), while graph_dataset.py line 434 does self.num_classes = self.data.y.shape[1]. Since this self.data is a Data object whose y is None, the code fails.

About downstream datasets

Hello, I want to run the code on the Cora and Citeseer datasets, but I found no downstream datasets with those names.
So, could you please provide the code for generating downstream datasets, or would you mind sharing the Cora and Citeseer files you have generated?
Thanks a million!

[DGLError] dgl._ffi.base.DGLError: Check failed: fs: Filename is invalid

Run pre-training example fails.

When I run this code, there are some troubles with it.

I don't have any idea what happened. Could you help me? Thanks a lot.

paper's name: GCC: Graph Contrastive Coding for Graph Neural Network Pre-Training
paper:https://arxiv.org/abs/2006.09963
code:https://github.com/THUDM/GCC

Error

(GCC01) chloe@chloe-MS-7A74:~/Documents/00 Work/02 Xovee/01 Code/GCC$ bash scripts/pretrain.sh 0 --batch-size 256
Using backend: pytorch
Namespace(alpha=0.999, aug='1st', batch_size=256, beta1=0.9, beta2=0.999, clip_norm=1.0, cv=False, dataset='dgl', degree_embedding_size=16, epochs=100, exp='Pretrain', finetune=False, fold_idx=0, freq_embedding_size=16, gpu=0, hidden_size=64, learning_rate=0.005, load_path=None, lr_decay_epochs=[120, 160, 200], lr_decay_rate=0.0, max_degree=512, max_edge_freq=16, max_node_freq=16, moco=False, model='gin', model_folder='saved/Pretrain_moco_False_dgl_gin_layer_5_lr_0.005_decay_1e-05_bsz_256_hid_64_samples_2000_nce_t_0.07_nce_k_32_rw_hops_256_restart_prob_0.8_aug_1st_ft_False_deg_16_pos_32_momentum_0.999', model_name='Pretrain_moco_False_dgl_gin_layer_5_lr_0.005_decay_1e-05_bsz_256_hid_64_samples_2000_nce_t_0.07_nce_k_32_rw_hops_256_restart_prob_0.8_aug_1st_ft_False_deg_16_pos_32_momentum_0.999', model_path='saved', momentum=0.9, nce_k=32, nce_t=0.07, norm=True, num_copies=6, num_layer=5, num_samples=2000, num_workers=12, optimizer='adam', positional_embedding_size=32, print_freq=10, readout='avg', restart_prob=0.8, resume='', rw_hops=256, save_freq=1, seed=0, set2set_iter=6, set2set_lstm_layer=3, subgraph_size=128, tb_folder='tensorboard/Pretrain_moco_False_dgl_gin_layer_5_lr_0.005_decay_1e-05_bsz_256_hid_64_samples_2000_nce_t_0.07_nce_k_32_rw_hops_256_restart_prob_0.8_aug_1st_ft_False_deg_16_pos_32_momentum_0.999', tb_freq=250, tb_path='tensorboard', weight_decay=1e-05)
Use GPU: 0 for training
setting random seeds
before construct dataset 6.249996185302734
Traceback (most recent call last):
File "train.py", line 818, in
main(args)
File "train.py", line 555, in main
num_copies=args.num_copies
File "/home/chloe/Documents/00 Work/02 Xovee/01 Code/GCC/gcc/datasets/graph_dataset.py", line 58, in init
graph_sizes = dgl.data.utils.load_labels(dgl_graphs_file)[
File "/home/chloe/anaconda3/envs/GCC01/lib/python3.7/site-packages/dgl/data/graph_serialize.py", line 172, in load_labels
metadata = _CAPI_DGLLoadGraphs(filename, [], True)
File "dgl/_ffi/_cython/./function.pxi", line 287, in dgl._ffi._cy3.core.FunctionBase.call
File "dgl/_ffi/_cython/./function.pxi", line 222, in dgl._ffi._cy3.core.FuncCall
File "dgl/_ffi/_cython/./function.pxi", line 211, in dgl._ffi._cy3.core.FuncCall3
File "dgl/_ffi/_cython/./base.pxi", line 155, in dgl._ffi._cy3.core.CALL
dgl._ffi.base.DGLError: [21:41:01] /opt/dgl/src/graph/graph_serialize.cc:193: Check failed: fs: Filename is invalid
Stack trace:
[bt] (0) /home/chloe/anaconda3/envs/GCC01/lib/python3.7/site-packages/dgl/libdgl.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x22) [0x7fbe9e1ec782]
[bt] (1) /home/chloe/anaconda3/envs/GCC01/lib/python3.7/site-packages/dgl/libdgl.so(dgl::serialize::LoadDGLGraphs(std::string const&, std::vector<unsigned long, std::allocator >, bool)+0xe7c) [0x7fbe9e859a5c]
[bt] (2) /home/chloe/anaconda3/envs/GCC01/lib/python3.7/site-packages/dgl/libdgl.so(+0xd1f0eb) [0x7fbe9e85a0eb]
[bt] (3) /home/chloe/anaconda3/envs/GCC01/lib/python3.7/site-packages/dgl/libdgl.so(DGLFuncCall+0x52) [0x7fbe9e7f46e2]
[bt] (4) /home/chloe/anaconda3/envs/GCC01/lib/python3.7/site-packages/dgl/_ffi/_cy3/core.cpython-37m-x86_64-linux-gnu.so(+0x19cdb) [0x7fbef63a5cdb]
[bt] (5) /home/chloe/anaconda3/envs/GCC01/lib/python3.7/site-packages/dgl/_ffi/_cy3/core.cpython-37m-x86_64-linux-gnu.so(+0x1a25b) [0x7fbef63a625b]
[bt] (6) python(_PyObject_FastCallKeywords+0x48b) [0x55db603c900b]
[bt] (7) python(_PyEval_EvalFrameDefault+0x49b6) [0x55db6042d186]
[bt] (8) python(_PyFunction_FastCallKeywords+0xfb) [0x55db603c120b]

Environment

scikit-learn==0.20.3
scipy==1.4.1
coverage==4.5.4
coveralls==1.9.2
black==19.3b0
pytest==5.3.2
networkx==2.3
numpy==1.18.2
matplotlib==3.1.0
seaborn==0.9.0
tqdm==4.43.0
tensorboard_logger==0.1.0

torch~=1.5.1
dgl~=0.4.3.post2
pandas~=1.0.5
requests~=2.24.0
psutil~=5.7.2
joblib~=0.16.0

Python:3.7
PyTorch 1.5.1
DGL 0.4.1
rdkit=2019.09.2

Reproduce result of usa-airports

Hi, thank you for releasing your code. I am currently trying to reproduce the node classification result on the US-Airport dataset, but I can't get accuracy as high as 68.3%. Are there any techniques I can use to get higher accuracy? Thanks!

Can't run this code on low-end equipment

Hi, thank you for releasing your code. I am currently trying to reproduce the E2E pre-training experiment.

However, I can't run this code because my PC's configuration is too low.

Are there any methods to deal with this?

Maybe lowering the parameters of the experiment? I have tried many times to lower the batch size to 16, but it still doesn't work. I have no idea how to deal with it.

How can I run this code with the following equipment?

I am looking forward to your reply. Thank you very much.

My PC's system parameters were attached as a screenshot in the original issue.

An error when finetuning on graph classification dataset

Hi,

I encountered an error when finetuning on the imdb-binary dataset.

The running command is

bash scripts/finetune.sh saved/Pretrain_moco_True_dgl_gin_layer_5_lr_0.005_decay_1e-05_bsz_32_hid_64_samples_2000_nce_t_0.07_nce_k_163841_rw_hops_256_restart_prob_0.8_aug_1st_ft_False_deg_16_pos_32_momentum_0.999/ 0 imdb-binary

The error message is
AttributeError: 'GraphClassificationDatasetLabeled' object has no attribute 'dataset'

(A screenshot of the full traceback was attached to the original issue.)

Thank you!

Regarding the choice of subgraph augmentation method

Hello, I am trying to reproduce your code. I want to change the subgraph augmentation method to neighbor sampling. However, when I changed the 'aug' parameter in graph_dataset to 'ns', the pre-training process got stuck and the memory usage was abnormally high.
