snap-stanford / graphrnn

License: MIT License

Python 70.30% C++ 29.70%

graphrnn's Introduction

GraphRNN: Generating Realistic Graphs with Deep Auto-regressive Model

This repository is the official PyTorch implementation of GraphRNN, an auto-regressive generative model for graphs.

Jiaxuan You*, Rex Ying*, Xiang Ren, William L. Hamilton, Jure Leskovec, GraphRNN: Generating Realistic Graphs with Deep Auto-regressive Model (ICML 2018)

Installation

Install PyTorch following the instructions on the official website. The code has been tested with PyTorch versions 0.2.0 and 0.4.0.

conda install pytorch torchvision cuda90 -c pytorch

Then install the other dependencies.

pip install -r requirements.txt

Test run

python main.py

Code description

For the GraphRNN model: main.py is the main executable file, and specific arguments are set in args.py. train.py contains the training iterations and calls model.py and data.py. create_graphs.py is where we prepare the target graph datasets.

For baseline models:

B-A and E-R models are implemented in baselines/baseline_simple.py.
Kronecker graph model is implemented in the SNAP software, which can be found in https://github.com/snap-stanford/snap/tree/master/examples/krongen (for generating Kronecker graphs), and https://github.com/snap-stanford/snap/tree/master/examples/kronfit (for learning parameters for the model).
MMSB is implemented using the EDWARD library (http://edwardlib.org/), and is located in baselines.
We implemented the DeepGMG model based on the instructions of their paper in main_DeepGMG.py.
We implemented the GraphVAE model based on the instructions of their paper in baselines/graphvae.

Parameter setting: To adjust the hyper-parameters and input arguments of the model, modify the fields of args.py accordingly. For example, args.cuda controls which GPU is used to train the model, and args.graph_type specifies which dataset is used to train the generative model. See the documentation in args.py for more detailed descriptions of all fields.

Outputs

There are several different types of outputs, each saved into a different directory under a path prefix. The path prefix is set at args.dir_input. Suppose that this field is set to ./:

./graphs contains the pickle files of training, test and generated graphs. Each contains a list of networkx objects.
./eval_results contains the evaluation of MMD scores in txt format.
./model_save stores the model checkpoints.
./nll saves the log-likelihood for generated graphs as sequences.
./figures is used to save visualizations (see Visualization of graphs section).

Evaluation

The evaluation is done in evaluate.py, where the user can choose which settings to evaluate. To evaluate how close the generated graphs are to the ground truth set, we use MMD (maximum mean discrepancy) to calculate the divergence between two sets of distributions, one from the ground truth graphs and one from the generated graphs. Three types of distributions are compared: the degree distribution, the clustering coefficient distribution, and the orbit count distribution described below. The first two are implemented in eval/stats.py using the Python multiprocessing module. The evaluation can easily be extended to compute MMD for other graph distributions.
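As a rough illustration of the MMD idea, the sketch below computes a biased estimate of squared MMD between two sets of histograms. It assumes a plain Gaussian kernel and hand-written histograms; the kernel actually used in eval/stats.py may differ, and the function names here are hypothetical.

```python
import math


def gaussian_kernel(p, q, sigma=1.0):
    """Gaussian kernel between two equal-length histograms (assumed kernel)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(p, q))
    return math.exp(-squared_distance / (2 * sigma ** 2))


def mean_kernel(xs, ys, kernel):
    """Average kernel value over all pairs drawn from xs and ys."""
    total = sum(kernel(x, y) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(xs, ys, kernel=gaussian_kernel):
    """Biased estimate of MMD^2: E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)]."""
    return (mean_kernel(xs, xs, kernel)
            + mean_kernel(ys, ys, kernel)
            - 2 * mean_kernel(xs, ys, kernel))


# Toy degree histograms, one per graph (illustrative values only).
reference_histograms = [[0.5, 0.5, 0.0], [0.4, 0.6, 0.0]]
generated_histograms = [[0.0, 0.2, 0.8], [0.1, 0.1, 0.8]]
score = mmd_squared(reference_histograms, generated_histograms)
```

A score near zero suggests the two sets of histograms are similar under the chosen kernel; larger values indicate a bigger gap.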

We also compute the orbit counts for each graph, represented as a high-dimensional data point, using ORCA (see http://www.biolab.si/supp/orca/orca.html), which is located at eval/orca. We then compute the MMD between the two sets of sampled points. ORCA must first be compiled by running

g++ -O2 -std=c++11 -o orca orca.cpp

in the directory eval/orca. (The binary file already in the repo works on Ubuntu.)

To evaluate, run

python evaluate.py

Arguments specific to evaluation are specified in the class evaluate.Args_evaluate. Note that the field Args_evaluate.dataset_name_all must contain only datasets for which a model has already been trained, by setting args.graph_type to each of the datasets and running python main.py.

Visualization of graphs

The training, testing and generated graphs are saved at 'graphs/'. One can visualize the generated graphs using the function utils.load_graph_list, which loads the list of graphs from the pickle file, and utils.draw_graph_list, which plots the graphs using networkx.

Misc

Jesse Bettencourt and Harris Chan have made great slides introducing GraphRNN in Prof. David Duvenaud’s seminar course Learning Discrete Latent Structure.

graphrnn's Issues

GraphRNN with node feature

Firstly, thanks a lot for your excellent work, especially the comprehensive baselines. As you mention in appendix A.6, GraphRNN can also be used for node and edge feature generation. Has this part been implemented in this repository?

Why does the model pack_padded after linear transform?

Hello @JiaxuanYou. I have a question in model.py.

In the forward method of GRU_plain, why is pack_padded_sequence applied after the linear transform by self.input?

Don't the non-zero rows pick up information from the padded zeros?

Thanks!

import torch
import torch.nn as nn
import torch.nn.init as init
from torch.autograd import Variable
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence


class GRU_plain(nn.Module):
    def __init__(self, input_size, embedding_size, hidden_size, num_layers, has_input=True, has_output=False, output_size=None):
        super(GRU_plain, self).__init__()
        self.num_layers = num_layers
        self.hidden_size = hidden_size
        self.has_input = has_input
        self.has_output = has_output
        if has_input:
            # embed the raw input before feeding it to the GRU
            self.input = nn.Linear(input_size, embedding_size)
            self.rnn = nn.GRU(input_size=embedding_size, hidden_size=hidden_size, num_layers=num_layers,
                              batch_first=True)
        else:
            self.rnn = nn.GRU(input_size=input_size, hidden_size=hidden_size, num_layers=num_layers, batch_first=True)
        if has_output:
            self.output = nn.Sequential(
                nn.Linear(hidden_size, embedding_size),
                nn.ReLU(),
                nn.Linear(embedding_size, output_size)
            )

        self.relu = nn.ReLU()
        # initialize
        self.hidden = None  # need initialize before forward run

        for name, param in self.rnn.named_parameters():
            if 'bias' in name:
                nn.init.constant(param, 0.25)
            elif 'weight' in name:
                nn.init.xavier_uniform(param, gain=nn.init.calculate_gain('sigmoid'))
        for m in self.modules():
            if isinstance(m, nn.Linear):
                m.weight.data = init.xavier_uniform(m.weight.data, gain=nn.init.calculate_gain('relu'))

    def init_hidden(self, batch_size):
        return Variable(torch.zeros(self.num_layers, batch_size, self.hidden_size)).cuda()

    def forward(self, input_raw, pack=False, input_len=None):
        if self.has_input:
            input = self.input(input_raw)
            input = self.relu(input)
        else:
            input = input_raw
        if pack:
            # packing happens after the linear transform and ReLU
            input = pack_padded_sequence(input, input_len, batch_first=True)
        output_raw, self.hidden = self.rnn(input, self.hidden)
        if pack:
            output_raw = pad_packed_sequence(output_raw, batch_first=True)[0]
        if self.has_output:
            output_raw = self.output(output_raw)
        # return hidden state at each time step
        return output_raw

Trouble running evaluate.py

I am having trouble running evaluate.py after training the graphRNN. The issue seems to be the function load_graph_list in utils.py (starting at line 459):

# load a list of graphs
def load_graph_list(fname,is_real=True):

    for i in range(len(graph_list)):
        edges_with_selfloops = graph_list[i].selfloop_edges()
        if len(edges_with_selfloops)>0:
            graph_list[i].remove_edges_from(edges_with_selfloops)
        if is_real:
            graph_list[i] = max(nx.connected_component_subgraphs(graph_list[i]), key=len)
            graph_list[i] = nx.convert_node_labels_to_integers(graph_list[i])
        else:
            graph_list[i] = pick_connected_component_new(graph_list[i])
    return graph_list

This function calls the undefined list graph_list, and it does not load any graphs from file. I suspect that the code listed here was intended for a different function.
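For comparison, here is a minimal sketch of the missing loading step. It assumes the graph lists are stored with pickle, as the README's Outputs section describes; `save_pickled_list` and `load_pickled_list` are hypothetical names, not functions from utils.py, and plain dictionaries stand in for networkx graphs.

```python
import os
import pickle
import tempfile


def save_pickled_list(items, fname):
    """Write a list of objects to fname using pickle."""
    with open(fname, "wb") as f:
        pickle.dump(items, f)


def load_pickled_list(fname):
    """Read back a list of objects previously written with pickle."""
    with open(fname, "rb") as f:
        return pickle.load(f)


# Round trip with stand-in data (a real run would store networkx graphs).
demo_path = os.path.join(tempfile.mkdtemp(), "graphs_demo.dat")
save_pickled_list([{"edges": [(0, 1), (1, 2)]}], demo_path)
restored = load_pickled_list(demo_path)
```

With a step like this at the top of the function, graph_list would be defined before the loop that cleans each graph.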

Unable to run the test program

Hi @JiaxuanYou,
I tried to run this code but got the following error:
python main.py
Traceback (most recent call last):
...........
from utils import *
File "/XXXX/GraphGeneration/GraphRNN/utils.py", line 13, in <module>
import community
ImportError: No module named community

I solved this using the following commands:
pip install --upgrade --force-reinstall python-louvain
and
pip install community
I suggest adding the community module to the requirements file.

But even after this fix, I am unable to run the program. I get the following error:
........
File "XXX/GraphGeneration/GraphRNN/model.py", line 988
prob = x_prev @ x_last.permute(0,2,1)
^
SyntaxError: invalid syntax

Do you have any suggestions for fixing this error? I am using Mac 10.12.6 and Python 2.7.13 :: Anaconda custom (x86_64).

Thanks
Sumit
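A likely cause, assuming the interpreter really is Python 2.7 as reported: the `@` matrix-multiplication operator was added in Python 3.5 (PEP 465), so older interpreters reject it as a syntax error. The sketch below uses a toy `Mat` class, which is hypothetical and unrelated to the repository, to show that `@` dispatches to `__matmul__` on Python 3.5 or later.

```python
import sys


class Mat:
    """Tiny list-of-rows matrix used only to demonstrate the @ operator."""

    def __init__(self, rows):
        self.rows = rows

    def __matmul__(self, other):
        # Row-by-column products, the same contract as a matrix multiply.
        columns = list(zip(*other.rows))
        return Mat([[sum(a * b for a, b in zip(row, col)) for col in columns]
                    for row in self.rows])


# The @ syntax below only parses on Python 3.5 or later.
supports_matmul = sys.version_info >= (3, 5)
product = Mat([[1, 2], [3, 4]]) @ Mat([[0, 1], [1, 0]])
```

Running the repository under Python 3.5 or later should therefore avoid this particular SyntaxError.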

CUDA Error

Hello @JiaxuanYou,

When I ran python3 main.py, I got the following message, even after running all of the installation commands.

CUDA 1
File name prefix GraphRNN_RNN_grid_4_128_
graph_validate_len 199.5
graph_test_len 215.0
total graph num: 100, training set: 80
max number node: 361
max/min number edge: 684; 180
max previous node: 40
train and test graphs saved at:  ./graphs/GraphRNN_RNN_grid_4_128_test_0.dat
PERSONAL_PATH/GraphRNN/model.py:299: UserWarning: nn.init.xavier_uniform is now deprecated in favor of nn.init.xavier_uniform_.
  nn.init.xavier_uniform(param,gain=nn.init.calculate_gain('sigmoid'))
PERSONAL_PATH/GraphRNN/model.py:297: UserWarning: nn.init.constant is now deprecated in favor of nn.init.constant_.
  nn.init.constant(param, 0.25)
PERSONAL_PATH/GraphRNN/model.py:302: UserWarning: nn.init.xavier_uniform is now deprecated in favor of nn.init.xavier_uniform_.
  m.weight.data = init.xavier_uniform(m.weight.data, gain=nn.init.calculate_gain('relu'))
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1556653114079/work/aten/src/THC/THCGeneral.cpp line=51 error=38 : no CUDA-capable device is detected
Traceback (most recent call last):
  File "main.py", line 128, in <module>
    has_output=True, output_size=args.hidden_size_rnn_output).cuda()
  File "/home/ronneesley/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 265, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/home/ronneesley/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 193, in _apply
    module._apply(fn)
  File "/home/ronneesley/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 199, in _apply
    param.data = fn(param.data)
  File "/home/ronneesley/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 265, in <lambda>
    return self._apply(lambda t: t.cuda(device))
  File "/home/ronneesley/anaconda3/lib/python3.7/site-packages/torch/cuda/__init__.py", line 163, in _lazy_init
    torch._C._cuda_init()
RuntimeError: cuda runtime error (38) : no CUDA-capable device is detected at /opt/conda/conda-bld/pytorch_1556653114079/work/aten/src/THC/THCGeneral.cpp:51
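The error above means PyTorch found no usable GPU when `.cuda()` was called. A minimal sketch of a guard follows, assuming PyTorch may or may not be installed with CUDA support; `select_device` is a hypothetical helper, not part of the repository.

```python
def select_device(prefer_cuda=True):
    """Return "cuda" when a CUDA device is usable, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to place on a GPU.
        return "cpu"
    if prefer_cuda and torch.cuda.is_available():
        return "cuda"
    return "cpu"


device_name = select_device()
```

With PyTorch 0.4 or later, a module can then be moved with `model.to(device_name)` rather than an unconditional `.cuda()` call.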

questions in main.py

Hi, I have some questions about main.py.

Do you save the whole set of graphs as both the training data and the test data?

# To get train and test set, after loading you need to manually slice

save_graph_list(graphs, args.graph_save_path + args.fname_train + '0.dat')
save_graph_list(graphs, args.graph_save_path + args.fname_test + '0.dat')
print('train and test graphs saved at: ', args.graph_save_path + args.fname_test + '0.dat')
And should I assign args.max_prev_node here?

# dataset initialization

if 'nobfs' in args.note:
    print('nobfs')
    dataset = Graph_sequence_sampler_pytorch_nobfs(graphs_train, max_num_node=args.max_num_node)
    args.max_prev_node = args.max_num_node-1
if 'barabasi_noise' in args.graph_type:
    print('barabasi_noise')
    dataset = Graph_sequence_sampler_pytorch_canonical(graphs_train,max_prev_node=args.max_prev_node)
    args.max_prev_node = args.max_num_node - 1
else:
    dataset = Graph_sequence_sampler_pytorch(graphs_train,max_prev_node=args.max_prev_node,max_num_node=args.max_num_node)
sample_strategy = torch.utils.data.sampler.WeightedRandomSampler([1.0 / len(dataset) for i in range(len(dataset))],
                                                                 num_samples=args.batch_size*args.batch_ratio, replacement=True)

Otherwise I get an error like "TypeError: new(): argument 'size' must be tuple of ints, but found element of type NoneType at pos 2" here:

elif 'GraphRNN_RNN' in args.note:
    rnn = GRU_plain(input_size=args.max_prev_node, embedding_size=args.embedding_size_rnn,
                    hidden_size=args.hidden_size_rnn, num_layers=args.num_layers, has_input=True,
                    has_output=True, output_size=args.hidden_size_rnn_output)
    output = GRU_plain(input_size=1, embedding_size=args.embedding_size_rnn_output,
                       hidden_size=args.hidden_size_rnn_output, num_layers=args.num_layers, has_input=True,
                       has_output=True, output_size=1)

Looking forward to your reply =]

Why is output_dim = max_num_node * (max_num_node + 1) // 2 in the baselines/graphvae/model.py file?
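One plausible reading, offered as an assumption rather than the authors' stated reason: the adjacency matrix of an undirected graph is symmetric, so only its upper triangle (including the diagonal) needs to be predicted, and an n x n matrix has n(n+1)/2 such entries. The helper names below are hypothetical.

```python
def upper_triangle_size(n):
    """Number of entries on or above the diagonal of an n x n matrix."""
    return n * (n + 1) // 2


def upper_triangle_indices(n):
    """All (row, column) pairs with column >= row."""
    return [(i, j) for i in range(n) for j in range(i, n)]


# For 4 nodes: 4 diagonal entries plus 6 above the diagonal.
entries_for_four_nodes = upper_triangle_size(4)
```

Counting the index pairs directly gives the same number as the closed-form expression, which is the quantity that appears in output_dim.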

Covariate Adjustment + Single Graph + CPU usage

I know how to convert these torch tensors to CPU, but I am curious why the user does not have the choice to run without CUDA.

Also, I have a custom graph with some covariates, how can I train on these to compare to ERGM? Particularly, what would you recommend for using covariates for graphRNN generation?

I'm just having trouble locating where to input a single graph for this pipeline, as the paper said that single or multiple graphs can be used.

A question about the datasets

Hello, is your data used directly in the form processed by others? What should I do if I want to use my own data? Do you know how to convert the Cora dataset into the ind.cora.x etc. format?