
gSpan

For the Chinese readme, see README-Chinese.

gSpan is an algorithm for mining frequent subgraphs.

This program implements gSpan in Python. The repository is hosted on GitHub at https://github.com/betterenvi/gSpan. This implementation borrows some ideas from gboost.

Undirected Graphs

This program supports undirected graphs and produces the same results as gboost on the dataset graphdata/graph.data.

Directed Graphs

So far (as of 2016-10-29), gboost does not support directed graphs. This program implements gSpan for directed graphs. More specifically, it can mine frequent directed subgraphs that contain at least one node that can reach all other nodes in the subgraph. However, correctness is not guaranteed, since the author has not done thorough testing. After several runs on the datasets graphdata/graph.data.directed.1 and graph.data.simple.5, no faults were found.
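The reachability condition above (at least one node that can reach all others) can be checked with a plain DFS. The following is an illustrative sketch, not code from this repository; the function names and the adjacency-dict representation are made up for this example:

```python
def reaches_all(adj, start):
    """Return True if `start` can reach every vertex of the directed graph.

    `adj` maps each vertex id to a list of its out-neighbors and must
    contain an entry for every vertex.
    """
    seen = {start}
    stack = [start]
    while stack:
        v = stack.pop()
        for w in adj.get(v, ()):
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == len(adj)


def has_root(adj):
    """True if some vertex reaches all others, i.e. the subgraph qualifies."""
    return any(reaches_all(adj, v) for v in adj)
```

For example, `has_root({0: [1, 2], 1: [], 2: []})` is true (vertex 0 reaches both others), while `{0: [1], 1: [], 2: []}` has no such vertex.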

How to install

This program supports both Python 2 and Python 3.

Method 1

Install this project using pip:

pip install gspan-mining
Method 2

First, clone the project:

git clone https://github.com/betterenvi/gSpan.git
cd gSpan

You can optionally install this project as a third-party library so that it can be run from any path.

python setup.py install

How to run

The command is:

python -m gspan_mining [-s min_support] [-n num_graph] [-l min_num_vertices] [-u max_num_vertices] [-d True/False] [-v True/False] [-p True/False] [-w True/False] [-h] database_file_name 
Some examples
  • Read graph data from ./graphdata/graph.data and mine undirected subgraphs with minimum support 5000
python -m gspan_mining -s 5000 ./graphdata/graph.data
  • Read graph data from ./graphdata/graph.data, mine undirected subgraphs with minimum support 5000, and visualize the frequent subgraphs (matplotlib and networkx are required)
python -m gspan_mining -s 5000 -p True ./graphdata/graph.data
  • Read graph data from ./graphdata/graph.data and mine directed subgraphs with minimum support 5000
python -m gspan_mining -s 5000 -d True ./graphdata/graph.data
  • Print help info
python -m gspan_mining -h

The author also wrote example code in a Jupyter notebook, with mining results and visualizations. For details, please refer to main.ipynb.

Running time

  • Environment

    • OS: Windows 10
    • Python version: Python 2.7.12
    • Processor: Intel(R) Core(TM) i7-4790 CPU @ 3.60 GHz
    • Ram: 8.00 GB
  • Running time: on the dataset ./graphdata/graph.data, running times are listed below:

Min support   Number of frequent subgraphs   Time
5000          26                             51.48 s
3000          52                             69.07 s
1000          455                            3 m 49 s
600           1235                           7 m 29 s
400           2710                           12 m 53 s

Reference

gSpan: Graph-Based Substructure Pattern Mining, by X. Yan and J. Han. Proc. 2002 Int. Conf. on Data Mining (ICDM'02).

One C++ implementation of gSpan.


gspan's Issues

Only trees output for directed graphs

Thank you for providing this gSpan implementation.
For me, it is incredibly fast. Unfortunately, in the result, there are only paths and trees, but no cyclic graphs.
Is this a bug?

problem with output

Hello!

When I run the test dataset I do not have any issues. However, when I run my own dataset, no output is produced.

My tables look like this:

t # 4 
v 0 MAP3K1 
v 1 ASXL1 
v 2 JUN 
v 3 RICTOR 
v 4 GSK3B 
v 5 IKZF1 
v 6 NFKBIA 
v 7 TSC1 
e 0 4 4
e 1 5 4
e 2 6 4
e 3 7 4

My output looks like this:

$ python3 -m gspan_mining -s 5000 db_g_labeled.data 
Read:	0.01 s
Mine:	0.03 s
Total:	0.04 s
<gspan_mining.gspan.gSpan object at 0x10ca3c7b8>

Also, I was wondering if I can run graphs with unlabeled edges.

Thank you for your help!
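As a self-contained illustration of the `t # / v / e` input format shown above, here is a minimal parser sketch. The function name and the returned dict layout are made up for this example and are not the repository's API:

```python
def read_graphs(lines):
    """Parse graphs of the form `t # <gid>`, `v <vid> <vlb>`, `e <frm> <to> <elb>`.

    Returns {gid: {'vertices': {vid: label}, 'edges': [(frm, to, label)]}}.
    A trailing `t # -1` line marks the end of the data.
    """
    graphs, gid = {}, None
    for line in lines:
        parts = line.split()
        if not parts:
            continue
        if parts[0] == 't':
            gid = int(parts[2])
            if gid == -1:          # end-of-file marker
                break
            graphs[gid] = {'vertices': {}, 'edges': []}
        elif parts[0] == 'v':
            graphs[gid]['vertices'][int(parts[1])] = parts[2]
        elif parts[0] == 'e':
            graphs[gid]['edges'].append((int(parts[1]), int(parts[2]), parts[3]))
    return graphs
```

Running it over the fragment above would yield one graph with id 4, eight labeled vertices, and four edges all labeled `4`.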

Time-consuming

Hi!
Nice work! But I ran into a problem: for a task of the same scale (5,000 graphs, min support = 500), the gboost program in MATLAB finishes in about 3 minutes, but this Python implementation has been running for more than an hour without finishing. What can I do?

how to get the vertices id from the original data

Hi @betterenvi, I have a question about how to get the vertex ids from the original data after getting the frequent subgraphs. You use `projected` to record where each frequent subgraph appears in the database, so I thought I could use it to get the vertices' frm and to ids, but the result only covers part of each subgraph, not all of its edges in the original graph. I can't figure out how to get the original vertex ids; I hope you can help, thanks!

if self._where:
    print('where: {}'.format(list(set([p.gid for p in projected]))))
    for p in projected:
        print('p:' + str(p.gid))
        print(p.edge.frm, p.edge.to, p.edge.elb)

t # 12
v 0
v 1 3
v 2 4
e 0 1 3
e 1 2 2
Support: 2
where: [0, 3]
p:0
3 5 2
p:0
9 5 2
p:3
3 5 2
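One possible explanation of the partial output above: each element of `projected` may represent only the *last* edge of an embedding, with earlier edges reachable through a back link. The sketch below assumes a PDFS-like node carrying `gid`, `edge`, and a `prev` pointer; these names mirror the attributes printed in the snippet above, but the shapes here are mock stand-ins, not verified against the repository's classes:

```python
from collections import namedtuple

# Assumed shapes, mirroring the attributes used in the printout above.
Edge = namedtuple('Edge', ['frm', 'to', 'elb'])
PDFS = namedtuple('PDFS', ['gid', 'edge', 'prev'])


def embedding_edges(pdfs):
    """Collect every original-graph edge of one embedding by walking `prev`."""
    edges = []
    while pdfs is not None:
        edges.append(pdfs.edge)
        pdfs = pdfs.prev
    return list(reversed(edges))


def embedding_vertices(pdfs):
    """Original-graph vertex ids touched by the embedding."""
    vids = set()
    for e in embedding_edges(pdfs):
        vids.update((e.frm, e.to))
    return vids
```

If the real implementation chains embeddings this way, walking the whole chain (rather than reading only `p.edge`) would recover every original vertex id of the subgraph occurrence.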

Directed graph not discovered

I would like to point out that some directed graphs are not being discovered by the algorithm. Input example:

t # 0
v 0 2
v 1 1
v 2 3
e 0 1 1
e 2 0 2
t # -1

With support = 1, the algorithm outputs the subgraphs with 2 vertices as frequent, but it seems not to generate the candidate with 3 vertices. Is there something I am missing? The following is the code's output:

t # 0
v 0 2
v 1 1
e 0 1 1

t # 1
v 0 3
v 1 2
e 0 1 2

Thanks in advance for your time.

A question about forward edges

In _get_forward_pure_edges and _get_forward_rmpath_edges, there are comparisons like

if min_vlb <= g.vertices[e.to].vlb

and

if (rm_edge.to == e.to or
        min_vlb > new_to_vlb or
        history.has_vertex(e.to)):
    continue

When comparing edges here, why are vertex labels compared rather than vertex ids? Shouldn't the rightmost path be determined by comparing the vertices' ids in the original graph?

Support for counting repeated substructure patterns in the same graph

Hi,

This is a nice implementation of gSpan, and I deeply appreciate the project and the effort you have put in. Specifically, I wish to revise the implementation a little to count repeated patterns within the same graph, rather than using the support threshold. Is there any part I should be cautious about? Or are there any shortcuts in the original implementation that would make this much easier? Thanks!

Kind regards,

Question about calculating support by considering pattern occurrences inside each graph

Hi, this work is great and very helpful, but I notice that the policy for calculating the support of a pattern is to count the pattern at most once inside each graph.

For example, if a dataset contains 2 graphs, t # 0 and t # 1, and a certain pattern occurs 3 times inside t # 0 and 4 times inside t # 1, the mining result reports the pattern with a support of 2, not 3 + 4 = 7, which is what I have been trying to obtain.

I looked through the code and found, at gspan.py line 314,

def _get_support(self, projected):
    return len(set([pdfs.gid for pdfs in projected]))

I think this function calculates the support of each pattern; since a set is used, only distinct graphs (gids) are counted, and multiple occurrences inside one graph are not considered.

To achieve the goal mentioned above, I changed pdfs.gid to pdfs.edge, supposing that counting distinct edges would give the real occurrence count of each pattern.

Now this part of the code looks like this:

def _get_support(self, projected):
    return len(set([pdfs.edge for pdfs in projected]))

However, after several tests on the datasets graph.data.simple.5 and graph.data.5, I compared the algorithm's result with my hand count and found that the algorithm's result is always 2 times larger than the real one (e.g., 5 times by hand but 10 by the algorithm). This is the command I used:

python main.py -s 5 ./graphdata/graph.data.5

So I think it is not about directed vs. undirected graphs, and I wonder if you could tell me whether I adjusted the wrong code or whether this goal can be realized at all.

Thank you very much.
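As a self-contained illustration of the two counting policies discussed in this issue (the `Emb` shape here is a mock stand-in, not the repository's class):

```python
from collections import namedtuple

# Mock embedding record: which graph it lives in, and its last edge.
Emb = namedtuple('Emb', ['gid', 'edge'])


def support_per_graph(projected):
    """Standard gSpan support: number of distinct graphs containing the pattern."""
    return len(set(p.gid for p in projected))


def support_occurrences(projected):
    """Count every listed embedding.

    Note: in undirected mining an embedding may be enumerated more than once
    (e.g. once per traversal direction), which could explain the consistent
    factor of 2 reported above -- that explanation is a guess, not verified
    against the repository's code.
    """
    return len(projected)
```

With three embeddings in graph 0 and one in graph 1, the first function returns 2 and the second returns 4, matching the "support of 2, not 3 + 4 = 7" behavior described above.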

node label is different than edge label in the original gspan algorithm

I wanted to bring to your attention a difference I observed between your implementation and a gSpan implementation in Java that I have been using as an alternative. In the Java implementation (https://github.com/nphdang/gSpan/tree/master/Data), node labels are treated differently from edge labels. However, in your code, I noticed that node labels and edge labels are used interchangeably.

I appreciate your work and am reaching out to understand if there's a specific reason for this difference in label handling. Your insights on this matter would be valuable.

Thank you for your time and consideration.

Directed Graph

I think I have stumbled on an issue with directed graphs. The algorithm seems to detect an edge that does not exist. Here's the file I am using: https://pastebin.com/ypsbcKuh
Parameters: '-s 1 -d True -l 3 -w True -p True graph.data'
It detects a pattern of depth 3 when my graphs are at most depth 2. The problem is that the nonexistent pattern is detected in several graphs, e.g. 3.

For example, the following graph is discovered and is claimed to exist in t_3:

t # 45
v 0 0,4
v 1 1,0
v 2 2,0
v 3 4,0
e 0 1 0
e 0 2 0
e 0 3 0

But now, look at t_3...

t # 3
v 0 0,1
v 1 4,0
v 2 0,4
v 3 2,0
v 4 1,0
v 5 2,0
v 6 0,2
v 7 0,2
e 0 1 0
e 2 3 0
e 2 4 0
e 2 5 0
e 2 1 0
e 6 5 0
e 6 1 0
e 7 3 0
e 7 1 0

A few questions after starting to use gSpan

1. Does it work with weighted graphs? If so, how should weights be defined in the data?
2. If a graph has only two nodes, networkx draws them overlapping; where should I change the code to fix this?

A question about modifying the code

Hi, my tests also found this problem: the first graph contains 2 isomorphic occurrences and the second contains 3, so together there should be 5, but running the program gives 2. Could you tell me how to modify the code to get 5?

strange characters

Hi,

In the readme file under graphdata folder, there are some strange characters as follows,

NOTICE: 
1.	All labels cannot be ¡°0¡± or ¡°1¡±,  and it should be larger than ¡°1¡±;
2.  Each data file or query file should end with ¡° t # -1¡±, otherwise it will lead to a bug.

Can you please take a look? Thanks.

'str' and 'Vertex' type comparison error

While trying with our own generated data, we encountered this error:

in _get_backward_edge
    if g.vertices[e1.frm].vlb < g.vertices[e2.to] or (
TypeError: '<' not supported between instances of 'str' and 'Vertex'

We fixed this line as:

if g.vertices[e1.frm].vlb < g.vertices[e2.to].vlb or (
       g.vertices[e1.frm].vlb == g.vertices[e2.to].vlb and
       e1.elb <= e.elb):
    return e

What could be the problem with this? Here is our test data:
graph.gspan.data.txt

We run with this script:

if __name__ == "__main__":
    args_str = '-s 5 -d True -l 1 -p False -w True graph.gspan.data.txt'
    FLAGS, _ = parser.parse_known_args(args=args_str.split())
    gs = main(FLAGS)

Add upper_bound for mined freq. subgraphs

Hi, first of all thank you for the great work! As I mentioned in an earlier issue, I used your package for the practical part of my Bachelor's thesis.

I have made some modifications to the program and want to share them so that other programmers can hopefully profit from my work.

Modification

As described in the gSpan paper, the experiments use an upper bound on the number of graphs generated. For some datasets I used, the process seemed endless, especially with low support. So I added the argument -mm <number> or --max_mining <number>, which stops the mining process once the count of generated features reaches the <number> passed with the argument.
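A minimal sketch of the early-stop idea described above. The `-mm`/`--max_mining` flag name comes from the comment; the `Miner` class here is a toy stand-in for illustration, not the repository's gSpan class:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('-mm', '--max_mining', type=int, default=None,
                    help='stop mining once this many frequent subgraphs are found')


class Miner:
    """Toy stand-in for a recursive pattern miner with an upper bound."""

    def __init__(self, max_mining=None):
        self.max_mining = max_mining
        self.found = []

    def _limit_reached(self):
        return (self.max_mining is not None
                and len(self.found) >= self.max_mining)

    def mine(self, candidates):
        for pattern in candidates:
            if self._limit_reached():   # the early exit added by --max_mining
                return
            self.found.append(pattern)
            # a real miner would recurse into extensions of `pattern` here
```

In the real code, the same check would guard the recursive subgraph-mining call so the search tree is abandoned as soon as the bound is hit.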

How to map the patterns in the input graph

Hi

Thanks for the code. I tried it and it works fine.

However, I have an issue. Once I have the patterns, how can I match or find them in the input graph? I read the paper about the algorithm and I know that vertices are relabeled. How can I use the patterns if I cannot match them in the input?
I've been stuck here for a couple of days and would really appreciate your help.

Thanks.

Citing the project

Hello,

I am writing this issue/question to ask how to cite this repository. Is there a preferred way to do this?

I am using the code from this repository in my Bachelor's thesis on structure mining. I am citing the paper you reference in the project's readme.

Thanks for your answer and best regards.
