
gSpan

For the Chinese readme, see README-Chinese.

gSpan is an algorithm for mining frequent subgraphs.

This program implements gSpan in Python. The repository is hosted on GitHub at https://github.com/betterenvi/gSpan. This implementation borrows some ideas from gboost.

Undirected Graphs

This program supports undirected graphs and produces the same results as gboost on the dataset graphdata/graph.data.

Directed Graphs

So far (as of 2016-10-29), gboost does not support directed graphs. This program implements gSpan for directed graphs. More specifically, it can mine frequent directed subgraphs that contain at least one node that can reach all other nodes in the subgraph. However, correctness is not guaranteed, since the author has not done thorough testing. After several runs on the datasets graphdata/graph.data.directed.1 and graph.data.simple.5, no faults were found.
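The reachability condition above (at least one node that can reach all others) can be checked with a plain DFS. The following is an illustrative sketch, not code from this repository; the function names and the adjacency-dict representation are made up for this example:

```python
def reaches_all(adj, start):
    """Return True if `start` can reach every vertex of the directed graph.

    `adj` maps each vertex id to a list of its out-neighbors and must
    contain an entry for every vertex.
    """
    seen = {start}
    stack = [start]
    while stack:
        v = stack.pop()
        for w in adj.get(v, ()):
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == len(adj)


def has_root(adj):
    """True if some vertex reaches all others, i.e. the subgraph qualifies."""
    return any(reaches_all(adj, v) for v in adj)
```

For example, `has_root({0: [1, 2], 1: [], 2: []})` is true (vertex 0 reaches both others), while `{0: [1], 1: [], 2: []}` has no such vertex.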

How to install

This program supports both Python 2 and Python 3.

Method 1

Install this project using pip:

pip install gspan-mining
Method 2

First, clone the project:

git clone https://github.com/betterenvi/gSpan.git
cd gSpan

You can optionally install this project as a third-party library so that it can be run from any path.

python setup.py install

How to run

The command is:

python -m gspan_mining [-s min_support] [-n num_graph] [-l min_num_vertices] [-u max_num_vertices] [-d True/False] [-v True/False] [-p True/False] [-w True/False] [-h] database_file_name 
Some examples
  • Read graph data from ./graphdata/graph.data and mine undirected subgraphs with minimum support 5000
python -m gspan_mining -s 5000 ./graphdata/graph.data
  • Read graph data from ./graphdata/graph.data, mine undirected subgraphs with minimum support 5000, and visualize the frequent subgraphs (matplotlib and networkx are required)
python -m gspan_mining -s 5000 -p True ./graphdata/graph.data
  • Read graph data from ./graphdata/graph.data and mine directed subgraphs with minimum support 5000
python -m gspan_mining -s 5000 -d True ./graphdata/graph.data
  • Print help info
python -m gspan_mining -h

The author also wrote example code in a Jupyter notebook, with mining results and visualizations. For details, please refer to main.ipynb.

Running time

  • Environment

    • OS: Windows 10
    • Python version: Python 2.7.12
    • Processor: Intel(R) Core(TM) i7-4790 CPU @ 3.60 GHz
    • Ram: 8.00 GB
  • Running time: on the dataset ./graphdata/graph.data, running times are listed below:

Min support   Number of frequent subgraphs   Time
5000          26                             51.48 s
3000          52                             69.07 s
1000          455                            3 m 49 s
600           1235                           7 m 29 s
400           2710                           12 m 53 s

Reference

gSpan: Graph-Based Substructure Pattern Mining, by X. Yan and J. Han. Proc. 2002 Int. Conf. on Data Mining (ICDM'02).

One C++ implementation of gSpan.


gspan's Issues

Only trees output for directed graphs

Thank you for providing this gSpan implementation.
For me, it is incredibly fast. Unfortunately, in the result, there are only paths and trees, but no cyclic graphs.
Is this a bug?

problem with output

Hello!

When I run the test dataset I do not have any issues. However, when I run my own dataset, no output is produced.

My tables look like this:

t # 4 
v 0 MAP3K1 
v 1 ASXL1 
v 2 JUN 
v 3 RICTOR 
v 4 GSK3B 
v 5 IKZF1 
v 6 NFKBIA 
v 7 TSC1 
e 0 4 4
e 1 5 4
e 2 6 4
e 3 7 4

My output looks like this:

$ python3 -m gspan_mining -s 5000 db_g_labeled.data 
Read:	0.01 s
Mine:	0.03 s
Total:	0.04 s
<gspan_mining.gspan.gSpan object at 0x10ca3c7b8>

Also, I was wondering if I can run graphs with unlabeled edges.

Thank you for your help!
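As a self-contained illustration of the `t # / v / e` input format shown above, here is a minimal parser sketch. The function name and the returned dict layout are made up for this example and are not the repository's API:

```python
def read_graphs(lines):
    """Parse graphs of the form `t # <gid>`, `v <vid> <vlb>`, `e <frm> <to> <elb>`.

    Returns {gid: {'vertices': {vid: label}, 'edges': [(frm, to, label)]}}.
    A trailing `t # -1` line marks the end of the data.
    """
    graphs, gid = {}, None
    for line in lines:
        parts = line.split()
        if not parts:
            continue
        if parts[0] == 't':
            gid = int(parts[2])
            if gid == -1:          # end-of-file marker
                break
            graphs[gid] = {'vertices': {}, 'edges': []}
        elif parts[0] == 'v':
            graphs[gid]['vertices'][int(parts[1])] = parts[2]
        elif parts[0] == 'e':
            graphs[gid]['edges'].append((int(parts[1]), int(parts[2]), parts[3]))
    return graphs
```

Running it over the fragment above would yield one graph with id 4, eight labeled vertices, and four edges all labeled `4`.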

Time-consuming

Hi!
Nice work! But I ran into a problem: for a task of the same scale (5,000 graphs, min support = 500), the gboost program in MATLAB finishes in about 3 minutes, but this Python implementation has been running for more than an hour without finishing. What can I do?

how to get the vertices id from the original data

Hi @betterenvi, I have a question about how to get the vertex ids from the original data after getting the frequent subgraphs. You use `projected` to record where each frequent subgraph appears in the database, so I thought I could use it to get the vertices' frm and to ids, but the result only covers part of each subgraph, not all of its edges in the original graph. I can't figure out how to get the original vertex ids; I hope you can help, thanks!

if self._where:
    print('where: {}'.format(list(set([p.gid for p in projected]))))
    for p in projected:
        print('p:' + str(p.gid))
        print(p.edge.frm, p.edge.to, p.edge.elb)

t # 12
v 0
v 1 3
v 2 4
e 0 1 3
e 1 2 2
Support: 2
where: [0, 3]
p:0
3 5 2
p:0
9 5 2
p:3
3 5 2
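One possible explanation of the partial output above: each element of `projected` may represent only the *last* edge of an embedding, with earlier edges reachable through a back link. The sketch below assumes a PDFS-like node carrying `gid`, `edge`, and a `prev` pointer; these names mirror the attributes printed in the snippet above, but the shapes here are mock stand-ins, not verified against the repository's classes:

```python
from collections import namedtuple

# Assumed shapes, mirroring the attributes used in the printout above.
Edge = namedtuple('Edge', ['frm', 'to', 'elb'])
PDFS = namedtuple('PDFS', ['gid', 'edge', 'prev'])


def embedding_edges(pdfs):
    """Collect every original-graph edge of one embedding by walking `prev`."""
    edges = []
    while pdfs is not None:
        edges.append(pdfs.edge)
        pdfs = pdfs.prev
    return list(reversed(edges))


def embedding_vertices(pdfs):
    """Original-graph vertex ids touched by the embedding."""
    vids = set()
    for e in embedding_edges(pdfs):
        vids.update((e.frm, e.to))
    return vids
```

If the real implementation chains embeddings this way, walking the whole chain (rather than reading only `p.edge`) would recover every original vertex id of the subgraph occurrence.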

Directed graph not discovered

I would like to point out that some directed graphs are not being discovered by the algorithm. Input example:

t # 0
v 0 2
v 1 1
v 2 3
e 0 1 1
e 2 0 2
t # -1

With support = 1, the algorithm outputs the subgraphs with 2 vertices as frequent, but it seems not to generate the candidate with 3 vertices. Is there something I am missing? The following is the code's output:

t # 0
v 0 2
v 1 1
e 0 1 1

t # 1
v 0 3
v 1 2
e 0 1 2

Thanks in advance for your time.

A question about forward edges

In _get_forward_pure_edges and _get_forward_rmpath_edges, there are comparisons like

if min_vlb <= g.vertices[e.to].vlb

and

if (rm_edge.to == e.to or
        min_vlb > new_to_vlb or
        history.has_vertex(e.to)):
    continue

When comparing edges here, why are vertex labels compared rather than vertex ids? Shouldn't the rightmost path be determined by comparing the vertices' ids in the original graph?

Support for counting repeated substructure patterns in the same graph

Hi,

This is a nice implementation of gSpan, and I deeply appreciate the project and the effort you have put in. Specifically, I wish to revise the implementation a little to count repeated patterns within the same graph, rather than using the support threshold. Is there any part I should be cautious about? Or are there any shortcuts in the original implementation that would make this much easier? Thanks!

Kind regards,

Question about calculating support by considering pattern occurrences inside each graph

Hi, this work is great and very helpful, but I notice that the policy for calculating the support of a pattern is to count the pattern at most once inside each graph.

For example, if a dataset contains 2 graphs, t # 0 and t # 1, and a certain pattern occurs 3 times inside t # 0 and 4 times inside t # 1, the mining result reports the pattern with a support of 2, not 3 + 4 = 7, which is what I have been trying to obtain.

I looked through the code and found, at gspan.py line 314,

def _get_support(self, projected):
    return len(set([pdfs.gid for pdfs in projected]))

I think this function calculates the support of each pattern; since a set is used, only distinct graphs (gids) are counted, and multiple occurrences inside one graph are not considered.

To achieve the goal mentioned above, I changed pdfs.gid to pdfs.edge, supposing that counting distinct edges would give the real occurrence count of each pattern.

Now this part of the code looks like this:

def _get_support(self, projected):
    return len(set([pdfs.edge for pdfs in projected]))

However, after several tests on the datasets graph.data.simple.5 and graph.data.5, I compared the algorithm's result with my hand count and found that the algorithm's result is always 2 times larger than the real one (e.g., 5 times by hand but 10 by the algorithm). This is the command I used:

python main.py -s 5 ./graphdata/graph.data.5

So I think it is not about directed vs. undirected graphs, and I wonder if you could tell me whether I adjusted the wrong code or whether this goal can be realized at all.

Thank you very much.
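As a self-contained illustration of the two counting policies discussed in this issue (the `Emb` shape here is a mock stand-in, not the repository's class):

```python
from collections import namedtuple

# Mock embedding record: which graph it lives in, and its last edge.
Emb = namedtuple('Emb', ['gid', 'edge'])


def support_per_graph(projected):
    """Standard gSpan support: number of distinct graphs containing the pattern."""
    return len(set(p.gid for p in projected))


def support_occurrences(projected):
    """Count every listed embedding.

    Note: in undirected mining an embedding may be enumerated more than once
    (e.g. once per traversal direction), which could explain the consistent
    factor of 2 reported above -- that explanation is a guess, not verified
    against the repository's code.
    """
    return len(projected)
```

With three embeddings in graph 0 and one in graph 1, the first function returns 2 and the second returns 4, matching the "support of 2, not 3 + 4 = 7" behavior described above.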

node label is different than edge label in the original gspan algorithm

I wanted to bring to your attention a difference I observed between your implementation and a gSpan implementation in Java that I have been using as an alternative. In the Java implementation (https://github.com/nphdang/gSpan/tree/master/Data), node labels are treated differently from edge labels. However, in your code, I noticed that node labels and edge labels are used interchangeably.

I appreciate your work and am reaching out to understand if there's a specific reason for this difference in label handling. Your insights on this matter would be valuable.

Thank you for your time and consideration.

Directed Graph

I think I have stumbled on an issue with directed graphs. The algorithm seems to detect an edge that does not exist. Here's the file I am using: https://pastebin.com/ypsbcKuh
Parameters: '-s 1 -d True -l 3 -w True -p True graph.data'
It detects a pattern of depth 3 when my graphs are at most depth 2. The problem is that the nonexistent pattern is detected in several graphs, e.g. 3.

For example, the following graph is discovered and is claimed to exist in t_3:

t # 45
v 0 0,4
v 1 1,0
v 2 2,0
v 3 4,0
e 0 1 0
e 0 2 0
e 0 3 0

But now, look at t_3...

t # 3
v 0 0,1
v 1 4,0
v 2 0,4
v 3 2,0
v 4 1,0
v 5 2,0
v 6 0,2
v 7 0,2
e 0 1 0
e 2 3 0
e 2 4 0
e 2 5 0
e 2 1 0
e 6 5 0
e 6 1 0
e 7 3 0
e 7 1 0

A few questions after starting to use gSpan

1. Does it work with weighted graphs? If so, how should weights be defined in the data?
2. If a graph has only two nodes, networkx draws them overlapping; where should I change the code to fix this?

A question about modifying the code

Hi, my tests also found this problem: the first graph contains 2 isomorphic occurrences and the second contains 3, so together there should be 5, but running the program gives 2. Could you tell me how to modify the code to get 5?

strange characters

Hi,

In the readme file under graphdata folder, there are some strange characters as follows,

NOTICE: 
1.	All labels cannot be ¡°0¡± or ¡°1¡±,  and it should be larger than ¡°1¡±;
2.  Each data file or query file should end with ¡° t # -1¡±, otherwise it will lead to a bug.

Can you please take a look? Thanks.

'str' and 'Vertex' type comparison error

While trying with our own generated data, we encountered this error:

in _get_backward_edge
    if g.vertices[e1.frm].vlb < g.vertices[e2.to] or (
TypeError: '<' not supported between instances of 'str' and 'Vertex'

We fixed this line as:

if g.vertices[e1.frm].vlb < g.vertices[e2.to].vlb or (
       g.vertices[e1.frm].vlb == g.vertices[e2.to].vlb and
       e1.elb <= e.elb):
    return e

What could be the problem with this? Here is our test data:
graph.gspan.data.txt

We run with this script:

if __name__ == "__main__":
    args_str = '-s 5 -d True -l 1 -p False -w True graph.gspan.data.txt'
    FLAGS, _ = parser.parse_known_args(args=args_str.split())
    gs = main(FLAGS)

Add upper_bound for mined freq. subgraphs

Hi, first of all thank you for the great work! As I mentioned in an earlier issue, I used your package for the practical part of my Bachelor's thesis.

I have made some modifications to the program and want to share them so that other programmers can hopefully profit from my work.

Modification

As described in the gSpan paper, the experiments use an upper bound on the number of graphs generated. For some datasets I used, the process seemed endless, especially with low support. So I added the argument -mm <number> or --max_mining <number>, which stops the mining process once the count of generated features reaches the <number> passed with the argument.
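A minimal sketch of the early-stop idea described above. The `-mm`/`--max_mining` flag name comes from the comment; the `Miner` class here is a toy stand-in for illustration, not the repository's gSpan class:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('-mm', '--max_mining', type=int, default=None,
                    help='stop mining once this many frequent subgraphs are found')


class Miner:
    """Toy stand-in for a recursive pattern miner with an upper bound."""

    def __init__(self, max_mining=None):
        self.max_mining = max_mining
        self.found = []

    def _limit_reached(self):
        return (self.max_mining is not None
                and len(self.found) >= self.max_mining)

    def mine(self, candidates):
        for pattern in candidates:
            if self._limit_reached():   # the early exit added by --max_mining
                return
            self.found.append(pattern)
            # a real miner would recurse into extensions of `pattern` here
```

In the real code, the same check would guard the recursive subgraph-mining call so the search tree is abandoned as soon as the bound is hit.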

How to map the patterns in the input graph

Hi

Thanks for the code. I tried it and it works fine.

However, I have an issue. Once I have the patterns, how can I match or find them in the input graph? I read the paper about the algorithm and I know that vertices are relabeled. How can I use the patterns if I cannot match them in the input?
I've been stuck here for a couple of days and would really appreciate your help.

Thanks.

Citing the project

Hello,

I am writing this issue/question to ask how to cite this repository. Is there a preferred way to do this?

I am using the code from this repository in my Bachelor's thesis on structure mining. I am citing the paper you reference in the project's readme.

Thanks for your answer and best regards.
