
sdcn's Introduction

SDCN

Structural Deep Clustering Network

Paper

https://arxiv.org/abs/2002.01633

https://github.com/461054993/SDCN/blob/master/SDCN.pdf

Dataset

Due to file size limits, the complete data can be found at:

Baidu Netdisk:

graph: link: https://pan.baidu.com/s/1MEWr1KyrtBQndVNy8_y2Lw password: opc1

data: link: https://pan.baidu.com/s/1kqoWlElbWazJyrTdv1sHNg password: 1gd4

Google Drive:

graph: https://drive.google.com/file/d/10rnVwIAuVRczmZJSX7mpSTR0-HVnMWLh/view?usp=sharing

data: https://drive.google.com/file/d/1VjH6xqt82GaQwwiy-4O2GedMgQMLN6dm/view?usp=sharing

Code

python sdcn.py --name [usps|hhar|reut|acm|dblp|cite]

Q&A

  • Q: Why not use distribution Q to supervise distribution P directly?
    A: The reasons are two-fold: 1) Previous work (e.g., DeepCluster) has used the clustering assignments as pseudo-labels to re-train the encoder in a supervised manner. However, in our experiments we found that the gradient of the cross-entropy loss is too aggressive and destabilizes the embedding space. 2) Even if the cross-entropy loss is replaced with KL divergence, a problem remains: Q itself carries no extra clustering information. The original motivation of our research on deep clustering is to integrate the clustering objective into the powerful representation ability of deep learning. Therefore, we introduce the distribution P to increase the cohesion of the cluster assignments; details can be found in DEC.
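The target distribution P mentioned above follows DEC; a minimal NumPy sketch (the matrix `q` below is a toy example, not data from this repository):

```python
import numpy as np

def target_distribution(q):
    """DEC-style target distribution P from soft assignments Q.

    p_ij = (q_ij^2 / f_j) / sum_k (q_ik^2 / f_k), with f_j = sum_i q_ij.
    Squaring sharpens confident assignments; dividing by the cluster
    frequency f_j keeps large clusters from dominating the loss.
    """
    weight = q ** 2 / q.sum(axis=0)                    # q_ij^2 / f_j
    return weight / weight.sum(axis=1, keepdims=True)  # row-normalize

# toy example: 3 samples, 2 clusters
q = np.array([[0.6, 0.4],
              [0.9, 0.1],
              [0.2, 0.8]])
p = target_distribution(q)
```

Each row of P still sums to 1, but confident assignments (like row 2) become even sharper, which is what provides the self-supervision signal.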

  • Q: How to apply SDCN to other datasets?
    A: In general, three steps are required to apply our model to another dataset.

    1. Construct the KNN graph based on the similarity of features. Details can be found in calcu_graph.py.
    2. Pretrain the autoencoder and save the pre-trained model. Details can be found in data/pretrain.py.
    3. Replace the args in sdcn.py and run the code.
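Step 1 can be sketched as follows. This is a simplified illustration, not the exact logic of calcu_graph.py (which has its own distance measure and options); the function name and output path are illustrative:

```python
import numpy as np

def construct_knn_graph(features, k=3, out_path=None):
    """Build a KNN edge list from a feature matrix using cosine distance.
    If out_path is given, write one 'i j' pair per line, matching the
    edge-list format of the graph/*.txt files."""
    x = features / np.linalg.norm(features, axis=1, keepdims=True)
    dist = 1.0 - x @ x.T                      # pairwise cosine distance
    np.fill_diagonal(dist, np.inf)            # a node is not its own neighbor
    edges = [(i, j) for i in range(len(dist))
             for j in np.argsort(dist[i])[:k]]
    if out_path is not None:
        with open(out_path, "w") as f:
            f.writelines(f"{i} {j}\n" for i, j in edges)
    return edges

rng = np.random.default_rng(0)
edges = construct_knn_graph(rng.random((20, 8)), k=3)
```

Each node contributes k outgoing edges, so 20 nodes with k=3 yield 60 edge lines.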

Reference

If you make use of the SDCN model in your research, please cite the following in your manuscript:

@inproceedings{sdcn2020,
  author    = {Deyu Bo and
               Xiao Wang and
               Chuan Shi and
               Meiqi Zhu and
               Emiao Lu and
               Peng Cui},
  title     = {Structural Deep Clustering Network},
  booktitle = {{WWW}},
  pages     = {1400--1410},
  publisher = {{ACM} / {IW3C2}},
  year      = {2020}
}

sdcn's People

Contributors

bdy9527

sdcn's Issues

how to pre-train the model?

I followed the parameter settings in the paper to pre-train the model, but the final result is not very good.
Could you post the pre-training code?

AE pretraining file

Hello, I have a few questions. While studying your code, I found that the AE model pretrained with the pretraining file you provide gives fairly poor clustering metrics in my experiments. How did you obtain the pretrained AE model? Could you help me?

How to 'construct_graph'?

Thanks for your work. While reading the code, I noticed that construct_graph in calcu_graph.py requires label data. For unlabeled non-graph data, or unlabeled graph data, how should the graph be constructed?

Experimental results

Did you first train a well-performing AE model and then train SDCN on top of it, or did you re-pretrain the AE from scratch before training SDCN in every experiment?

Eq(7)?

Hi, I cannot find where Eq. (7) is implemented in your code.

Loss function

I do not fully understand the loss function obtained for the datasets in the paper; could there be an issue with it?

Graph construction

Why do you add this line before computing distances?

features[features > 0] = 1
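One plausible reading (my interpretation, not confirmed by the authors) is that binarizing bag-of-words counts before computing cosine distance makes similarity depend on which terms co-occur rather than on raw term frequencies:

```python
import numpy as np

def cosine_dist(x, y):
    """Cosine distance between two vectors."""
    return 1.0 - x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

# illustrative: two documents sharing the same vocabulary,
# but with very different term counts
a = np.array([10.0, 0.0, 1.0])
b = np.array([1.0, 0.0, 10.0])

raw = cosine_dist(a, b)                      # large distance: counts differ

# after features[features > 0] = 1, both become [1, 0, 1]
a_bin = (a > 0).astype(float)
b_bin = (b > 0).astype(float)
binarized = cosine_dist(a_bin, b_bin)        # zero: same term support
```

Under binarization the two documents become identical, so the KNN graph connects them; on raw counts they look dissimilar.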

The K in K-means, and parameter updates during training

First, thank you very much for open-sourcing your work; I have learned a lot from it. I have two questions:
1) The setting of K. The paper discusses the choice of K for the KNN graph; did you also discuss how K is chosen for K-means, or is it simply set to the ground-truth number of classes in each dataset?
2) Updating the cluster centers $\mu$ during training. The pseudo-code does not state explicitly whether $\mu$ is updated during training. In the DEC and IDEC papers it is updated, and in your code the cluster centers are model parameters, so I want to confirm: $\mu$ is updated, right?
Thanks!
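On the second question: registering the cluster centers as an nn.Parameter, as the DEC/IDEC/SDCN family does, means the optimizer updates them by backpropagation. A minimal sketch of that pattern (class and variable names here are illustrative, not copied from sdcn.py):

```python
import torch
import torch.nn as nn

class ClusterLayer(nn.Module):
    """Soft-assignment layer: cluster centers mu are an nn.Parameter,
    so any optimizer over model.parameters() will update them."""
    def __init__(self, n_clusters, n_z, alpha=1.0):
        super().__init__()
        self.alpha = alpha
        self.mu = nn.Parameter(torch.randn(n_clusters, n_z))

    def forward(self, z):
        # Student's t kernel, as in DEC:
        # q_ij ∝ (1 + ||z_i - mu_j||^2 / alpha)^(-(alpha+1)/2)
        dist = torch.sum((z.unsqueeze(1) - self.mu) ** 2, dim=2)
        q = (1.0 + dist / self.alpha) ** (-(self.alpha + 1) / 2)
        return q / q.sum(dim=1, keepdim=True)

layer = ClusterLayer(n_clusters=3, n_z=8)
q = layer(torch.randn(5, 8))
```

Because `mu` appears in `layer.parameters()`, passing the model to the optimizer is all it takes for the centers to be trained.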

About the graph files

I currently have the following questions and look forward to your answers:
(1) As I understand it, acm, cite, etc. are themselves graph datasets; why do acm_graph, cite_graph, etc. also exist?
(2) In a graph file, do the two numbers on each line, e.g. 31, 2016, mean there is an edge between sample 31 and sample 2016?

Problem in the eva function

The evaluation function eva in evaluation.py calls hand-written acc/f1 functions that have a serious problem.
When the predicted and true label sets do not match, the acc function rewrites the contents of pred, and it modifies the original array in place. The subsequent NMI/ARI computations therefore use the modified pred, so all four reported metrics (acc/nmi/ari/f1) can deviate from the true values.
Using the metric functions provided by sklearn.metrics instead gives very different results in a specially constructed case: suppose there are 100 points, each in its own class, with true labels [0, ..., 99], and all points are predicted as the same class. A correct evaluation should score very badly here, yet the eva function in evaluation.py reports excellent scores.
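A standard remedy for the issue described above is to compute clustering accuracy with an optimal one-to-one label matching (Hungarian algorithm) without touching the prediction array, so later NMI/ARI calls see the original data. A sketch:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def cluster_acc(y_true, y_pred):
    """Clustering accuracy under the best one-to-one cluster-to-class
    matching. Does not modify y_pred, so NMI/ARI computed afterwards
    on the same array are unaffected."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    n = max(y_pred.max(), y_true.max()) + 1
    w = np.zeros((n, n), dtype=np.int64)     # contingency table
    for t, p in zip(y_true, y_pred):
        w[p, t] += 1
    row, col = linear_sum_assignment(-w)     # maximize matched counts
    return w[row, col].sum() / y_true.size

# degenerate case from the issue: distinct true labels, one predicted cluster
bad = cluster_acc(list(range(6)), [0] * 6)   # correctly scores ~1/6
```

On the all-one-cluster example this returns 1/n rather than a spuriously high score, which is the behavior the issue asks for.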

some questions

The paper gives a balance coefficient of 0.5, but in your code this value is 1. Can you tell me why?

question about the dataset (ACM)

Thanks for your great work.
I have a small question about your ACM dataset. I noticed you mention in the paper that you "divide the papers into three classes (database, wireless communication, data mining) by their research area". Does that mean "database" is labelled 0, "wireless communication" 1, and "data mining" 2?

Just want to better understand the graph and the results. Thanks!

Relationship between the data and graph datasets

Hello, what is the relationship between the corresponding files in the data folder and the graph folder? For example, how are reut.pkl, reut.txt and reut_lable.txt related to reut1_graph.txt, reut3_graph.txt and reut5_graph.txt? How is reut1_graph.txt produced from reut.txt?

Some trouble with pretrain.py

I tried to pretrain the non-graph datasets usps and hhar with pretrain.py, but the resulting .pkl file performs poorly in sdcn.py.
How can I reproduce the pretrained files you provide?
Thanks for your answer.

Results for citeseer and acm

I'm having some trouble reproducing the results for citeseer and acm.

python sdcn.py --name cite gives me

#run 1
199Q :acc 0.6008 , nmi 0.3269 , ari 0.3326 , f1 0.5654
199Z :acc 0.6534 , nmi 0.3687 , ari 0.3591 , f1 0.5702
199P :acc 0.6008 , nmi 0.3269 , ari 0.3326 , f1 0.5654

#run 2
199Q :acc 0.5939 , nmi 0.3225 , ari 0.3255 , f1 0.5600
199Z :acc 0.6351 , nmi 0.3697 , ari 0.3804 , f1 0.5960
199P :acc 0.5939 , nmi 0.3225 , ari 0.3255 , f1 0.5600

#run 3
199Q :acc 0.6011 , nmi 0.3264 , ari 0.3340 , f1 0.5639
199Z :acc 0.6552 , nmi 0.3796 , ari 0.3988 , f1 0.6061
199P :acc 0.6011 , nmi 0.3264 , ari 0.3340 , f1 0.5639

python sdcn.py --name acm

#run 1
199Q :acc 0.8628 , nmi 0.5744 , ari 0.6364 , f1 0.8618
199Z :acc 0.8797 , nmi 0.6253 , ari 0.6792 , f1 0.8783
199P :acc 0.8628 , nmi 0.5744 , ari 0.6364 , f1 0.8618

#run 2
199Q :acc 0.8688 , nmi 0.5910 , ari 0.6515 , f1 0.8680
199Z :acc 0.8846 , nmi 0.6376 , ari 0.6915 , f1 0.8839
199P :acc 0.8688 , nmi 0.5910 , ari 0.6515 , f1 0.8680

Each time, the scores differ from those in the paper. Is there perhaps some missing configuration or setting in the paper or the code?

graph.txt

What is the difference between reut1_graph.txt, reut3_graph.txt, reut5_graph.txt and reut10_graph.txt?
When I run 'sdcn.py --name reut', I found only reut3_graph.txt is used, but in calcu_graph.py the fname is 'graph/reut10_graph.txt'. Is only one graph file used by sdcn.py?

Pretraining

Hello, during pretraining I have already changed the number of input features, but a dimension-mismatch error is still reported. How should I fix this? Looking forward to your reply.

batch_size

Thanks for your great work, but I have some questions about the training batch size.
Must all the data be fed in at once, or can we use a batch_size?
If we use mini-batches, what do we do with the adjacency matrix?

Lack of description about dataset for training

Hello
Thanks for contributing this paper and code.
But I feel the description of the dataset structure is lacking.
I am going to apply this to face clustering.
Is it feasible for clustering 5M faces as well?
Could you explain it in detail?

The normalize function differs from the formula in the paper

The adjacency-matrix normalization in normalize(mx) in utils.py is:

def normalize(mx):
    """Row-normalize sparse matrix"""
    rowsum = np.array(mx.sum(1))
    r_inv = np.power(rowsum, -1).flatten()
    r_inv[np.isinf(r_inv)] = 0.
    r_mat_inv = sp.diags(r_inv)
    mx = r_mat_inv.dot(mx)
    return mx

This computes D^-1 * A, but the paper uses D^-1/2 * A * D^-1/2 (ignoring the hats on the symbols), i.e. the function below, which is the adjacency normalization from the original GCN code:

def normalize_adj(adj, self_loop=True):
    """Symmetrically normalize adjacency matrix."""
    if self_loop:
        adj = adj + sp.eye(adj.shape[0])
    adj = sp.coo_matrix(adj)
    rowsum = np.array(adj.sum(1))
    d_inv_sqrt = np.power(rowsum, -0.5).flatten()
    d_inv_sqrt[np.isinf(d_inv_sqrt)] = 0.
    d_mat_inv_sqrt = sp.diags(d_inv_sqrt)
    return adj.dot(d_mat_inv_sqrt).transpose().dot(d_mat_inv_sqrt).tocoo()

That said, I noticed the DAEGC code also uses the row-normalized formula.
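The difference between the two normalizations can be checked numerically: on a symmetric adjacency matrix, D^-1 A is row-stochastic but generally asymmetric, while D^-1/2 A D^-1/2 stays symmetric. A small dense sketch (the toy matrix is illustrative):

```python
import numpy as np

# toy symmetric adjacency: node 0 connected to nodes 1 and 2
A = np.array([[0., 1., 1.],
              [1., 0., 0.],
              [1., 0., 0.]])
A_hat = A + np.eye(3)                  # add self-loops, as in GCN
d = A_hat.sum(axis=1)                  # degree vector

rw = A_hat / d[:, None]                # D^-1 A  (row-normalized, as in utils.py)

d_inv_sqrt = 1.0 / np.sqrt(d)
sym = d_inv_sqrt[:, None] * A_hat * d_inv_sqrt[None, :]   # D^-1/2 A D^-1/2
```

The row-normalized version averages neighbor features; the symmetric version additionally downweights messages from high-degree neighbors, which is what the paper's formula specifies.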

Parameters of self.ae = AE()

self.ae = AE(
    n_enc_1=n_enc_1,
    n_enc_2=n_enc_2,
    n_enc_3=n_enc_3,
    n_dec_1=n_dec_1,
    n_dec_2=n_dec_2,
    n_dec_3=n_dec_3,
    n_input=n_input,
    n_z=n_z)

Hello, are the features referenced by the parameters in this code the ones processed by the fully connected layers of the AE, which are then fed into the corresponding GCN layers?

About run env

Hello, could you please provide a requirements.txt and the PyTorch version?

Pretraining steps

Hello, I ran into some problems while applying the SDCN framework to my own dataset (unlabeled, non-graph data) and would like to ask for your advice.
Following the three steps you described for applying SDCN to other datasets: step one, constructing the KNN graph for my dataset, is done. In the pretraining step, my input x has shape [8500, 20000]; following the code, I modified the model = AE(...) part, but training on both CPU and GPU reports a size-mismatch error, and even after changing the pretraining network to matching shapes it still fails to train.

Dataset encoding

Hello, the datasets you provide are already encoded. How did you encode datasets such as acm?
