
sdcn's Introduction

SDCN

Structural Deep Clustering Network

Paper

https://arxiv.org/abs/2002.01633

https://github.com/461054993/SDCN/blob/master/SDCN.pdf

Dataset

Due to file size limits, the complete data can be found at:

Baidu Netdisk:

graph: link: https://pan.baidu.com/s/1MEWr1KyrtBQndVNy8_y2Lw password: opc1

data: link: https://pan.baidu.com/s/1kqoWlElbWazJyrTdv1sHNg password: 1gd4

Google Drive:

graph: https://drive.google.com/file/d/10rnVwIAuVRczmZJSX7mpSTR0-HVnMWLh/view?usp=sharing

data: https://drive.google.com/file/d/1VjH6xqt82GaQwwiy-4O2GedMgQMLN6dm/view?usp=sharing

Code

python sdcn.py --name [usps|hhar|reut|acm|dblp|cite]

Q&A

  • Q: Why not use distribution Q to supervise distribution P directly?
    A: The reasons are two-fold: 1) Previous work (e.g., DeepCluster) has used the clustering assignments as pseudo-labels to re-train the encoder in a supervised manner. However, in our experiments we found that the gradient of the cross-entropy loss is too aggressive and destabilizes the embedding space. 2) Even if the cross-entropy loss is replaced with KL divergence, a problem remains: Q itself carries no extra clustering information. The original motivation of our research on deep clustering is to integrate the clustering objective into the powerful representation ability of deep learning. Therefore, we introduce the distribution P to increase the cohesion of the cluster assignments; details can be found in DEC.
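The target distribution P mentioned above follows DEC; a minimal NumPy sketch (the matrix `q` below is a toy example, not data from this repository):

```python
import numpy as np

def target_distribution(q):
    """DEC-style target distribution P from soft assignments Q.

    p_ij = (q_ij^2 / f_j) / sum_k (q_ik^2 / f_k), with f_j = sum_i q_ij.
    Squaring sharpens confident assignments; dividing by the cluster
    frequency f_j keeps large clusters from dominating the loss.
    """
    weight = q ** 2 / q.sum(axis=0)                    # q_ij^2 / f_j
    return weight / weight.sum(axis=1, keepdims=True)  # row-normalize

# toy example: 3 samples, 2 clusters
q = np.array([[0.6, 0.4],
              [0.9, 0.1],
              [0.2, 0.8]])
p = target_distribution(q)
```

Each row of P still sums to 1, but confident assignments (like row 2) become even sharper, which is what provides the self-supervision signal.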

  • Q: How to apply SDCN to other datasets?
    A: In general, three steps are required to apply our model to another dataset.

    1. Construct the KNN graph based on the similarity of features. Details can be found in calcu_graph.py.
    2. Pretrain the autoencoder and save the pre-trained model. Details can be found in data/pretrain.py.
    3. Replace the args in sdcn.py and run the code.
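Step 1 can be sketched as follows. This is a simplified illustration, not the exact logic of calcu_graph.py (which has its own distance measure and options); the function name and output path are illustrative:

```python
import numpy as np

def construct_knn_graph(features, k=3, out_path=None):
    """Build a KNN edge list from a feature matrix using cosine distance.
    If out_path is given, write one 'i j' pair per line, matching the
    edge-list format of the graph/*.txt files."""
    x = features / np.linalg.norm(features, axis=1, keepdims=True)
    dist = 1.0 - x @ x.T                      # pairwise cosine distance
    np.fill_diagonal(dist, np.inf)            # a node is not its own neighbor
    edges = [(i, j) for i in range(len(dist))
             for j in np.argsort(dist[i])[:k]]
    if out_path is not None:
        with open(out_path, "w") as f:
            f.writelines(f"{i} {j}\n" for i, j in edges)
    return edges

rng = np.random.default_rng(0)
edges = construct_knn_graph(rng.random((20, 8)), k=3)
```

Each node contributes k outgoing edges, so 20 nodes with k=3 yield 60 edge lines.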

Reference

If you make use of the SDCN model in your research, please cite the following in your manuscript:

@inproceedings{sdcn2020,
  author    = {Deyu Bo and
               Xiao Wang and
               Chuan Shi and
               Meiqi Zhu and
               Emiao Lu and
               Peng Cui},
  title     = {Structural Deep Clustering Network},
  booktitle = {{WWW}},
  pages     = {1400--1410},
  publisher = {{ACM} / {IW3C2}},
  year      = {2020}
}

sdcn's People

Contributors

bdy9527

sdcn's Issues

how to pre-train the model?

I followed the parameter settings in the paper to pre-train the model, but the final result is not very good.
Could you post the pre-training code?

AE pretraining file

Hello, I have a few questions. While studying your code, I found that the AE model pretrained with the pretraining file you provide gives fairly poor clustering metrics in my experiments. How did you obtain the pretrained AE model? Could you help me?

How to 'construct_graph'?

Thanks for your work. While reading the code, I noticed that construct_graph in calcu_graph.py requires label data. For unlabeled non-graph data, or unlabeled graph data, how should the graph be constructed?

Experimental results

Did you first train a well-performing AE model and then train SDCN on top of it, or did you re-pretrain the AE from scratch before training SDCN in every experiment?

Eq(7)?

Hi, I cannot find where Eq. (7) is implemented in your code.

Loss function

I do not fully understand the loss function obtained for the datasets in the paper; could there be an issue with it?

Graph construction

Why do you add this line before computing distances?

features[features > 0] = 1
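One plausible reading (my interpretation, not confirmed by the authors) is that binarizing bag-of-words counts before computing cosine distance makes similarity depend on which terms co-occur rather than on raw term frequencies:

```python
import numpy as np

def cosine_dist(x, y):
    """Cosine distance between two vectors."""
    return 1.0 - x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

# illustrative: two documents sharing the same vocabulary,
# but with very different term counts
a = np.array([10.0, 0.0, 1.0])
b = np.array([1.0, 0.0, 10.0])

raw = cosine_dist(a, b)                      # large distance: counts differ

# after features[features > 0] = 1, both become [1, 0, 1]
a_bin = (a > 0).astype(float)
b_bin = (b > 0).astype(float)
binarized = cosine_dist(a_bin, b_bin)        # zero: same term support
```

Under binarization the two documents become identical, so the KNN graph connects them; on raw counts they look dissimilar.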

The K in K-means, and parameter updates during training

First, thank you very much for open-sourcing your work; I have learned a lot from it. I have two questions:
1) The setting of K. The paper discusses the choice of K for the KNN graph; did you also discuss how K is chosen for K-means, or is it simply set to the ground-truth number of classes in each dataset?
2) Updating the cluster centers $\mu$ during training. The pseudo-code does not state explicitly whether $\mu$ is updated during training. In the DEC and IDEC papers it is updated, and in your code the cluster centers are model parameters, so I want to confirm: $\mu$ is updated, right?
Thanks!
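On the second question: registering the cluster centers as an nn.Parameter, as the DEC/IDEC/SDCN family does, means the optimizer updates them by backpropagation. A minimal sketch of that pattern (class and variable names here are illustrative, not copied from sdcn.py):

```python
import torch
import torch.nn as nn

class ClusterLayer(nn.Module):
    """Soft-assignment layer: cluster centers mu are an nn.Parameter,
    so any optimizer over model.parameters() will update them."""
    def __init__(self, n_clusters, n_z, alpha=1.0):
        super().__init__()
        self.alpha = alpha
        self.mu = nn.Parameter(torch.randn(n_clusters, n_z))

    def forward(self, z):
        # Student's t kernel, as in DEC:
        # q_ij ∝ (1 + ||z_i - mu_j||^2 / alpha)^(-(alpha+1)/2)
        dist = torch.sum((z.unsqueeze(1) - self.mu) ** 2, dim=2)
        q = (1.0 + dist / self.alpha) ** (-(self.alpha + 1) / 2)
        return q / q.sum(dim=1, keepdim=True)

layer = ClusterLayer(n_clusters=3, n_z=8)
q = layer(torch.randn(5, 8))
```

Because `mu` appears in `layer.parameters()`, passing the model to the optimizer is all it takes for the centers to be trained.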

About the graph files

I currently have the following questions and look forward to your answers:
(1) As I understand it, acm, cite, etc. are themselves graph datasets; why do acm_graph, cite_graph, etc. also exist?
(2) In a graph file, do the two numbers on each line, e.g. 31, 2016, mean there is an edge between sample 31 and sample 2016?

Problem in the eva function

The evaluation function eva in evaluation.py calls hand-written acc/f1 functions that have a serious problem.
When the predicted and true label sets do not match, the acc function rewrites the contents of pred, and it modifies the original array in place. The subsequent NMI/ARI computations therefore use the modified pred, so all four reported metrics (acc/nmi/ari/f1) can deviate from the true values.
Using the metric functions provided by sklearn.metrics instead gives very different results in a specially constructed case: suppose there are 100 points, each in its own class, with true labels [0, ..., 99], and all points are predicted as the same class. A correct evaluation should score very badly here, yet the eva function in evaluation.py reports excellent scores.
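A standard remedy for the issue described above is to compute clustering accuracy with an optimal one-to-one label matching (Hungarian algorithm) without touching the prediction array, so later NMI/ARI calls see the original data. A sketch:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def cluster_acc(y_true, y_pred):
    """Clustering accuracy under the best one-to-one cluster-to-class
    matching. Does not modify y_pred, so NMI/ARI computed afterwards
    on the same array are unaffected."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    n = max(y_pred.max(), y_true.max()) + 1
    w = np.zeros((n, n), dtype=np.int64)     # contingency table
    for t, p in zip(y_true, y_pred):
        w[p, t] += 1
    row, col = linear_sum_assignment(-w)     # maximize matched counts
    return w[row, col].sum() / y_true.size

# degenerate case from the issue: distinct true labels, one predicted cluster
bad = cluster_acc(list(range(6)), [0] * 6)   # correctly scores ~1/6
```

On the all-one-cluster example this returns 1/n rather than a spuriously high score, which is the behavior the issue asks for.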

some questions

The paper gives a balance coefficient of 0.5, but in your code this value is 1. Can you tell me why?

question about the dataset (ACM)

Thanks for your great work.
I have a small question about your ACM dataset. I noticed you mention in the paper that you "divide the papers into three classes (database, wireless communication, data mining) by their research area". Does that mean "database" is labelled 0, "wireless communication" 1, and "data mining" 2?

Just want to better understand the graph and the results. Thanks!

Relationship between the data and graph datasets

Hello, what is the relationship between the corresponding files in the data folder and the graph folder? For example, how are reut.pkl, reut.txt and reut_lable.txt related to reut1_graph.txt, reut3_graph.txt and reut5_graph.txt? How is reut1_graph.txt produced from reut.txt?

Some trouble with pretrain.py

I tried to pretrain the non-graph datasets usps and hhar with pretrain.py, but the resulting .pkl file performs poorly in sdcn.py.
How can I reproduce the pretrained files you provide?
Thanks for your answer.

Results for citeseer and acm

I'm having some trouble reproducing the results for citeseer and acm.

python sdcn.py --name cite gives me

#run 1
199Q :acc 0.6008 , nmi 0.3269 , ari 0.3326 , f1 0.5654
199Z :acc 0.6534 , nmi 0.3687 , ari 0.3591 , f1 0.5702
199P :acc 0.6008 , nmi 0.3269 , ari 0.3326 , f1 0.5654

#run 2
199Q :acc 0.5939 , nmi 0.3225 , ari 0.3255 , f1 0.5600
199Z :acc 0.6351 , nmi 0.3697 , ari 0.3804 , f1 0.5960
199P :acc 0.5939 , nmi 0.3225 , ari 0.3255 , f1 0.5600

#run 3
199Q :acc 0.6011 , nmi 0.3264 , ari 0.3340 , f1 0.5639
199Z :acc 0.6552 , nmi 0.3796 , ari 0.3988 , f1 0.6061
199P :acc 0.6011 , nmi 0.3264 , ari 0.3340 , f1 0.5639

python sdcn.py --name acm

#run 1
199Q :acc 0.8628 , nmi 0.5744 , ari 0.6364 , f1 0.8618
199Z :acc 0.8797 , nmi 0.6253 , ari 0.6792 , f1 0.8783
199P :acc 0.8628 , nmi 0.5744 , ari 0.6364 , f1 0.8618

#run 2
199Q :acc 0.8688 , nmi 0.5910 , ari 0.6515 , f1 0.8680
199Z :acc 0.8846 , nmi 0.6376 , ari 0.6915 , f1 0.8839
199P :acc 0.8688 , nmi 0.5910 , ari 0.6515 , f1 0.8680

Each time, the scores differ from those in the paper. Is there perhaps some missing configuration or setting in the paper or the code?

graph.txt

What is the difference between reut1_graph.txt, reut3_graph.txt, reut5_graph.txt and reut10_graph.txt?
When I run 'sdcn.py --name reut', I found only reut3_graph.txt is used, but in calcu_graph.py the fname is 'graph/reut10_graph.txt'. Is only one graph file used by sdcn.py?

Pretraining

Hello, during pretraining I have already changed the number of input features, but a dimension-mismatch error is still reported. How should I fix this? Looking forward to your reply.

batch_size

Thanks for your great work, but I have some questions about the training batch size.
Must all the data be fed in at once, or can we use a batch_size?
If we use mini-batches, what do we do with the adjacency matrix?

Lack of description about dataset for training

Hello
Thanks for contributing this paper and code.
But I feel the description of the dataset structure is lacking.
I am going to apply this to face clustering.
Is it feasible for clustering 5M faces as well?
Could you explain it in detail?

The normalize function differs from the formula in the paper

The adjacency-matrix normalization in normalize(mx) in utils.py is:

def normalize(mx):
    """Row-normalize sparse matrix"""
    rowsum = np.array(mx.sum(1))
    r_inv = np.power(rowsum, -1).flatten()
    r_inv[np.isinf(r_inv)] = 0.
    r_mat_inv = sp.diags(r_inv)
    mx = r_mat_inv.dot(mx)
    return mx

This computes D^-1 * A, but the paper uses D^-1/2 * A * D^-1/2 (ignoring the hats on the symbols), i.e. the function below, which is the adjacency normalization from the original GCN code:

def normalize_adj(adj, self_loop=True):
    """Symmetrically normalize adjacency matrix."""
    if self_loop:
        adj = adj + sp.eye(adj.shape[0])
    adj = sp.coo_matrix(adj)
    rowsum = np.array(adj.sum(1))
    d_inv_sqrt = np.power(rowsum, -0.5).flatten()
    d_inv_sqrt[np.isinf(d_inv_sqrt)] = 0.
    d_mat_inv_sqrt = sp.diags(d_inv_sqrt)
    return adj.dot(d_mat_inv_sqrt).transpose().dot(d_mat_inv_sqrt).tocoo()

That said, I noticed the DAEGC code also uses the row-normalized formula.
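The difference between the two normalizations can be checked numerically: on a symmetric adjacency matrix, D^-1 A is row-stochastic but generally asymmetric, while D^-1/2 A D^-1/2 stays symmetric. A small dense sketch (the toy matrix is illustrative):

```python
import numpy as np

# toy symmetric adjacency: node 0 connected to nodes 1 and 2
A = np.array([[0., 1., 1.],
              [1., 0., 0.],
              [1., 0., 0.]])
A_hat = A + np.eye(3)                  # add self-loops, as in GCN
d = A_hat.sum(axis=1)                  # degree vector

rw = A_hat / d[:, None]                # D^-1 A  (row-normalized, as in utils.py)

d_inv_sqrt = 1.0 / np.sqrt(d)
sym = d_inv_sqrt[:, None] * A_hat * d_inv_sqrt[None, :]   # D^-1/2 A D^-1/2
```

The row-normalized version averages neighbor features; the symmetric version additionally downweights messages from high-degree neighbors, which is what the paper's formula specifies.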

Parameters of self.ae = AE()

self.ae = AE(
    n_enc_1=n_enc_1,
    n_enc_2=n_enc_2,
    n_enc_3=n_enc_3,
    n_dec_1=n_dec_1,
    n_dec_2=n_dec_2,
    n_dec_3=n_dec_3,
    n_input=n_input,
    n_z=n_z)

Hello, are the features referenced by the parameters in this code the ones processed by the fully connected layers of the AE, which are then fed into the corresponding GCN layers?

About run env

Hello, could you please provide a requirements.txt and the PyTorch version?

Pretraining steps

Hello, I ran into some problems while applying the SDCN framework to my own dataset (unlabeled, non-graph data) and would like to ask for your advice.
Following the three steps you described for applying SDCN to other datasets: step one, constructing the KNN graph for my dataset, is done. In the pretraining step, my input x has shape [8500, 20000]; following the code, I modified the model = AE(...) part, but training on both CPU and GPU reports a size-mismatch error, and even after changing the pretraining network to matching shapes it still fails to train.

Dataset encoding

Hello, the datasets you provide are already encoded. How did you encode datasets such as acm?
