Self-Taught Convolutional Neural Networks for Short Text Clustering

@article{xu2017self,
title={Self-Taught Convolutional Neural Networks for Short Text Clustering},
author={Xu, Jiaming and Xu, Bo and Wang, Peng and Zheng, Suncong and Tian, Guanhua and Zhao, Jun and Xu, Bo},
journal={Neural Networks},
volume={88},
pages={22--31},
year={2017}
}

Note that:

Below are instructions for the demo dataset and software accompanying the paper [Self-Taught Convolutional Neural Networks for Short Text Clustering].

Usage:

Download the software and dataset packages and put them into one folder;

The main function is ./software/main_STC2.m: first run "cd ./software/", then run main_STC2.m in MATLAB;

Notices:

The suggested machine memory is 16 GB of RAM;

The suggested MATLAB version is R2011 or later;

This is a demo package that includes all the details of the proposed method and the baselines;

K-means clustering is very slow on the original high-dimensional (20,000 to 30,000 dimensions) text features;
If you want to run clustering via K-means, please be patient. We strongly suggest referring directly to the K-means results in our paper, which reports the average over 500 K-means runs;

Please feel free to email me if you have any problems using this package.
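
The K-means notice above mentions averaging over many runs. The following is a minimal Python sketch of that idea, not the package's MATLAB code: it runs a plain Lloyd's K-means with a different random seed each time and averages the clustering accuracy. The function names, the toy 2-D points, and the reading of ACC as accuracy under the best one-to-one mapping of cluster ids to labels are all assumptions made for illustration.

```python
import random
from itertools import permutations
from statistics import mean


def kmeans(points, k, seed, n_iter=20):
    """Plain Lloyd's algorithm; returns a cluster index for each point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # random initial centers
    assign = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest center by squared Euclidean distance.
        assign = [
            min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:  # keep the old center if a cluster went empty
                centers[j] = tuple(mean(dim) for dim in zip(*members))
    return assign


def clustering_accuracy(truth, pred, k):
    """Best accuracy over all mappings of cluster ids to labels 0..k-1.

    Brute force over permutations: only feasible for small k (assumption).
    """
    best = 0
    for perm in permutations(range(k)):
        hits = sum(1 for t, p in zip(truth, pred) if perm[p] == t)
        best = max(best, hits)
    return best / len(truth)


def average_accuracy(points, truth, k, n_runs):
    """Average the accuracy over n_runs K-means runs with seeds 0..n_runs-1."""
    scores = [clustering_accuracy(truth, kmeans(points, k, seed), k) for seed in range(n_runs)]
    return mean(scores)


# Toy data (hypothetical): two well-separated groups of 2-D points.
toy_points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 5.2), (5.2, 5.1)]
toy_truth = [0, 0, 0, 1, 1, 1]
print(average_accuracy(toy_points, toy_truth, k=2, n_runs=10))
```

The permutation search is only practical for a handful of clusters. For the 8 or 20 topics in these datasets, an assignment solver would be needed in its place.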

Instructions of Archives:

./README.md: Some notices and instructions.
./dataset/

-- Biomedical.txt: the 20,000 raw short texts;
-- Biomedical_gnd.txt: the labels;
-- Biomedical_vocab2idx.dic: vocabulary index;
-- Biomedical_index.txt: the texts with words converted to vocabulary indices;
-- Biomedical-lite.mat: mini dataset only including feature vectors (fea) and labels (gnd);
-- Biomedical-STC2.mat: dataset for STC^2, including 20,000 short texts, 20 topics/tags and the pre-trained word embeddings;
-- SearchSnippets.txt: the 12,340 raw short texts;
-- SearchSnippets_vocab2idx.dic: vocabulary index;
-- SearchSnippets_index.txt: the texts with words converted to vocabulary indices;
-- SearchSnippets-lite.mat: mini dataset only including feature vectors (fea) and labels (gnd);
-- SearchSnippets-STC2.mat: dataset for STC^2, including 12,340 short texts, 8 topics/tags and the pre-trained word embeddings;
-- StackOverflow.txt: the 20,000 raw short texts;
-- StackOverflow_gnd.txt: the labels;
-- StackOverflow_vocab2idx.dic: vocabulary index;
-- StackOverflow_index.txt: the texts with words converted to vocabulary indices;
-- StackOverflow-lite.mat: mini dataset only including feature vectors (fea) and labels (gnd);
-- StackOverflow-STC2.mat: dataset for STC^2, including 20,000 short texts, 20 topics/tags and the pre-trained word embeddings;
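
As a quick sanity check on the raw files listed above, the sketch below reads a text file and its label file in Python and verifies that the counts match. It assumes one short text per line, one integer label per line, and UTF-8 encoding. None of these is stated in this README, and `load_short_texts` is a hypothetical helper, not part of the package.

```python
from pathlib import Path


def load_short_texts(text_path, label_path):
    """Read texts and labels, assuming one item per line in each file."""
    texts = Path(text_path).read_text(encoding="utf-8").splitlines()
    # Assumption: labels are whitespace-separated integers, one per text.
    labels = [int(tok) for tok in Path(label_path).read_text(encoding="utf-8").split()]
    if len(texts) != len(labels):
        raise ValueError(f"{len(texts)} texts but {len(labels)} labels")
    return texts, labels
```

For example, `load_short_texts("dataset/Biomedical.txt", "dataset/Biomedical_gnd.txt")` should return 20,000 texts and 20,000 labels if the files follow the assumed layout.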

./software/: Main folder of software;

-- main_STC2.m: main function; select one clustering method here: Kmeans, RecNN, AveEmbedding, LSA, Spectral_LE, etc.;
-- run.sh: runs the code from the command line, for Linux users rather than Windows users;
-- STC2.m: interfaces of clustering methods;
-- STC2_CNN.m: interfaces of DCNN;
-- AE/: Average Embedding (AE) folder;
-- DCNN/: Dynamic Convolutional Neural Network (DCNN)[1] folder;
-- LE/: Laplacian Eigenmaps (LE)[2] folder;
-- LPI/: Locality Preserving Indexing (LPI)[3] folder;
-- LSA/: Latent Semantic Analysis (LSA)[4] folder;
-- Para2vec/: Paragraph vector (Para2vec)[5] folder;
-- RecNN/: Recursive Neural Network (RecNN)[6] folder;
-- results/: All clustering evaluation results (ACC and NMI) are saved in this folder;
-- tools/: Tool folder;
-- benchmarks/: Contains some classification benchmarks: SVM-linear or SVM-RBF on TF, TF-IDF or AE features. See this folder for more classification details.
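
The results/ folder stores ACC and NMI scores. For readers who want to recompute NMI outside MATLAB, here is a small Python sketch. It assumes the normalization NMI = I(T; P) / sqrt(H(T) * H(P)), which is one common choice and may differ from the package's own evaluation code. The function names are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(truth, pred):
    """Normalized mutual information between true labels and cluster ids."""
    if len(truth) != len(pred):
        raise ValueError("truth and pred must have the same length")
    n = len(truth)
    count_t = Counter(truth)
    count_p = Counter(pred)
    joint = Counter(zip(truth, pred))
    # I(T; P) = sum over pairs of p(t, p) * log(p(t, p) / (p(t) * p(p)))
    mi = sum(
        (c / n) * math.log(c * n / (count_t[t] * count_p[p]))
        for (t, p), c in joint.items()
    )
    denom = math.sqrt(entropy(truth) * entropy(pred))
    # If either side has a single cluster, the entropy is zero; return 0.0.
    return mi / denom if denom > 0 else 0.0
```

NMI does not depend on how cluster ids are numbered, so no mapping between cluster ids and labels is needed before calling `nmi`.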

References:

[1]. N. Kalchbrenner, E. Grefenstette, P. Blunsom, A convolutional neural network for modelling sentences, ACL, 2014.
[2]. M. Belkin, P. Niyogi, Laplacian eigenmaps and spectral techniques for embedding and clustering, NIPS, 2001.
[3]. D. Cai, X. He, J. Han, Document clustering using locality preserving indexing, IEEE Transactions on Knowledge and Data Engineering, 2005.
[4]. S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, R. A. Harshman, Indexing by latent semantic analysis, Journal of the American Society for Information Science, 1990.
[5]. Q. Le, T. Mikolov, Distributed representations of sentences and documents, ICML, 2014.
[6]. R. Socher, J. Pennington, E. H. Huang, A. Y. Ng, C. D. Manning, Semi-supervised recursive autoencoders for predicting sentiment distributions, EMNLP, 2011.
