stc2's Introduction

@article{xu2017self,
  title={Self-Taught Convolutional Neural Networks for Short Text Clustering},
  author={Xu, Jiaming and Xu, Bo and Wang, Peng and Zheng, Suncong and Tian, Guanhua and Zhao, Jun and Xu, Bo},
  journal={Neural Networks},
  volume={88},
  pages={22--31},
  year={2017}
}

Note that:

These are the instructions for the demo dataset and software accompanying the paper [Self-Taught Convolutional Neural Networks for Short Text Clustering].

Usage:

  1. Download the software and dataset packages and put them in the same folder;
  2. The main function is ./software/main_STC2.m. First "cd ./software/", then run main_STC2.m in MATLAB.
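
For an unattended run, the two usage steps can be wrapped in a small launcher. The following is only a sketch in Python (not part of the package, which is MATLAB). It assumes that the `matlab` executable is on the PATH and that main_STC2.m needs no arguments; the helper names `build_matlab_command` and `run_demo` are hypothetical.

```python
import subprocess
from pathlib import Path


def build_matlab_command(script="main_STC2"):
    """Build a command line that runs a MATLAB script without the desktop.

    -nodisplay, -nosplash and -r are standard MATLAB startup options.
    The trailing "exit" closes MATLAB once the script has finished.
    """
    return ["matlab", "-nodisplay", "-nosplash", "-r", f"{script}; exit"]


def run_demo(software_dir="./software"):
    """Run the demo from inside the software folder (the "cd ./software/" step).

    Assumption: `matlab` is on the PATH; raises CalledProcessError on failure.
    """
    return subprocess.run(build_matlab_command(), cwd=Path(software_dir), check=True)


if __name__ == "__main__":
    run_demo()
```

The packaged run.sh serves the same purpose on Linux; its contents may differ from this sketch.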

Notices:

  1. The suggested machine memory is 16GB RAM;
  2. The suggested MATLAB version is R2011 or later;
  3. This is a demo package that includes all details of the proposed method and the baselines;
  4. K-means clustering is very slow on the original high-dimensional text features (20,000 to 30,000 dimensions);
    If you want to run clustering with K-means, please be patient. We strongly suggest that you refer directly to the K-means results in our paper, which reports results averaged over 500 K-means runs;
  5. Please feel free to email me if you have any problems using this package.
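
Because K-means depends on its random initialization, results are averaged over many runs, as notice 4 describes. The sketch below illustrates that averaging in plain Python on toy data. It is not the package's MATLAB code: the function names are hypothetical, and cluster purity is used as a simple stand-in for the ACC and NMI scores the package reports.

```python
import random
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, seed, iters=20):
    """Plain Lloyd's K-means; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # random initial centers
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels


def purity(labels, truth):
    """Fraction of points that carry the majority true label of their cluster."""
    by_cluster = {}
    for lab, t in zip(labels, truth):
        by_cluster.setdefault(lab, Counter())[t] += 1
    return sum(max(c.values()) for c in by_cluster.values()) / len(truth)


def average_purity(points, truth, k, runs):
    """Average the score over several differently seeded K-means runs."""
    scores = [purity(kmeans(points, k, seed), truth) for seed in range(runs)]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    pts = [(0.0,), (1.0,), (2.0,), (100.0,), (101.0,), (102.0,)]
    gnd = [0, 0, 0, 1, 1, 1]
    print(average_purity(pts, gnd, k=2, runs=5))
```

On the real 20,000 to 30,000 dimensional features each run is far more expensive than on this toy input, which is why reusing the published averages is recommended.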

Archive contents:

./README.md: Notices and instructions.
./dataset/

-- Biomedical.txt: the 20,000 raw short texts;
-- Biomedical_gnd.txt: the labels;
-- Biomedical_vocab2idx.dic: vocabulary index;
-- Biomedical_index.txt: the texts with words converted to vocabulary indices;
-- Biomedical-lite.mat: mini dataset including only feature vectors (fea) and labels (gnd);
-- Biomedical-STC2.mat: dataset for STC^2, including 20,000 short texts, 20 topics/tags and the pre-trained word embeddings;
-- SearchSnippets.txt: the 12,340 raw short texts;
-- SearchSnippets_vocab2idx.dic: vocabulary index;
-- SearchSnippets_index.txt: the texts with words converted to vocabulary indices;
-- SearchSnippets-lite.mat: mini dataset including only feature vectors (fea) and labels (gnd);
-- SearchSnippets-STC2.mat: dataset for STC^2, including 12,340 short texts, 8 topics/tags and the pre-trained word embeddings;
-- StackOverflow.txt: the 20,000 raw short texts;
-- StackOverflow_gnd.txt: the labels;
-- StackOverflow_vocab2idx.dic: vocabulary index;
-- StackOverflow_index.txt: the texts with words converted to vocabulary indices;
-- StackOverflow-lite.mat: mini dataset including only feature vectors (fea) and labels (gnd);
-- StackOverflow-STC2.mat: dataset for STC^2, including 20,000 short texts, 20 topics/tags and the pre-trained word embeddings;
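
The raw text and label files can also be inspected outside MATLAB. The sketch below, in Python, pairs each short text with its label and checks that the two files have the same length. It rests on assumptions not confirmed by this README: one short text per line in the .txt file, one integer label per line in the _gnd.txt file, and UTF-8 encoding. The function name `load_labeled_texts` is hypothetical.

```python
def load_labeled_texts(text_path, label_path, encoding="utf-8"):
    """Return a list of (text, label) pairs from two line-aligned files.

    Assumptions (not confirmed by the README): line i of the text file
    and line i of the label file describe the same sample, and every
    label is an integer.
    """
    with open(text_path, encoding=encoding) as f:
        texts = f.read().splitlines()
    with open(label_path, encoding=encoding) as f:
        labels = [int(line.strip()) for line in f.read().splitlines()]
    if len(texts) != len(labels):
        raise ValueError(
            f"{len(texts)} texts but {len(labels)} labels; files are not aligned"
        )
    return list(zip(texts, labels))


if __name__ == "__main__":
    pairs = load_labeled_texts(
        "./dataset/StackOverflow.txt", "./dataset/StackOverflow_gnd.txt"
    )
    print(len(pairs), "samples,", len({label for _, label in pairs}), "distinct labels")
```

If the assumptions hold, the StackOverflow files should yield 20,000 samples and 20 distinct labels, matching the description above.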

./software/: Main software folder;

-- main_STC2.m: main function; select one clustering method here: Kmeans, RecNN, AveEmbedding, LSA, Spectral_LE, etc.;
-- run.sh: runs the demo from the command line, for Linux users rather than Windows users;
-- STC2.m: interfaces of the clustering methods;
-- STC2_CNN.m: interfaces of DCNN;
-- AE/: Average Embedding (AE) folder;
-- DCNN/: Dynamic Convolutional Neural Network (DCNN)[1] folder;
-- LE/: Laplacian Eigenmaps (LE)[2] folder;
-- LPI/: Locality Preserving Indexing (LPI)[3] folder;
-- LSA/: Latent Semantic Analysis (LSA)[4] folder;
-- Para2vec/: Paragraph vector (Para2vec)[5] folder;
-- RecNN/: Recursive Neural Network (RecNN)[6] folder;
-- results/: All clustering evaluation results (ACC and NMI) are saved in this folder;
-- tools/: Tool folder;
-- benchmarks/: Contains some classification benchmarks: SVM-linear or SVM-RBF on TF, TFIDF or AE. See this folder for more classification details.
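
The results/ folder stores ACC and NMI scores. As an illustration of what NMI measures, here is a minimal Python sketch of normalized mutual information between true labels and predicted clusters. It assumes the geometric-mean normalization, MI / sqrt(H(truth) * H(pred)); the package's MATLAB implementation may normalize differently. ACC, which needs a one-to-one matching between clusters and classes, is not covered here.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(truth, pred):
    """Normalized mutual information with geometric-mean normalization.

    Assumption: NMI = MI / sqrt(H(truth) * H(pred)). Returns 0.0 when
    either labeling has zero entropy (a single class or cluster).
    """
    n = len(truth)
    joint = Counter(zip(truth, pred))
    count_t, count_p = Counter(truth), Counter(pred)
    mi = sum(
        (c / n) * math.log(c * n / (count_t[t] * count_p[p]))
        for (t, p), c in joint.items()
    )
    denom = math.sqrt(entropy(truth) * entropy(pred))
    return mi / denom if denom > 0 else 0.0


if __name__ == "__main__":
    # Same partition with renamed clusters: NMI ignores cluster names.
    print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))
    # Clusters that are independent of the true labels.
    print(nmi([0, 0, 1, 1], [0, 1, 0, 1]))
```

Because NMI compares partitions rather than label values, no mapping between cluster ids and class ids is required.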

References:

[1]. N. Kalchbrenner, E. Grefenstette, P. Blunsom, A convolutional neural network for modelling sentences, ACL, 2014.
[2]. M. Belkin, P. Niyogi, Laplacian eigenmaps and spectral techniques for embedding and clustering, NIPS, 2001.
[3]. D. Cai, X. He, J. Han, Document clustering using locality preserving indexing, IEEE Transactions on Knowledge and Data Engineering, 2005.
[4]. S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, R. A. Harshman, Indexing by latent semantic analysis, Journal of the American Society for Information Science, 1990.
[5]. Q. Le, T. Mikolov, Distributed representations of sentences and documents, ICML, 2014.
[6]. R. Socher, J. Pennington, E. H. Huang, A. Y. Ng, C. D. Manning, Semi-supervised recursive autoencoders for predicting sentiment distributions, EMNLP, 2011.

stc2's People

Contributors

jacoxu

stc2's Issues

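Is the deep feature representation h a vector?

In the paper I read: "The network defines a transformation [formula image] which transforms a raw input text x to an r-dimensional deep representation h."
Does this mean that the final h is an r-dimensional vector?
The hyperparameter section of the paper says that folding is used in both layers of the CNN, that is, twice. What puzzles me is this: shouldn't a 48-row input matrix become a 12-row matrix after two folding operations? How does it end up as the vector h?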
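
The row arithmetic in the question above can be checked with a short Python sketch. In the DCNN of Kalchbrenner et al. [1], folding sums each pair of adjacent rows, so every folding step halves the number of rows: 48, then 24, then 12. The last step shown here, flattening the remaining matrix so that a fully connected layer could map it to r dimensions, is one common way to obtain a vector and is an assumption; this README does not state how STC^2 produces h. The column count of 5 is an arbitrary illustrative value.

```python
def fold(matrix):
    """Sum each pair of adjacent rows, halving the number of rows."""
    if len(matrix) % 2 != 0:
        raise ValueError("folding needs an even number of rows")
    return [
        [a + b for a, b in zip(matrix[i], matrix[i + 1])]
        for i in range(0, len(matrix), 2)
    ]


def flatten(matrix):
    """Concatenate all rows into one vector (assumed final step)."""
    return [value for row in matrix for value in row]


rows, cols = 48, 5  # 48 rows as in the question; 5 columns is illustrative
x = [[1.0] * cols for _ in range(rows)]

once = fold(x)
twice = fold(once)
print(len(x), len(once), len(twice))  # 48 24 12

vector = flatten(twice)
print(len(vector))  # 60
```

Under this reading, two foldings do leave 12 rows, and the result stays a matrix until a later step such as flattening turns it into a vector.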
