Giter VIP home page Giter VIP logo

tsvd's Introduction

Thresholded SVD based K-means for Topic Modeling

Code for the paper:
"A Provable SVD-based Algorithm for Learning Topics in Dominant Admixture Corpus". Trapit Bansal, Chiranjib Bhattacharyya, Ravindran Kannan. In Neural Information Processing Systems (NIPS), 2014.


This file provides useful information for using the code. First we show how to use the demos for running the algorithm on a variety of standard public datasets. To run on your own data, check the following data format, pre-processing steps and the main matlab function for running the algorithm. Provided code has been tested on Matlab R2012a/b on a Linux system.

For bugs/queries/suggestions/thanks feel free to email me at "trapitbansal at gmail dot com"

Running Demos

Open Matlab with the current directory as the code directory, make sure Matlab is configured for internet access and type:
demo()
This will run a demo of the algorithm on the NIPS corpus.

This function can recover topics from a specified public corpus using TSVD. Other available choices for the corpus are the 20-NewsGroup and any corpus on the UCI repository (that is NIPS, ENRON, KOS, NYT, PUBMED). Specify the corpus name as input variable, the possible values are: "nips", "enron", "kos", "20NG" (default is nips). The function downloads corresponding data, processes it, runs TSVD and prints the topics. If you have problems downloading the data from matlab, create a folder called "demo_[corpus]" where [corpus] is the corpus name as above and put the unzipped data file and the vocabulary file in that folder. Then call demo([corpus]). The demo script calls the python code to preprocess the data.
Note: Using "nytimes" or "pubmed" as corpus name is also possible though the datasets are quite big and memory issues may arise, so use these at your own risk (nytimes works with around 4GB memory if you have some small swap space available and close all other running programs, pubmed has not been tested).

Data Format

For the TSVD function, input data should be in a text file in matlab sparse matrix format, that is each line of input text file is:
Doc_id Word_id Count
where Doc_id and Word_id start from 1.
The UCI data is in the same format with 3 addtional header-lines at the start which need to be removed.

Pre-processing Data

It is recommended to preprocess the data before running the algorithm. The provided python code is a helper script to remove words from a standard list of stop-words and truncate vocabulary based on term-frequency. Truncating vocabulary is good for compuational/memory efficiency, and also gives better quality topics. To run the script type:
python process_data.py <inputData> <inputVocab> <stopwords> <vocab> <outputName>

where

  • inputData is in the input data file to be processed in sparse matrix format
  • inputVocab is vocabulary-words file where i-th line contains the i-th indexed word in inputData
  • stopwords is list of stopwords (one per line) - a standard list is provided though you can use your own
  • vocab is the desired vocabulary truncation
  • outputName is the location (including name) of the output processed file (the code will append ".proc.txt" at end to ensure there is no name conflict)
    One can also use UCI format input data for this code, just pass the integer 1 as an additional last argument.

Running TSVD Algorithm

The matlab function signature is:
[M_hat, dominantTopic] = TSVD(infile, outpath, K)
where

  • infile is the path to the data file (in above format),
  • outpath is the output folder name where topic matrix will be written
  • K is the number of topics
  • M_hat is the topic matrix (vocab*K)
  • dominantTopic is the dominant topic assignment for each document Type help TSVD in matlab for more info on additional parametrers

Printing Topics

The matlab function signature is:
print_topics(M_hat, vocabfile, numtop)
where

  • M_hat is the output topic matrix from TSVD (vocab*K)
  • vocabfile is the vocabulary-word file where i-th line contains the i-th indexed word in inputData
  • numtop is the number of top words to print per topic

tsvd's People

Contributors

thetb avatar

Watchers

James Cloos avatar kaveti avatar sridhar bandlamudi avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.