#plsa

This is a tiny PLSA tool written in Java. You can use or modify it as you wish.

1. Compile the project

This project is built with Ant. To recompile it, run "./build.sh" in your terminal.

The ".jar" file is already included in the "dist/lib" directory, so you can also use this tool without re-compiling.

2. How to use this tool

There are only 2 shell scripts under the root folder of this project:

process

plsa

2.1 Usage

2.1.1 process

process pre-processes the raw corpus. It converts the raw corpus files into a sparse matrix file for PLSA training.

Usage

./process -corpusfolder <folder> -stopwordfile <file> -vocfile <file> -matrixfile <file> -lowfreq <int>
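To illustrate what the pre-processing step produces, here is a minimal sketch that turns one document into a "file word:freq ..." line. It is not the tool's actual code: the class `SparseLineSketch`, the tokenization rule (lower-casing and splitting on non-letters), and the alphabetical word order are all assumptions, and low-frequency filtering (-lowfreq) is left out.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical helper, not part of the tool: builds one sparse-matrix line for one document.
final class SparseLineSketch {

    // Assumed tokenization: lower-case the text and split on anything that is not a letter.
    static Map<String, Integer> countWords(String text, Set<String> stopWords) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted keys give a stable output order
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (token.isEmpty() || stopWords.contains(token)) {
                continue; // skip empty fragments and stop words
            }
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    // Formats "<file> word:freq word:freq ..." for one document.
    static String toLine(String fileName, Map<String, Integer> counts) {
        StringBuilder sb = new StringBuilder(fileName);
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            sb.append(' ').append(e.getKey()).append(':').append(e.getValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList("the", "a"));
        Map<String, Integer> counts = countWords("The cat saw a cat.", stop);
        System.out.println(toLine("doc0.txt", counts)); // prints "doc0.txt cat:2 saw:1"
    }
}
```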

2.1.2 plsa

plsa is used for PLSA training.

Usage

./plsa -matrixfile <file> -vocfile <file> -topic <int> -iter <int> -eps <double> -topk <int> -resultfolder <folder>

2.2 Parameters of process and plsa

-corpusfolder: raw corpus folder.

-stopwordfile: stop-word list file.

-vocfile: the file holding the vocabulary generated from the corpus.

-matrixfile: the generated sparse matrix file, used as input for training.

-lowfreq: words with a frequency below this value (< lowfreq) are discarded.

-topic: the number of topics.

-iter: the maximum number of training iterations.

-eps: early-stopping threshold (training stops when the likelihood difference between two consecutive iterations is < eps).

-topk: the number of top words shown for each topic.

-resultfolder: the folder for the training results.
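The interplay of -iter and -eps can be sketched as follows. This is only an illustration of the stopping rule described above, not the tool's training loop: the class `EarlyStopSketch`, the callback `logLikelihoodAfter` (standing in for "run one EM iteration and return the new log-likelihood"), and the use of the absolute difference are assumptions.

```java
import java.util.function.IntToDoubleFunction;

// Hypothetical sketch of the stopping rule; the tool's real training loop is not shown here.
final class EarlyStopSketch {

    // Runs at most maxIter iterations and stops once the log-likelihood
    // changes by less than eps between two consecutive iterations.
    static int train(int maxIter, double eps, IntToDoubleFunction logLikelihoodAfter) {
        double previous = Double.NEGATIVE_INFINITY;
        for (int iter = 1; iter <= maxIter; iter++) {
            double current = logLikelihoodAfter.applyAsDouble(iter);
            if (Math.abs(current - previous) < eps) {
                return iter; // converged: the change is below eps
            }
            previous = current;
        }
        return maxIter; // reached the iteration cap without converging
    }

    public static void main(String[] args) {
        // Toy log-likelihood curve that approaches -100 from below.
        int used = train(50, 1e-3, i -> -100.0 - 1.0 / (i * i));
        System.out.println("iterations used: " + used);
    }
}
```

A smaller eps generally means more iterations before the loop stops, with -iter acting as the hard upper bound.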

3. Input Data Format

Plain English text files.

4. Output Data Format

4.1. Files generated by process

4.1.1 matrixfile is the sparse matrix for the raw corpus

file_0 word:freq word:freq ...

file_1 word:freq word:freq ...

...

file_D word:freq word:freq ...

file_i is the full pathname of the i-th file
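A line in this format could be read back as shown below. This is a minimal sketch, assuming whitespace-separated fields and a file path without spaces; the class `MatrixLineSketch` and its method names are hypothetical and not part of the tool.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical reader for one matrix-file line.
// Assumes whitespace-separated fields and a file path that contains no spaces.
final class MatrixLineSketch {

    // The first field is the full pathname of the document.
    static String fileOf(String line) {
        return line.trim().split("\\s+")[0];
    }

    // The remaining fields are word:freq pairs.
    static Map<String, Integer> countsOf(String line) {
        String[] fields = line.trim().split("\\s+");
        Map<String, Integer> counts = new LinkedHashMap<>(); // keeps the order of the line
        for (int i = 1; i < fields.length; i++) {
            int colon = fields[i].lastIndexOf(':'); // last colon, in case a word contains one
            String word = fields[i].substring(0, colon);
            int freq = Integer.parseInt(fields[i].substring(colon + 1));
            counts.put(word, freq);
        }
        return counts;
    }

    public static void main(String[] args) {
        String line = "/data/a.txt apple:3 pear:1";
        System.out.println(fileOf(line) + " -> " + countsOf(line));
    }
}
```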

4.1.2 vocfile contains the vocabulary extracted from the corpus

word0

word1

...

wordV

word_i is the i-th word of the vocabulary

4.2. Files generated by plsa

4.2.1 p_z is for p(z)

prob_z0 prob_z1 ... prob_zK

4.2.2 p_d_z is for p(d|z)

p(d0|z0) p(d0|z1)... p(d0|zK)

p(d1|z0) p(d1|z1)... p(d1|zK)

...

p(dD|z0) p(dD|z1)... p(dD|zK)

4.2.3 p_w_z is for p(w|z)

p(w0|z0) p(w0|z1)... p(w0|zK)

p(w1|z0) p(w1|z1)... p(w1|zK)

...

p(wV|z0) p(wV|z1)... p(wV|zK)

4.2.4 docTopics is for p(z|d)

p(z0|d0) p(z1|d0)... p(zK|d0)

p(z0|d1) p(z1|d1)... p(zK|d1)

...

p(z0|dD) p(z1|dD)... p(zK|dD)
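By Bayes' rule, p(z|d) is proportional to p(z) p(d|z), so a row of docTopics can be related to p_z and the matching row of p_d_z. The sketch below shows that relation. Whether the tool computes docTopics exactly this way is an assumption, and the class `DocTopicSketch` and its methods are hypothetical.

```java
// Hypothetical sketch relating p_z and one row of p_d_z to one row of docTopics.
final class DocTopicSketch {

    // pz[k] = p(z_k); pdGivenZ[k] = p(d | z_k) for one fixed document d.
    // Returns p(z_k | d) = p(z_k) p(d | z_k) / sum_j p(z_j) p(d | z_j).
    // Assumes the normalizing sum is greater than zero.
    static double[] posterior(double[] pz, double[] pdGivenZ) {
        double[] out = new double[pz.length];
        double norm = 0.0;
        for (int k = 0; k < pz.length; k++) {
            out[k] = pz[k] * pdGivenZ[k];
            norm += out[k];
        }
        for (int k = 0; k < out.length; k++) {
            out[k] /= norm;
        }
        return out;
    }

    // Index of the most probable topic in one docTopics row.
    static int argMax(double[] row) {
        int best = 0;
        for (int k = 1; k < row.length; k++) {
            if (row[k] > row[best]) {
                best = k;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] row = posterior(new double[] {0.5, 0.5}, new double[] {0.2, 0.6});
        System.out.println("dominant topic: " + argMax(row));
    }
}
```

Taking the arg-max of a docTopics row, as `argMax` does, is one simple way to assign each document a single dominant topic.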

4.2.5 wordTopics is for p(z|w)

p(z0|w0) p(z1|w0)... p(zK|w0)

p(z0|w1) p(z1|w1)... p(zK|w1)

...

p(z0|wV) p(z1|wV)... p(zK|wV)

4.2.6 topicTopKWords is for top K words of each topic

[Topic0]

[word0]

[word1]

...
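A list like this can be derived from p_w_z by sorting the words on one topic's column. The sketch below is one way to do it, assuming the p_w_z layout of one row per word and one column per topic; the class `TopWordsSketch` and its `topK` method are hypothetical and not taken from the tool.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: pick the K most probable words of one topic.
final class TopWordsSketch {

    // pwz[w][k] = p(w | z_k), laid out like p_w_z: one row per word, one column per topic.
    static List<String> topK(String[] vocab, double[][] pwz, int topic, int k) {
        List<Integer> ids = new ArrayList<>();
        for (int w = 0; w < vocab.length; w++) {
            ids.add(w);
        }
        // Sort word indices by descending probability under the chosen topic.
        ids.sort(Comparator.comparingDouble((Integer w) -> pwz[w][topic]).reversed());
        List<String> out = new ArrayList<>();
        for (int i = 0; i < Math.min(k, ids.size()); i++) {
            out.add(vocab[ids.get(i)]);
        }
        return out;
    }

    public static void main(String[] args) {
        String[] vocab = {"a", "b", "c"};
        double[][] pwz = {{0.1, 0.7}, {0.6, 0.2}, {0.3, 0.1}};
        System.out.println(topK(vocab, pwz, 0, 2)); // prints "[b, c]"
    }
}
```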
