integrating_topics_and_syntax

Implementation of the paper - "Integrating Topics and Syntax"

Link to the paper

Installation

Set up an environment with conda or pip and install the packages below:

  • python=3.10
  • ipykernel
  • jupyter
  • scikit-learn
  • spacy
  • pandas
  • matplotlib
  • nltk
  • gensim
  • tqdm

Setting up the dataset

Preprocessing

python preprocess_data.py
  --size <num-of-docs>
  --dataset <options - news/nips>

The command above preprocesses the dataset and writes the vocabulary and the documents (each as a list of token ids) into a folder named {dataset}_{num-docs}.
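As a rough illustration of what "documents as lists of token ids" means, here is a minimal sketch. It is not the code in preprocess_data.py: the function names, the file names vocab.txt and docs.txt, and the one-document-per-line layout are all assumptions made for the example.

```python
import os
import tempfile


def build_vocab(tokenized_docs):
    """Map each distinct token to an integer id, in order of first appearance."""
    vocab = {}
    for doc in tokenized_docs:
        for token in doc:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


def encode_docs(tokenized_docs, vocab):
    """Replace every token with its id from the vocabulary."""
    return [[vocab[token] for token in doc] for doc in tokenized_docs]


def write_dataset(root, dataset, tokenized_docs):
    """Write the vocab and the encoded documents to <root>/{dataset}_{num-docs}."""
    vocab = build_vocab(tokenized_docs)
    encoded = encode_docs(tokenized_docs, vocab)
    out_dir = os.path.join(root, f"{dataset}_{len(tokenized_docs)}")
    os.makedirs(out_dir, exist_ok=True)
    # Hypothetical file names and layout: one token per line in vocab.txt,
    # one document (space-separated ids) per line in docs.txt.
    with open(os.path.join(out_dir, "vocab.txt"), "w", encoding="utf-8") as f:
        f.write("\n".join(vocab) + "\n")
    with open(os.path.join(out_dir, "docs.txt"), "w", encoding="utf-8") as f:
        for ids in encoded:
            f.write(" ".join(map(str, ids)) + "\n")
    return out_dir, vocab, encoded


docs = [["topics", "and", "syntax"], ["syntax", "and", "semantics"]]
out_dir, vocab, encoded = write_dataset(tempfile.mkdtemp(), "news", docs)
print(os.path.basename(out_dir), encoded)
```

With two toy documents and the dataset name news, the folder is named news_2, matching the {dataset}_{num-docs} pattern.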

Training the model

python main.py
  --alpha <symmetric Dirichlet parameter of the document-specific topic distributions>
  --beta <Dirichlet parameter of the topic-specific word distributions>
  --delta <Dirichlet parameter of the class-specific word distributions>
  --gamma <Dirichlet parameter of the class transition distributions>
  --num_iter <iterations of gibbs sampling>
  --num_topics <T>
  --num_classes <C>
  --dataset <path-to-preprocessed-dataset>

The model output files are written to the folder out/{alpha}_{beta}_{gamma}_{delta}_{num_topics}_{num_classes}_{num_iterations}_{dataset}.
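To locate the results of a particular run, the folder name can be rebuilt from the same arguments. The helper below is a hypothetical sketch, not part of main.py. It assumes that each value appears exactly as Python's str() renders it and that only the last component of the dataset path is used; check both against an actual run.

```python
from pathlib import PurePosixPath


def output_dir(alpha, beta, gamma, delta, num_topics, num_classes, num_iter, dataset):
    """Rebuild the out/ folder name from the training arguments (assumed format)."""
    # Assumption: only the final path component of the dataset is used in the name.
    dataset_name = PurePosixPath(dataset).name
    parts = [alpha, beta, gamma, delta, num_topics, num_classes, num_iter, dataset_name]
    return PurePosixPath("out") / "_".join(str(p) for p in parts)


run_dir = output_dir(0.1, 0.01, 0.1, 0.1, 20, 10, 1000, "data/news_2000")
print(run_dir)
```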

Evaluation

Document classification

To run document classification on the newsgroup dataset:

python doc_classifier.py
  --theta_file <path-to-theta.txt>
  --skip_indices_file <path-to-skipped_indices.txt>
  --train_test_split <split-fraction>

Here theta.txt contains the document-topic counts, and skipped_indices.txt contains the indices of the documents that were skipped when training the model.
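The two files have to be aligned before a classifier can be trained: skipped documents have no topic counts, so their labels must be dropped. The sketch below shows one way to do that. It assumes that theta.txt holds one line of whitespace-separated counts per kept document and that the skipped indices refer to positions in the original label list; neither format is confirmed by doc_classifier.py, and all function names are hypothetical.

```python
def parse_theta(lines):
    """Parse one document per line of whitespace-separated topic counts."""
    return [[float(x) for x in line.split()] for line in lines if line.strip()]


def normalize(counts):
    """Turn a row of topic counts into proportions; all-zero rows become uniform."""
    total = sum(counts)
    if total == 0:
        return [1.0 / len(counts)] * len(counts)
    return [c / total for c in counts]


def drop_skipped(labels, skipped_indices):
    """Remove the labels of documents that were skipped during training."""
    skipped = set(skipped_indices)
    return [y for i, y in enumerate(labels) if i not in skipped]


# Toy stand-ins for the file contents and the original newsgroup labels.
theta_lines = ["3 1 0", "0 2 2", "5 0 5"]
all_labels = ["sci", "rec", "talk", "sci"]
skipped = [2]

features = [normalize(row) for row in parse_theta(theta_lines)]
labels = drop_skipped(all_labels, skipped)
print(features, labels)
```

The resulting feature rows and labels have the same length and could then be split by the --train_test_split fraction and passed to any classifier, for example one from scikit-learn.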

Topic Coherence score

Creates a plot of the topic coherence score against iterations, calculated using gensim. You must supply the correct directory containing the phi_z.txt file.

python metrics.py
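For intuition about what a coherence score measures, here is a standalone sketch of one common measure, UMass coherence, written in plain Python. It is not gensim's implementation, and the README does not say which coherence measure metrics.py uses; the function name and the toy corpus are made up for the example.

```python
import math


def umass_coherence(top_words, documents):
    """UMass-style coherence of one topic's ranked top words over tokenized documents."""
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain all of the given words.
        return sum(1 for s in doc_sets if all(w in s for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            denom = doc_freq(top_words[l])
            if denom == 0:
                continue  # the conditioning word never occurs; skip the pair
            # Add-one smoothing keeps the logarithm finite when the pair never co-occurs.
            score += math.log((doc_freq(top_words[m], top_words[l]) + 1) / denom)
    return score


documents = [
    ["model", "topic", "word"],
    ["topic", "word"],
    ["syntax", "class"],
    ["model", "syntax"],
]
coherent = umass_coherence(["topic", "word"], documents)
incoherent = umass_coherence(["topic", "class"], documents)
print(coherent, incoherent)
```

Words that tend to appear in the same documents score higher than words that never co-occur, which is why the score is useful for tracking topic quality across iterations.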

Pure LDA

Trains an LDA model with gensim on the supplied data and creates a plot of coherence against the number of topics.

python pure_lda.py

Contributors

mikalatte, abhijithasokan, anbilly19, lkamilla
