MetaLDA

MetaLDA is a topic model that leverages either document or word meta information, or both of them jointly [1].

Key features:

  1. Incorporates both document and word meta information in binary format
  2. Implemented on top of Mallet in Java
    • Works with Mallet input format
    • Runs efficiently (uses bit-coding and the SparseLDA framework)
    • Runs with multiple threads (uses the DistributedLDA framework)

Run MetaLDA

  1. Requirements: JDK 1.8 or later and Maven. Other dependencies will be downloaded by Maven.
  2. Clone the repository or download the code
  3. Compile the code with Maven
    • cd <metalda_location>
    • mvn package
  4. Prepare documents

    • All documents (training/testing) are in Mallet's LabeledLDA format.
    • If the input documents are already in Mallet format, they can be fed directly into the model.
    • Otherwise, the documents must first be converted into Mallet format.
      • Each raw document should be in the following format:
        DOC_ID\tLABEL1\sLABEL2\sLABEL3\tWORD1\sWORD2\sWORD3\n.
      • Install Mallet then use:
        <mallet_location>/bin/mallet import-file --input <training/testing_doc_location> --output <training/testing_doc_mallet_location> --label-as-features --keep-sequence --line-regex '([^\t]+)\t([^\t]+)\t(.*)'
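Before running import-file, it can help to check that every raw line really matches the --line-regex shown above. The following is a minimal Python sketch, assuming the raw file is plain text with real tab characters between the three fields. The function names and sample lines are illustrative only and are not part of MetaLDA or Mallet.

```python
import re

# Same pattern as the --line-regex argument: ID, labels, words, separated by tabs.
LINE_REGEX = re.compile(r"([^\t]+)\t([^\t]+)\t(.*)")


def format_document(doc_id, labels, words):
    """Build one raw line: DOC_ID<TAB>space-separated labels<TAB>space-separated words."""
    return "\t".join([doc_id, " ".join(labels), " ".join(words)])


def find_bad_lines(lines):
    """Return the 1-based numbers of the lines that do not match the pattern."""
    bad = []
    for number, line in enumerate(lines, start=1):
        if LINE_REGEX.search(line.rstrip("\n")) is None:
            bad.append(number)
    return bad


if __name__ == "__main__":
    sample = [
        format_document("doc1", ["sports", "news"], ["team", "wins", "match"]),
        "doc2 sports team wins",  # spaces instead of tabs, so it does not match
    ]
    print(find_bad_lines(sample))
```

A line reported here usually has spaces where tabs are expected, or an empty label field.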
  5. Prepare word features

    • MetaLDA uses the following sparse representation of binary word features:
      WORD\tNNZ_INDEX1\sNNZ_INDEX2\sNNZ_INDEX3
    • Use embeddings as word features
      • MetaLDA offers a function to binarise and convert word embeddings into the required word feature format. The raw input word embeddings are expected to follow the format of GloVe:
        WORD\sEMBEDDING1\sEMBEDDING2\sEMBEDDING3
      • To binarise and convert the raw word embeddings, in the root folder of MetaLDA, use:
        java -cp ./target/metalda-0.1-jar-with-dependencies.jar topicmodels.BinariseWordEmbeddings --train-docs <training_doc_mallet_location> --test-docs <testing_doc_mallet_location> --input <raw_embedding_location> --output <binary_embedding_location>
      • The function first reads the vocabularies of the training and testing documents (both in Mallet format), then binarises the embeddings of the vocabulary words found in the word embedding file, and finally saves the binarised embeddings in the required format. Note that MetaLDA does not require every word in the training and testing documents to have an embedding.
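The sparse word feature format described above can be read back with a few lines of Python, for example to check how many vocabulary words actually have features. This is a sketch only, assuming the indices after the tab are whitespace-separated integers. The function names and sample data are hypothetical, and this is not MetaLDA's own loader.

```python
def parse_feature_line(line):
    """Split 'WORD<TAB>idx idx idx' into (word, [indices])."""
    word, _, rest = line.rstrip("\n").partition("\t")
    indices = [int(token) for token in rest.split()]
    return word, indices


def format_feature_line(word, indices):
    """Inverse of parse_feature_line: write one line in the sparse format."""
    return word + "\t" + " ".join(str(i) for i in indices)


def coverage(vocabulary, features):
    """Fraction of vocabulary words that have an entry in the feature table."""
    if not vocabulary:
        return 0.0
    covered = sum(1 for word in vocabulary if word in features)
    return covered / len(vocabulary)


if __name__ == "__main__":
    lines = ["apple\t3 17 42", "banana\t5 17"]  # made-up example rows
    features = dict(parse_feature_line(line) for line in lines)
    print(features["apple"])
    print(coverage(["apple", "banana", "cherry"], features))
```

Because words without embeddings are allowed, a coverage below 1.0 is not an error in itself.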
  6. Train MetaLDA
    java -cp ./target/metalda-0.1-jar-with-dependencies.jar topicmodels.MetaLDATrain --train-docs <training_doc_mallet_location> --num-topics <num_topic> --word-features <binary_embedding_location> --save-folder <save_folder> --sample-alpha-method <sample_alpha_method> --sample-beta-method <sample_beta_method>

  • <sample_alpha_method>:
    • 0: fixed on initial value
    • 1: alpha is a full matrix sampled with doc labels
    • 2: alpha is sampled as an asymmetric vector over topics
    • 3: alpha is sampled as a single value
  • <sample_beta_method>:
    • 0: fixed on initial value
    • 1: beta is a full matrix sampled with word features
    • 2: beta is sampled as an asymmetric vector over topics
    • 3: beta is sampled as a single value
  • For details of the arguments, use:
    java -cp ./target/metalda-0.1-jar-with-dependencies.jar topicmodels.MetaLDATrain --help
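When training is launched from a script, the command above can be assembled programmatically so that the two sampling options are validated before Java starts. The sketch below is a hypothetical Python wrapper, not something shipped with MetaLDA. It only uses the flags listed above and assumes the jar sits at the path shown in the commands.

```python
import shlex

# Path taken from the commands above; adjust if the jar is built elsewhere.
JAR = "./target/metalda-0.1-jar-with-dependencies.jar"
VALID_METHODS = (0, 1, 2, 3)


def build_train_command(train_docs, num_topics, word_features, save_folder,
                        sample_alpha_method, sample_beta_method):
    """Return the training command as an argument list (hypothetical helper)."""
    for name, value in (("sample_alpha_method", sample_alpha_method),
                        ("sample_beta_method", sample_beta_method)):
        if value not in VALID_METHODS:
            raise ValueError(f"{name} must be one of {VALID_METHODS}, got {value!r}")
    return [
        "java", "-cp", JAR, "topicmodels.MetaLDATrain",
        "--train-docs", train_docs,
        "--num-topics", str(num_topics),
        "--word-features", word_features,
        "--save-folder", save_folder,
        "--sample-alpha-method", str(sample_alpha_method),
        "--sample-beta-method", str(sample_beta_method),
    ]


if __name__ == "__main__":
    command = build_train_command("train.mallet", 50, "features.txt", "out", 1, 1)
    # Print a shell-safe version of the command instead of running it.
    print(" ".join(shlex.quote(part) for part in command))
```

The returned list can be passed to subprocess.run(command, check=True) to start training.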
  7. Access the files saved in the training phase
    In the training phase, MetaLDA saves the following files in <save_folder>:
    • top_words.txt:
      the top 50 words with the largest weights (phi) in each topic (the number of top words can be changed)
    • train_alphabet.txt:
      the vocabulary of the training documents; the order of the words matches the index of phi
    • train_target_alphabet.txt:
      the vocabulary of the labels in the training documents; the order of the labels matches the index of lambda
    • train_stats.mat:
      a MATLAB MAT-file that stores the training statistics. matfilerw for Java and scipy.io for Python are good tools for reading MAT-files. MATLAB itself is not required, although it can load MAT-files directly.
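For Python users, the saved statistics can be inspected with scipy.io.loadmat. The variable names stored in the MAT-file are not listed here, so this sketch makes no assumption about them and simply prints whatever it finds. The helper names are illustrative, and SciPy must be installed for load_stats to work.

```python
import sys


def user_variables(mat_dict):
    """Drop the metadata entries that loadmat adds (keys starting with '__')."""
    return {name: value for name, value in mat_dict.items()
            if not name.startswith("__")}


def load_stats(path):
    """Load a MAT-file and return only the stored variables."""
    from scipy.io import loadmat  # imported here so user_variables works without SciPy
    return user_variables(loadmat(path))


if __name__ == "__main__":
    # Usage: python inspect_stats.py <save_folder>/train_stats.mat
    if len(sys.argv) > 1:
        for name, value in load_stats(sys.argv[1]).items():
            print(name, getattr(value, "shape", None))
```

Printing the shapes first is a quick way to see which arrays correspond to phi and lambda, using the alphabet files to map indices back to words and labels.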
  8. Inference on the testing documents
    MetaLDA offers two kinds of inference:
  • Ignore the words that exist in the testing documents but not in the training documents
    java -cp ./target/metalda-0.1-jar-with-dependencies.jar topicmodels.MetaLDAInfer --test-docs <testing_doc_mallet_location> --save-folder <save_folder> --compute-perplexity true
    • <save_folder>: the same folder where the files were saved in the training phase
    • --compute-perplexity
      • true: MetaLDA will use one half of each testing document (every first words) to sample its document-topic distribution (theta) and the other half (every second words) to compute perplexity.
      • false: MetaLDA will use all the content of each testing document to sample its document-topic distribution (theta). Perplexity will not be computed.
  • Consider the words that exist in the testing documents but not in the training documents
    java -cp ./target/metalda-0.1-jar-with-dependencies.jar topicmodels.MetaLDAInferUnseen --test-docs <testing_doc_mallet_location> --save-folder <save_folder> --compute-perplexity true --word-features <binary_embedding_location>
  9. Access the files saved in the inference phase
  • If MetaLDAInfer is used, MetaLDA will save the testing statistics into 'test_stats.mat' in <save_folder>
  • If MetaLDAInferUnseen is used, MetaLDA will save the testing statistics into 'test_stats_unseen.mat' in <save_folder>
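The --compute-perplexity option in the inference step splits each testing document into two halves. The Python sketch below illustrates that idea under two assumptions that the text above does not confirm: that "every first words" and "every second words" mean alternating word positions, and that perplexity is the usual exponential of the negative mean log-likelihood under theta and phi. All names are illustrative and this is not MetaLDA's code.

```python
import math


def split_alternating(words):
    """Positions 0, 2, 4, ... form the first half; positions 1, 3, 5, ... the second."""
    return words[0::2], words[1::2]


def perplexity(held_out, theta, phi):
    """Compute exp(-mean log p(w)) with p(w) = sum_k theta[k] * phi[k][w].

    theta: topic proportions of one document (list of floats).
    phi:   one dict per topic, mapping word -> probability.
    Every held-out word needs a non-zero probability under the mixture.
    """
    if not held_out:
        raise ValueError("held_out must contain at least one word")
    log_likelihood = 0.0
    for word in held_out:
        p = sum(theta[k] * phi[k].get(word, 0.0) for k in range(len(theta)))
        if p <= 0.0:
            raise ValueError(f"word {word!r} has zero probability")
        log_likelihood += math.log(p)
    return math.exp(-log_likelihood / len(held_out))


if __name__ == "__main__":
    document = ["team", "wins", "match", "today"]
    for_theta, for_perplexity = split_alternating(document)
    theta = [0.5, 0.5]  # would be sampled from for_theta; fixed here for illustration
    phi = [{"wins": 0.5, "today": 0.5}, {"wins": 0.5, "today": 0.5}]
    print(for_theta, for_perplexity)
    print(perplexity(for_perplexity, theta, phi))
```

In this toy setup every held-out word has probability 0.5, so the perplexity is close to 2, which matches the intuition of choosing between two equally likely words.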

Demo

To wrap up the above steps, a demo bash script is offered. To run the demo, simply go to the demo folder and run ./demo.sh. For the demo, the WS dataset used in the paper is included.

References

[1] H. Zhao, L. Du, W. Buntine, G. Liu, "MetaLDA: a Topic Model that Efficiently Incorporates Meta information", International Conference on Data Mining (ICDM) 2017. Available on arXiv.

If you find any bugs, please contact [email protected].

Contributors

dulann, ethanhezhao, wbuntine

metalda's Issues

Error in the second inference mode when running demo.sh with my own data

Hello, my name is Lee. I followed the steps in the README, but when I replaced the training, testing, and raw embedding data in the demo with my own data, the following error occurred.

Exception in thread "main" java.lang.IllegalStateException: Topic Inferencer: New topic not sampled.

The above problem arises when considering the words that exist in the testing documents but not in the training documents. The detailed error message is as follows:

Couldn't open cc.mallet.util.MalletLogger resources/logging.properties file.
Perhaps the 'resources' directories weren't copied into the 'class' directory.
Continuing.
java.lang.reflect.InaccessibleObjectException: Unable to make public jdk.internal.ref.Cleaner java.nio.DirectByteBuffer.cleaner() accessible: module java.base does not "opens java.nio" to unnamed module @2c6a3f77
at java.base/java.lang.reflect.AccessibleObject.checkCanSetAccessible(AccessibleObject.java:354)
at java.base/java.lang.reflect.AccessibleObject.checkCanSetAccessible(AccessibleObject.java:297)
at java.base/java.lang.reflect.Method.checkCanSetAccessible(Method.java:199)
at java.base/java.lang.reflect.Method.setAccessible(Method.java:193)
at com.jmatio.io.MatFileReader$1.run(MatFileReader.java:374)
at java.base/java.security.AccessController.doPrivileged(AccessController.java:318)
at com.jmatio.io.MatFileReader.clean(MatFileReader.java:366)
at com.jmatio.io.MatFileReader.read(MatFileReader.java:321)
at com.jmatio.io.MatFileReader.<init>(MatFileReader.java:154)
at com.jmatio.io.MatFileReader.<init>(MatFileReader.java:102)
at topicmodels.MetaLDAInferUnseen.main(MetaLDAInferUnseen.java:673)
Exception in thread "main" java.lang.IllegalStateException: Topic Inferencer: New topic not sampled.
at topicmodels.MetaLDAInferUnseen.getSampledDistribution(MetaLDAInferUnseen.java:266)
at topicmodels.MetaLDAInferUnseen.getInferredDistributions(MetaLDAInferUnseen.java:482)
at topicmodels.MetaLDAInferUnseen.main(MetaLDAInferUnseen.java:791)
inference with unseen words finished ...

When ignoring the words that exist in the testing documents but not in the training documents, it runs fine. In the second inference mode, however, the error above is reported. Could you please help me?
Thanks very much!

Trouble converting .txt to .mallet format

Hello, my name is Apc. I followed the steps in the README, but when I enter this in cmd:

<mallet_location>/bin/mallet import-file --input train_doc.txt --output train_doc.mallet --label-as-features --keep-sequence --line-regex '([^\t]+)\t([^\t]+)\t(.*)'

It always fails with: Exception in thread "main" java.lang.IllegalStateException: Line #1 does not match regex.
The dataset I tried is data/WS/train_doc.txt from this project, but it doesn't work. Have you encountered this problem?
