lda2vec-pytorch's Introduction

lda2vec: Marriage of word2vec and LDA

lda2vec

The lda2vec model tries to mix the best parts of word2vec and LDA into a single framework. word2vec captures powerful relationships between words, but the resulting vectors are largely uninterpretable and don't represent documents. LDA, on the other hand, is quite interpretable by humans, but doesn't model local word relationships the way word2vec does. The model learns both word and document topics, keeps them interpretable, supports topics over clients, times, and documents, and allows the topics to be supervised.
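As a brief sketch of how the two parts are combined in the lda2vec paper (the symbols below follow the paper's notation and do not appear elsewhere in this README): each document $j$ has a vector $\vec{d}_j$ that is a weighted mixture of shared topic vectors $\vec{t}_k$, and the context vector used to predict nearby words is the sum of the pivot word vector $\vec{w}_j$ and that document vector.

```latex
\vec{d}_j = \sum_{k} p_{jk}\,\vec{t}_k,
\qquad p_{jk} \ge 0,\quad \sum_{k} p_{jk} = 1,
\qquad
\vec{c}_j = \vec{w}_j + \vec{d}_j
```

The mixture weights $p_{jk}$ are what make a document interpretable: they play the role of LDA's per-document topic proportions.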

This repo is a PyTorch implementation of Moody's lda2vec (originally implemented in Chainer), a topic-modeling method that uses word embeddings.
The original paper: Mixing Dirichlet Topic Models and Word Embeddings to Make lda2vec.

Warning:

As the author said, lda2vec is a big series of experiments. It is still research software. However, the author doesn't want to discourage experimentation!

I use vanilla LDA to initialize lda2vec (topic assignments for each document). This differs from the original paper. Without this initial topic assignment, the results are bad.
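A minimal sketch of what such an initialization could look like, assuming the document weights are unnormalized logits that are later passed through a softmax (as the exploration snippet in the issues section does with `doc_weights`). The helper name `init_doc_weights`, the `eps` smoothing, and the sample probabilities are hypothetical and are not taken from this repo.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def init_doc_weights(doc_topic_probs, eps=1e-8):
    """Turn each document's LDA topic distribution into logits.

    Using log(p + eps) means softmax(logits) reproduces p up to the
    tiny eps smoothing, so training starts from the LDA assignment.
    """
    return [[math.log(p + eps) for p in row] for row in doc_topic_probs]


# Hypothetical LDA output: 2 documents, 3 topics (each row sums to 1).
lda_probs = [[0.7, 0.2, 0.1], [0.05, 0.05, 0.9]]
doc_weights = init_doc_weights(lda_probs)

# Applying softmax to the initial weights gives back the LDA proportions.
recovered = [softmax(row) for row in doc_weights]
```

In a PyTorch model, a matrix like `doc_weights` could be copied into the document embedding layer before training starts; how this repo does it exactly is not stated in the README.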

How to use it

  1. There is one important hyperparameter, N_TOPICS, in prepare.py. Set it to the number of topics you want.
  2. Run python main.py to train.
  3. Go to explore/
  4. Run explore_model.ipynb to explore a trained model.

When you want to change N_TOPICS

  1. Remove all files in ./npy: rm -f npy/*
  2. Remove all files in ./checkpoint: rm -f checkpoint/*
  3. Run python main.py to train.
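The cleanup steps above can also be sketched in Python, which may help on systems without a POSIX shell. The function names `clear_dir` and `reset_artifacts` are hypothetical. Like the `npy/*` glob, the sketch skips dotfiles and does not descend into subdirectories.

```python
from pathlib import Path


def clear_dir(directory):
    """Delete visible regular files directly inside `directory`.

    Returns the number of files removed; a missing directory counts as 0.
    """
    path = Path(directory)
    if not path.is_dir():
        return 0
    removed = 0
    for entry in path.iterdir():
        # Mirror `rm -f dir/*`: the glob does not match dotfiles,
        # and plain rm does not remove directories.
        if entry.is_file() and not entry.name.startswith("."):
            entry.unlink()
            removed += 1
    return removed


def reset_artifacts(root="."):
    """Clear the cached arrays and checkpoints before retraining."""
    root = Path(root)
    return {name: clear_dir(root / name) for name in ("npy", "checkpoint")}
```

Calling `reset_artifacts()` from the repository root and then running `python main.py` corresponds to the three steps listed above.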

Training dataset description

I use the 20newsgroups dataset from sklearn.datasets.

from sklearn.datasets import fetch_20newsgroups

dataset = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))

To stem all tokens in the dataset, I use an Elasticsearch analyzer. Note that the minimal_english stemmer is used here.

PUT l2v_analyzer_index
{
  "settings" : {
      "index" : {
        "analysis" : {
          "filter" : {
            "english_stemmer" : {
              "type" : "stemmer",
              "language" : "minimal_english"
            },
            "english_stop" : {
              "type" : "stop",
              "stopwords" : "_english_"
            }
          },
          "analyzer" : {
            "rebuilt_english" : {
              "filter" : [
                "lowercase",
                "english_stop",
                "english_stemmer"
              ],
              "tokenizer" : "standard"
            }
          }
        }
      }
    }
}

After the 20newsgroups dataset from sklearn is passed through the analyzer above, the contents are stored in data/newsgroups_texts.json.
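A hedged sketch of how a document could be sent through this analyzer using Elasticsearch's `_analyze` endpoint and only the Python standard library. The host `http://localhost:9200` and the helper names are assumptions; the README does not show the script that was actually used.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"   # assumption: a local Elasticsearch node
INDEX = "l2v_analyzer_index"       # index created by the PUT request above
ANALYZER = "rebuilt_english"       # analyzer defined in that index


def build_analyze_request(text, es_url=ES_URL):
    """Build a POST request for the index's _analyze endpoint."""
    body = json.dumps({"analyzer": ANALYZER, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{es_url}/{INDEX}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_tokens(response_json):
    """Pull the token strings out of an _analyze response body."""
    return [t["token"] for t in response_json.get("tokens", [])]


def analyze(text, es_url=ES_URL):
    """Send `text` to Elasticsearch and return the analyzed tokens."""
    with urllib.request.urlopen(build_analyze_request(text, es_url)) as resp:
        return extract_tokens(json.load(resp))
```

With a running node, `analyze(doc)` could be applied to every document returned by `fetch_20newsgroups` and the resulting token lists written out with `json.dump`.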

Exploration of a trained model using the 2017 Naver News dataset (total 14GB)

To explore the model trained on 14GB of text data, see explore/explore_2017navernews.ipynb. This notebook file is read-only.

lda2vec-pytorch's People

Contributors

sebkim

Forkers

matthewtoschi

lda2vec-pytorch's Issues

IndexError: list index out of range

When I run this code from explore_model.ipynb:
`for each, epoch in model_list2[:-1:(len(model_list2)//10)+1][:2] + [model_list2[-1]]:
    state = torch.load(f'{each}', map_location=lambda storage, loc: storage)
    doc_weights = state['doc_weights.weight'].cpu().clone().numpy()

    # distribution over the topics for each document
    topic_dist = softmax(doc_weights)

    # distribution of nonzero probabilities
    dist = topic_dist.reshape(-1)
    plt.hist(dist[dist > 0.01], bins=40);
    plt.title(f'epoch: {epoch}')
    plt.show()`

I get this error:
`IndexError Traceback (most recent call last)
in
----> 1 for each, epoch in model_list2[:-1:(len(model_list2)//10)+1][:2] + [model_list2[-1]]:
      2     state = torch.load(f'{each}', map_location=lambda storage, loc: storage)

IndexError: list index out of range`
How can I fix this?
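One possible cause, offered as an assumption since the issue does not show how model_list2 is built: slicing never raises IndexError, so on that line only `model_list2[-1]` can fail, and it does so when the list is empty (for example, when no checkpoint files were found). The sketch below wraps the same selection logic in a hypothetical helper, `select_checkpoints`, that returns an empty list in that case.

```python
def select_checkpoints(model_list):
    """Pick up to two early checkpoints plus the final one.

    Uses the same slicing as the notebook loop, but returns an empty
    list instead of raising IndexError when there are no checkpoints.
    """
    if not model_list:
        return []
    step = len(model_list) // 10 + 1
    return model_list[:-1:step][:2] + [model_list[-1]]


# Hypothetical (path, epoch) pairs standing in for model_list2.
checkpoints = [(f"checkpoint/epoch_{i}.pt", i) for i in range(25)]
selected = select_checkpoints(checkpoints)
```

If the helper returns an empty list, it may be worth checking that training has written files to ./checkpoint and that the notebook is looking in the right directory.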

AttributeError: 'list' object has no attribute 'seek'

Hello, I trained on your data, but when I try to test the model I get: AttributeError: 'list' object has no attribute 'seek'. You can only torch.load from a file that is seekable. Please pre-load the data into a buffer like io.BytesIO and try to load from it instead.
How can I fix this error?

Training lda2vec with my own data

I want to train this model on my own data, but I don't know how to do it. Which modules do I have to modify? Can anybody help? It's urgent.
