doc2vec's Issues

Forked version of gensim

I understand that in order to use pretrained word embeddings to train the doc2vec models, we should install your forked version of gensim. However, I have failed to configure it properly with a C compiler (MinGW in my case; it could also be a BLAS issue). I'm on Windows and am using Anaconda.

I tried installing straight from setup.py, using pip, and even by creating a conda package and installing that. Each time, the forked version of gensim installed successfully.

However, every time I ran the following commands in a Python shell:

>>> import gensim
>>> gensim.models.word2vec.FAST_VERSION

...the output was -1, which means training would be very, very slow (about 70 times slower, IIRC).

How do I install your version and still retain the link to the C compiler?
(If I install the currently distributed version using conda install gensim, it is linked to my MinGW.)
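
For reference, a quick sanity check that the compiled extension is actually in use (a sketch based on the FAST_VERSION check above):

import gensim.models.word2vec as w2v

# FAST_VERSION >= 0 means the compiled (Cython/C) training routines loaded;
# -1 means the pure-Python fallback, roughly 70 times slower.
if w2v.FAST_VERSION == -1:
    raise RuntimeError("gensim C extension not in use; reinstall with a working C compiler")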

Pretrained Embedding, TypeError: don't know how to handle uri

@jhlau Hey Jey, I hope you're doing well. This is Gan. I was trying to use your forked version to load a pre-trained word vector file, wiki-news-300d-1M.vec, from https://fasttext.cc/docs/en/english-vectors.html; however, I'm getting the error TypeError: don't know how to handle uri, which I think comes from the smart_open function. I'm training on a very small corpus, so I think it may be better to initialize with the pre-trained vectors.
Following is the code:

vector_size = 100
window_size = 10
min_count = 1
sampling_threshold = 1e-5
negative_size = 5
train_epoch = 50
dm = 0 #0 = dbow; 1 = dmpv
worker_count = 4 #number of parallel processes

wordvec = "/Users/ggao/Downloads/wiki-news-300d-1M.vec" 
import io

def load_vectors(fname):
    fin = io.open(fname, 'r', encoding='utf-8', newline='\n', errors='ignore')
    n, d = map(int, fin.readline().split())
    data = {}
    for line in fin:
        tokens = line.rstrip().split(' ')
        data[tokens[0]] = map(float, tokens[1:])
    return data

pretrained_emb = load_vectors(wordvec)
pre_model = gensim.models.doc2vec.Doc2Vec(documents=train_corpus, dm=dm, size=vector_size, window=window_size, min_count=min_count, sample=sampling_threshold, negative=negative_size, workers=worker_count, pretrained_emb=pretrained_emb, iter=train_epoch)

Do you have any idea what's wrong here?
Thank you so much and look forward to your reply.

Best,
Gan
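
One likely cause, judging from the smart_open error above: the fork's pretrained_emb parameter appears to be handed to smart_open, i.e. it expects a file path (in C word2vec text format), not an in-memory dict. A sketch of the corresponding fix, under that assumption; note also that vector_size is set to 100 above while wiki-news-300d-1M.vec is 300-dimensional, and the two would need to match:

# Sketch: pass the .vec path itself (the fastText .vec format already has
# the "<count> <dim>" header that the C word2vec text format uses).
vector_size = 300  # must match the dimensionality of the pretrained file
pre_model = gensim.models.doc2vec.Doc2Vec(
    documents=train_corpus, dm=dm, size=vector_size, window=window_size,
    min_count=min_count, sample=sampling_threshold, negative=negative_size,
    workers=worker_count, pretrained_emb=wordvec, iter=train_epoch)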

AttributeError: 'Doc2Vec' object has no attribute 'wv'

Dear doc2vec developers,
I have been trying to use your code to build my own DBOW model based on pre-trained word embeddings from the Google News corpus. The pre-trained word embeddings were loaded successfully.

I encountered the following error, which I could not solve:

2017-05-01 21:53:33,486 : INFO : expecting 80380 sentences, matching count from corpus used for vocabulary survey
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/local/lib/python2.7/dist-packages/gensim/models/word2vec.py", line 740, in worker_loop
    tally, raw_tally = self._do_train_job(sentences, alpha, (work, neu1))
  File "/usr/local/lib/python2.7/dist-packages/gensim/models/doc2vec.py", line 669, in _do_train_job
    doctag_vectors=doctag_vectors, doctag_locks=doctag_locks)
  File "gensim/models/doc2vec_inner.pyx", line 271, in gensim.models.doc2vec_inner.train_document_dbow (./gensim/models/doc2vec_inner.c:3511)
    word_vectors = model.wv.syn0
AttributeError: 'Doc2Vec' object has no attribute 'wv'

Can you provide me with some hints as to where things could have gone wrong?

Thanks a lot in advance.

Best regards,
Susie

Pre-Trained Word2Vec Models Question

Hi Jhlau,

I'm Quan Van Phu, student at Hanoi University of Science and Technology.
Thank you for sharing the Pre-Trained Word2Vec Models: English Wikipedia Skip-gram (1.4GB). Could you please help me answer two questions about this model?

Can you tell me the corpus size?
For instance: how many billion tokens, and how many distinct English words?
And the model's performance:
SimLex-999 = ?
Google Analogy = ?

I hope you can help me answer these questions as soon as possible.

Document Vectors

Hi, what labels should be used to reference a document vector in your pre-trained doc2vec model? Despite len(m.docvecs) being 35556952, m.docvecs[0] leads to IndexError: list index out of range.

Thanks
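
A sketch of how the lookup might work instead (an assumption based on the 0.13-era/forked gensim API, where the wiki model's document vectors are keyed by string doctags such as article titles rather than plain integers):

import gensim.models as g

m = g.Doc2Vec.load("doc2vec.bin")        # hypothetical path
some_tags = list(m.docvecs.doctags)[:5]  # docvecs.doctags maps string tags to entries
print(some_tags)
vec = m.docvecs[some_tags[0]]            # index by tag rather than by integer offset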

'ascii' codec can't decode byte 0xf7 in position 0: ordinal not in range(128)

Hi, I'm using the forked version of gensim, but when I load the model I get this error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf7 in position 0: ordinal not in range(128)
I tried encoding the name of the model like this:
model = g.Doc2Vec.load(model_path.encode('utf-8'))
But then I get this error:
File "C:\Users\fanta\Desktop\gensim-develop\gensim\utils.py", line 311, in _adapt_by_suffix
if fname.endswith('.gz') or fname.endswith('.bz2'):
TypeError: endswith first arg must be bytes or a tuple of bytes, not str

What must I do to solve this error?
Thanks.

load pretrained doc2vec models to up to date gensim versions

Hi,

Thanks for sharing your pre-trained models. They are the only publicly available models, as far as I know.

However, they are not easily loadable into newer gensim versions, such as the latest 2.3.0.

Do you have a working method for this? Otherwise, could you share the parameters that you found best for creating a general pretrained model for the en-wiki corpus?

Getting error TypeError: __init__() got an unexpected keyword argument 'pretrained_emb'

When I try to run train_model.py, the following error occurs:
python train_model.py
Traceback (most recent call last):
  File "train_model.py", line 30, in <module>
    model = g.Doc2Vec(docs, size=vector_size, window=window_size, min_count=min_count, sample=sampling_threshold, workers=worker_count, hs=0, dm=dm, negative=negative_size, dbow_words=1, dm_concat=1, pretrained_emb=pretrained_emb, iter=train_epoch)
  File "/usr/local/lib/python2.7/dist-packages/gensim-0.13.2-py2.7-linux-x86_64.egg/gensim/models/doc2vec.py", line 607, in __init__
    null_word=dm_concat, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'pretrained_emb'
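
The egg path in the traceback points at a stock gensim 0.13.2 release, and the pretrained_emb keyword exists only in the forked gensim discussed elsewhere in these issues, so a quick sketch for checking which gensim is actually being imported:

import gensim

# If this prints an upstream release and its install path rather than
# the forked checkout, the fork is not the gensim being used.
print(gensim.__version__)
print(gensim.__file__)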

How does pretrained_emb parameter work?

In the newest version of gensim (3.8.0), I was surprised to find that the pretrained_emb= parameter worked; I've read the source code but couldn't find anything related to this parameter.
My question is: does the pretrained embedding work like a lookup table? When a document is trained, words that appear both in the document and in pretrained_emb are initialized with the pretrained vectors, while words not in pretrained_emb are initialized randomly (correct me if I'm wrong).
If so, the pretrained and randomly initialized embeddings simply train together until convergence, which would certainly converge much faster; but I wonder whether, in theory, the result is better than initializing all word embeddings randomly from scratch (I've read your paper, and it seems the answer is yes, the results get better in practice).
Another question: if my pretrained_emb is large enough that it covers most of the vocabulary of the documents I want to train on, can I just use the pretrained embeddings for inference? E.g. extract the words that are in the pretrained embeddings to represent the document, lock those vectors, and only train the doc id to get the doc id embedding?
Thanks for your work! I would really appreciate it if you could answer!
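
For what it's worth, the lookup-table initialization described above can be written out as a short sketch (an illustration of the idea under a gensim 3.x API, not the fork's actual code; seed_with_pretrained and its pretrained argument are hypothetical):

import numpy as np

def seed_with_pretrained(model, pretrained):
    # pretrained: hypothetical dict mapping word -> numpy vector.
    # Words present in the table are overwritten with the pretrained
    # vector; all other words keep their random initialization.
    for i, word in enumerate(model.wv.index2word):
        vec = pretrained.get(word)
        if vec is not None:
            model.wv.vectors[i] = np.asarray(vec)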

Doc2vec Attribute error while loading

First of all, thanks for sharing your pretrained vectors.

I downloaded the pre-trained doc2vec model

and ran this piece of code:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
model = Doc2Vec.load("doc2vec.bin")

I get this error:
AttributeError: 'Doc2Vec' object has no attribute 'batch_words'

even though doc2vec.bin, doc2vec.bin.syn0.npy, and doc2vec.bin.syn1neg.npy
are in the same folder at the root.

legacy python?

Hey. I've been trying to use your pre-trained model using the AP corpus, but I get an error on unpickling:

> python infer_test.py     
/home/andy/anaconda3/lib/python3.5/site-packages/gensim/utils.py:1015: UserWarning: Pattern library is not installed, lemmatization won't be available.
  warnings.warn("Pattern library is not installed, lemmatization won't be available.")
Traceback (most recent call last):
  File "infer_test.py", line 15, in <module>
    m = g.Doc2Vec.load(model)
  File "/home/andy/anaconda3/lib/python3.5/site-packages/gensim/models/word2vec.py", line 1762, in load
    model = super(Word2Vec, cls).load(*args, **kwargs)
  File "/home/andy/anaconda3/lib/python3.5/site-packages/gensim/utils.py", line 248, in load
    obj = unpickle(fname)
  File "/home/andy/anaconda3/lib/python3.5/site-packages/gensim/utils.py", line 912, in unpickle
    return _pickle.loads(f.read())
UnicodeDecodeError: 'ascii' codec can't decode byte 0xfb in position 1: ordinal not in range(128)

Given your use of the codecs package, I guess you're using Python 2? Any chance you could build a Python 3 version?
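
One workaround sometimes suggested for Python 2 pickles, as a sketch only: force a byte-tolerant encoding when unpickling (newer gensim does something similar internally, as a traceback further down shows). Note this bypasses gensim's loader, so any separately saved .npy arrays would not be attached:

import pickle

with open("doc2vec.bin", "rb") as f:         # hypothetical path
    obj = pickle.load(f, encoding="latin1")  # decode py2 byte strings leniently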

AttributeError: Can't get attribute 'DocvecsArray'

Any advice on the following will be greatly appreciated.

I am attempting to load one of the pretrained models, via this code:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
model = Doc2Vec.load('doc2vec.bin')

The stack trace and error:

Traceback (most recent call last):
  File "/Users/bruceschechter/Dropbox/dev/pycharm/p056 gensim explore/main.py", line 13, in <module>
    main()
  File "/Users/bruceschechter/Dropbox/dev/pycharm/p056 gensim explore/main.py", line 8, in main
    model = Doc2Vec.load('doc2vec.bin')
  File "/opt/homebrew/lib/python3.9/site-packages/gensim/models/doc2vec.py", line 813, in load
    raise ae
  File "/opt/homebrew/lib/python3.9/site-packages/gensim/models/doc2vec.py", line 807, in load
    return super(Doc2Vec, cls).load(*args, rethrow=True, **kwargs)
  File "/opt/homebrew/lib/python3.9/site-packages/gensim/models/word2vec.py", line 1937, in load
    raise ae
  File "/opt/homebrew/lib/python3.9/site-packages/gensim/models/word2vec.py", line 1930, in load
    model = super(Word2Vec, cls).load(*args, **kwargs)
  File "/opt/homebrew/lib/python3.9/site-packages/gensim/utils.py", line 485, in load
    obj = unpickle(fname)
  File "/opt/homebrew/lib/python3.9/site-packages/gensim/utils.py", line 1460, in unpickle
    return _pickle.load(f, encoding='latin1')  # needed because loading from S3 doesn't support readline()
AttributeError: Can't get attribute 'DocvecsArray' on <module 'gensim.models.doc2vec' from '/opt/homebrew/lib/python3.9/site-packages/gensim/models/doc2vec.py'>

My config:
MacBook Pro M1, macOS 12.3.1
Python 3.9.12 (via Homebrew)
pip 22.0.4
gensim 4.1.2

Thanks!
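
Since the DocvecsArray class no longer exists in modern gensim, one workaround is to load the model under a legacy gensim in a separate environment; a sketch, assuming the model was saved with the 0.13-era (forked) gensim referenced throughout these issues:

# In a dedicated virtualenv/conda env with a matching legacy gensim, e.g.:
#   pip install "gensim==0.13.2"
import gensim.models as g

m = g.Doc2Vec.load("doc2vec.bin")  # hypothetical path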

how is the text preprocessing done ?

Hi, I want to extract doc2vec features for the sentences in MS COCO, but I'm not quite sure how the preprocessing is performed.

The paper says the articles are tokenised and lowercased using Stanford CoreNLP. From the files under toy_data/ and the two .py files, I gather that each article is squashed into a single line in the *_docs.txt files. But those files are already processed.

Now I've installed Stanford CoreNLP and can call it from the command line. After concatenating the 5 sentences for a COCO image (separated by a space), treating the result as an article, and writing it to input.txt, my call looks like:

java edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit -outputFormat conll -output.columns word -file input.txt

However, the output is not lowercased. How should I modify the command to enable lowercasing?

By the way, there are other tokenization options shown here, like americanize. Did you use them when training the doc2vec model? If possible, I hope you can provide the details of your preprocessing method.

Thanks
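
If CoreNLP's tokenizer does not lowercase for you, one simple option is to lowercase as a post-processing step in Python (an assumption about the pipeline, not necessarily how the authors did it; file names hypothetical):

import io

# Lowercase the CoreNLP-tokenized output file line by line.
with io.open("input.tok.txt", encoding="utf-8") as fin, \
     io.open("input.lower.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        fout.write(line.lower())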

wiki doc2vec doctags

I was wondering whether the original doctags for the pre-trained wiki doc2vec model are available anywhere? I know the file is probably very large, but I'm hoping on the off-chance that you know where I could download it. Thanks!

'Doc2Vec' object has no attribute 'neg_labels'

Hello, thank you for sharing these models, many thanks. I'm getting this error when I run the infer_test.py code, even though I'm using your forked version of gensim. Do you have any suggestions, please?

Pre-processing of text

Is it preferable to perform stemming or stop-word removal on the data before feeding it to the pre-trained DBOW model?

'Doc2Vec' object has no attribute 'batch_words'

Hi.
I'm loading the pretrained Doc2Vec model from English Wikipedia (in the same folder I have the .bin and the syn0 and syn1 files). But when I load the model I get the error: 'Doc2Vec' object has no attribute 'batch_words'. It seems the problem is due to different versions of gensim. Could you help me, please?

Here is the complete error message:

AttributeError                            Traceback (most recent call last)
<ipython-input> in <module>()
----> 1 model = g.Doc2Vec.load(model_path)
      2 test_docs = [ x.strip().split() for x in codecs.open(test_docs, "r", "utf-8").readlines() ]

/usr/local/lib/python2.7/dist-packages/gensim/models/doc2vec.pyc in load(cls, *args, **kwargs)
    691                 logger.info('Model saved using code from earlier Gensim Version. Re-loading old model in a compatible way.')
    692                 from gensim.models.deprecated.doc2vec import load_old_doc2vec
--> 693                 return load_old_doc2vec(*args, **kwargs)
    694
    695     def estimate_memory(self, vocab_size=None, report=None):

/usr/local/lib/python2.7/dist-packages/gensim/models/deprecated/doc2vec.pyc in load_old_doc2vec(*args, **kwargs)
    107         'iter': old_model.iter,
    108         'sorted_vocab': old_model.sorted_vocab,
--> 109         'batch_words': old_model.batch_words,
    110         'compute_loss': old_model.__dict__.get('compute_loss', None)
    111     }

AttributeError: 'Doc2Vec' object has no attribute 'batch_words'

This is the code:

import gensim.models as g

#parameters
model_path = "myfolder/test/doc2vec.bin"
test_docs = "myfolder/test/test.txt"
output_file = "myfolder/test/test_vectors.txt"

#inference hyper-parameters
start_alpha = 0.01
infer_epoch = 1000

#load model
model = g.Doc2Vec.load(model_path)

model loading

I have been using Google Colab for my project with Python 3.6. The pretrained "English Wikipedia" model loads perfectly and even works fine for my downstream task. But I want to know whether it will keep loading the same way, without any problems, in the future too.

Can the format/extension of the pre-trained word embeddings be '.bin', as referenced in this line: https://github.com/jhlau/doc2vec/blob/158df84b83c1b2b3038c420df03a3f063f7a50be/train_model.py#L17

Please help me here.
Should the format/extension of the pre-trained word embeddings in the line below always be '.txt'?

pretrained_emb = "toy_data/pretrained_word_embeddings.txt" #None if use without pretrained embeddings

I want to use the Associated Press News DBOW (0.6GB) model as the pretrained_emb and further fine-tune it on my corpus. How would I do that?
I am thinking of using the doc2vec.bin file from the Associated Press News DBOW (0.6GB) download; will that work?

About getting document vector

Hello, I'm a very new student of doc2vec and have some questions about document vectors.
What I'm trying to get is the vector of a phrase like 'cat like mammal'.
So, what I've tried so far, using a doc2vec pre-trained model, is the code below:

import gensim.models as g
model = "path/pre-trained doc2vec model.bin"
m = g.Doc2Vec.load(model)
oneword = 'cat'
phrase = 'cat like mammal'
oneword_vec = m[oneword]  # single-word lookup works
phrase_vec = m[phrase]    # fails: the whole phrase is not a vocabulary entry

When I tried this code, I could get a vector for the single word 'cat', but not for 'cat like mammal',
because word2vec only provides vectors for single words like 'cat', right? (If I'm wrong, please correct me.)
So I searched, found infer_vector(), and tried the code below:

phrase = phrase.lower().split(' ')
phrase_vec = m.infer_vector(phrase)

When I tried this code, I could get a vector, but I get a different value every time I run
phrase_vec = m.infer_vector(phrase) again,
because infer_vector takes random 'steps'.

When I set steps=0, I always get the same vector.
phrase_vec = m.infer_vector(phrase, steps=0)

However, I have also read that a document vector can be obtained by averaging the words in the document:
if the document is composed of the three words 'cat like mammal', add the three vectors of 'cat', 'like', and 'mammal' and average them, and that would be the document vector. (If I'm wrong, please correct me.)

So... here are some questions.

  1. Is using infer_vector() with steps=0 the right way to get the vector of a phrase?
  2. If averaging word vectors is the right way to get a document vector, is there no need to use infer_vector()? (A sketch of this averaging baseline follows after the list.)
  3. What is model.docvecs for?
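
For comparison, the word-averaging baseline from question 2 can be written directly (a sketch assuming the old gensim API used above, where m[word] returns a word vector and the in operator checks the vocabulary):

import numpy as np

tokens = phrase  # already lowercased and split above
in_vocab = [w for w in tokens if w in m]  # keep only in-vocabulary words
avg_vec = np.mean([m[w] for w in in_vocab], axis=0)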

AttributeError: 'Doc2Vec' object has no attribute 'neg_labels'

While loading your pre-trained doc2vec models, namely English Wikipedia DBOW (1.4GB) and Associated Press News DBOW (0.6GB), the error AttributeError: 'Doc2Vec' object has no attribute 'neg_labels' occurs.

I also notice that in the loaded model, negative is set to 5, but no neg_labels are provided.

About the doc2vec pretrained_model

Hi, when I use the enwiki_dbow doc2vec pretrained model, I have a problem loading it
(I decompressed enwiki_dbow first):

model = gensim.models.Doc2Vec.load_word2vec_format( './enwiki_dbow/doc2vec.bin', binary=True)

UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte
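
A likely cause, inferred from the other issues here rather than confirmed: the distributed doc2vec.bin is a pickled gensim Doc2Vec model (with companion .npy arrays), not a file in C word2vec binary format, so load_word2vec_format is the wrong loader. A sketch of the alternative:

import gensim.models as g

# The companion doc2vec.bin.* .npy files must sit alongside the .bin.
model = g.Doc2Vec.load('./enwiki_dbow/doc2vec.bin')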

pretrained_emb argument is not recognized

Hi,

I am trying to use your code and test it with the toy data. However, the pretrained_emb argument is not recognized. This is the code:

#python example to train doc2vec model (with or without pre-trained word embeddings)

import gensim.models as g
import logging

#doc2vec parameters
vector_size = 300
window_size = 15
min_count = 1
sampling_threshold = 1e-5
negative_size = 5
train_epoch = 100
dm = 0 #0 = dbow; 1 = dmpv
worker_count = 1 #number of parallel processes

#pretrained word embeddings
pretrained_emb = "toy_data/pretrained_word_embeddings.txt" #None if use without pretrained embeddings

#input corpus
train_corpus = "toy_data/train_docs.txt"

#output model
saved_path = "toy_data/model.bin"

#enable logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

#train doc2vec model
docs = g.doc2vec.TaggedLineDocument(train_corpus)

model = g.Doc2Vec(docs, size=vector_size, window=window_size, min_count=min_count, sample=sampling_threshold, workers=worker_count, hs=0, dm=dm, negative=negative_size, pretrained_emb=pretrained_emb,dbow_words=1, dm_concat=1, iter=train_epoch)

#save model
model.save(saved_path)

And this is the error:

Traceback (most recent call last):
  File "C:/Users/12714818_Admin/Desktop/CMCRC/Boundlss_2017/May-Aug/Context_including/Conversation_clustering/src/train_model.py", line 31, in <module>
    model = g.Doc2Vec(docs, size=vector_size, window=window_size, min_count=min_count, sample=sampling_threshold, workers=worker_count, hs=0, dm=dm, negative=negative_size, pretrained_emb=pretrained_emb, dbow_words=1, dm_concat=1, iter=train_epoch)
  File "C:\ProgramData\Anaconda3\envs\py27\lib\site-packages\gensim\models\doc2vec.py", line 625, in __init__
    **kwargs)
TypeError: __init__() got an unexpected keyword argument 'pretrained_emb'

I am using Python 2.7.

'Word2Vec' object has no attribute 'infer_vector'

I've cloned the repo, downloaded the pre-trained Wikipedia model, and installed Gensim via pip install git+https://github.com/jhlau/gensim.

Then I pasted the downloaded model files into the toy_data directory and changed the model line in the file to: model="toy_data/word2vec.bin".

However, when I run infer_test.py I get the following error:

Traceback (most recent call last):
  File "infer_test.py", line 25, in <module>
    output.write( " ".join([str(x) for x in m.infer_vector(d, alpha=start_alpha, steps=infer_epoch)]) + "\n" )
AttributeError: 'Word2Vec' object has no attribute 'infer_vector'
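
The error message itself is consistent with the loaded file being a Word2Vec model: infer_vector exists only on Doc2Vec models. A sketch of the distinction (paths hypothetical):

import gensim.models as g

w2v = g.Word2Vec.load("toy_data/word2vec.bin")  # word vectors only; no infer_vector
d2v = g.Doc2Vec.load("toy_data/doc2vec.bin")    # supports inferring vectors for new docs
vec = d2v.infer_vector(["some", "tokenised", "document"])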

pretrained_word_embeddings

Hi,

I want to use pretrained word embeddings with a larger vocabulary. How can I get them? I found your answer to a Stack Overflow question saying the .txt file should be in the C word2vec tool's text format. Can you say more about how to produce the C word2vec format?
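
For reference, a sketch of producing the C word2vec text format with a recent gensim (model path hypothetical): the file begins with a "<vocab_size> <dimensions>" header line, followed by one "word v1 v2 ... vd" line per word.

import gensim.models as g

w2v = g.Word2Vec.load("my_word2vec.model")  # hypothetical path
# binary=False writes the C tool's *text* format described above.
w2v.wv.save_word2vec_format("pretrained_word_embeddings.txt", binary=False)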
