jhlau / doc2vec

Python scripts for training/testing paragraph vectors

License: Apache License 2.0

Python 100.00%

doc2vec's Issues

Forked version of gensim

I understand that to use pretrained word embeddings to train the doc2vec models, we should install your forked version of gensim. However, I have failed to configure it properly with a C compiler (MinGW in my case; it could also involve BLAS, etc.). I'm on Windows and am using Anaconda.

I tried installing straight from setup.py, using pip, and even created a conda package and installed using that. Each time, I did succeed in installing the forked version of gensim.

However, every time I tried the following commands in a python shell-

>>> import gensim
>>> gensim.models.word2vec.FAST_VERSION

...the output was a -1. This meant that training would be very, very slow. (70 times slower iirc.)

How do I get your version and still retain the link to the C compiler?
(If I install the current distributed version using conda install gensim, it is linked to my MINGW.)
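
A minimal sketch for checking which training routines an installed gensim is using, based on the same `gensim.models.word2vec.FAST_VERSION` flag queried above. The helper names `read_fast_version` and `describe_fast_version` are hypothetical and only for illustration. The sketch reports the state of an install; it does not fix the compiler setup.

```python
def read_fast_version():
    """Return gensim's FAST_VERSION flag, or None if gensim or the flag is unavailable."""
    try:
        from gensim.models import word2vec
    except ImportError:
        return None
    return getattr(word2vec, "FAST_VERSION", None)


def describe_fast_version(value):
    """Hypothetical helper: turn the flag into a readable status string."""
    if value is None:
        return "unknown: gensim is not importable or the flag is missing"
    if value < 0:
        # -1 is the value reported in this issue, where training was very slow
        return "slow: the compiled training routines were not loaded"
    return "fast: the compiled training routines are in use"


if __name__ == "__main__":
    print(describe_fast_version(read_fast_version()))
```

Running this once in each environment (the conda-installed gensim and the fork) shows which install lost the compiled extension.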

Does worker_count mean the number of cores we want to use? (this line: https://github.com/jhlau/doc2vec/blob/158df84b83c1b2b3038c420df03a3f063f7a50be/train_model.py#L14)

doc2vec/train_model.py

Line 14 in 158df84

worker_count = 1 #number of parallel processes

What does the variable worker_count mean? Is it the number of cores we want to use?
If worker_count = -1, does that mean all cores/processes are used (and does that apply here)?
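
A small sketch of one way to pick a value for worker_count, assuming that the value is passed through as gensim's `workers` argument (as the train_model.py call does) and that a negative value has no special "all cores" meaning there. That second point is an assumption on my part, not something confirmed in this thread. `resolve_worker_count` is a hypothetical helper.

```python
import multiprocessing


def resolve_worker_count(requested):
    """Map a requested worker count to a usable positive number.

    None or any value below 1 is treated as "use every available core";
    larger requests are capped at the number of cores on this machine.
    """
    available = multiprocessing.cpu_count()
    if requested is None or requested < 1:
        return available
    return min(requested, available)


if __name__ == "__main__":
    worker_count = resolve_worker_count(-1)
    print("worker_count =", worker_count)
```

With this, writing -1 in a config resolves to an explicit positive count before the value reaches the model.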

Pretrained Embedding, TypeError: don't know how to handle uri

@jhlau Hey Jey, I hope you're doing well. This is Gan and I was trying to use your forked version and load a pre-trained word vector, wiki-news-300d-1M.vec, from https://fasttext.cc/docs/en/english-vectors.html; however, I'm getting the error: TypeError: don't know how to handle uri, and I think it's from the smart_open function. I'm training a very small corpus so I think it may be better to initialize with the pre-trained vector.
Following is the code:

import io

import gensim

vector_size = 100
window_size = 10
min_count = 1
sampling_threshold = 1e-5
negative_size = 5
train_epoch = 50
dm = 0 #0 = dbow; 1 = dmpv
worker_count = 4 #number of parallel processes

wordvec = "/Users/ggao/Downloads/wiki-news-300d-1M.vec"

def load_vectors(fname):
    # read a .vec text file into a dict mapping each word to its list of values
    fin = io.open(fname, 'r', encoding='utf-8', newline='\n', errors='ignore')
    n, d = map(int, fin.readline().split())  # header line: word count and dimension
    data = {}
    for line in fin:
        tokens = line.rstrip().split(' ')
        data[tokens[0]] = [float(x) for x in tokens[1:]]
    fin.close()
    return data

# pretrained_emb is the dict returned by load_vectors; train_corpus is defined elsewhere (not shown)
pretrained_emb = load_vectors(wordvec)
pre_model = gensim.models.doc2vec.Doc2Vec(documents=train_corpus, dm=dm, size=vector_size, window=window_size, min_count=min_count, sample=sampling_threshold, negative=negative_size, workers=worker_count, pretrained_emb=pretrained_emb, iter=train_epoch)

Do you have any idea what's wrong here?
Thank you so much; I look forward to your reply.

Best,
Gan
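
A sketch of a sanity check that may help here. train_model.py in this repository sets pretrained_emb to a file path ("toy_data/pretrained_word_embeddings.txt"), so passing a dict instead of a path may be what triggers the "don't know how to handle uri" error; that is a guess, not a confirmed diagnosis. Separately, the file name suggests 300-dimensional vectors while vector_size is 100. The helpers `read_embedding_header` and `dimension_matches` are hypothetical and only read the header line that load_vectors above also parses.

```python
import io


def read_embedding_header(path):
    """Return (word_count, dimension) from the first line of a text embedding file."""
    with io.open(path, "r", encoding="utf-8", errors="ignore") as fin:
        fields = fin.readline().split()
    if len(fields) != 2:
        raise ValueError("first line is not a 'count dimension' header")
    return int(fields[0]), int(fields[1])


def dimension_matches(path, vector_size):
    """True if the file's vector dimension equals the vector_size used for training."""
    _, dimension = read_embedding_header(path)
    return dimension == vector_size
```

If `dimension_matches(wordvec, vector_size)` is False, either vector_size or the embedding file probably needs to change before the path is handed to the model.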

AttributeError: 'Doc2Vec' object has no attribute 'wv'

Dear doc2vec developers,
I have been trying to use your code to build my own dbow model based on pre-trained word embeddings of google news corpus. The pre-trained word embeddings were successfully loaded in.

The following errors I encountered could not be solved:

2017-05-01 21:53:33,486 : INFO : expecting 80380 sentences, matching count from corpus used for vocabulary survey
Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File "/usr/local/lib/python2.7/dist-packages/gensim/models/word2vec.py", line 740, in worker_loop
tally, raw_tally = self._do_train_job(sentences, alpha, (work, neu1))
File "/usr/local/lib/python2.7/dist-packages/gensim/models/doc2vec.py", line 669, in _do_train_job
doctag_vectors=doctag_vectors, doctag_locks=doctag_locks)
File "gensim/models/doc2vec_inner.pyx", line 271, in gensim.models.doc2vec_inner.train_document_dbow
(./gensim/models/doc2vec_inner.c:3511)
word_vectors = model.syn0
AttributeError: 'Doc2Vec' object has no attribute 'wv'

Can you give me some hints about what could possibly be going wrong?

Thanks a lot in advance.

Best regards,
Susie

Pre-Trained Word2Vec Models Question

Hi Jhlau,

I'm Quan Van Phu, student at Hanoi University of Science and Technology.
Thank you for sharing the pre-trained Word2Vec model English Wikipedia Skip-gram (1.4GB). I have two questions about this model:

How large is the training corpus? For instance: how many billion tokens, and how many distinct English words?
What is the model's performance on:
SimLex999 = ?
Google Analogy = ?

I hope you can answer these questions as soon as possible.

Document Vectors

Hi, what labels should be used to reference a document vector in your pre-trained doc2vec model? Although len(m.docvecs) is 35556952, m.docvecs[0] raises IndexError: list index out of range.

Thanks

save model in non-binary format

How can I save the model in non-binary format?
Thank you.
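
One reading of this question is exporting vectors as plain text rather than saving the whole model. The sketch below writes a dict of vectors in the common word2vec text layout (a "count dimension" header, then one word and its values per line). It is independent of gensim, and `write_text_vectors` is a hypothetical helper; how to pull the vectors out of a trained model is not covered here.

```python
import io


def write_text_vectors(vectors, path):
    """Write {word: [floats]} to `path` in word2vec text layout, sorted by word."""
    items = sorted(vectors.items())
    dimension = len(items[0][1]) if items else 0
    with io.open(path, "w", encoding="utf-8") as fout:
        # header line: number of words and vector dimension
        fout.write(u"%d %d\n" % (len(items), dimension))
        for word, values in items:
            if len(values) != dimension:
                raise ValueError("vector for %r has the wrong length" % word)
            fout.write(u"%s %s\n" % (word, u" ".join(repr(float(v)) for v in values)))
```

The resulting file can be opened in any text editor, which is usually the point of asking for a non-binary format.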

'ascii' codec can't decode byte 0xf7 in position 0: ordinal not in range(128)

Hi, I'm using the forked version of gensim, but when I load the model I get this error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf7 in position 0: ordinal not in range(128)
I tried encoding the model path like this:
model = g.Doc2Vec.load(model_path.encode('utf-8'))
But then I get this error:
File "C:\Users\fanta\Desktop\gensim-develop\gensim\utils.py", line 311, in _adapt_by_suffix
if fname.endswith('.gz') or fname.endswith('.bz2'):
TypeError: endswith first arg must be bytes or a tuple of bytes, not str

What must I do to solve this error?
Thanks.

error while loading model

I am using Python 3.
I also tried loading the .tgz file directly. Please help.

load pretrained doc2vec models to up to date gensim versions

Hi,

Thanks for sharing your pre-trained models. They are the only publicly available models afaik.

However, they are not easily loadable to newer gensim versions such as the latest 2.3.0

Do you have a working method for this? Otherwise, could you share the parameters that you found best to create a general pretrained model for en-wiki corpus?

Getting error TypeError: __init__() got an unexpected keyword argument 'pretrained_emb'

When I try to run train_model.py, I get the following error:
python train_model.py
Traceback (most recent call last):
File "train_model.py", line 30, in <module>
model = g.Doc2Vec(docs, size=vector_size, window=window_size, min_count=min_count, sample=sampling_threshold, workers=worker_count, hs=0, dm=dm, negative=negative_size, dbow_words=1, dm_concat=1, pretrained_emb=pretrained_emb, iter=train_epoch)
File "/usr/local/lib/python2.7/dist-packages/gensim-0.13.2-py2.7-linux-x86_64.egg/gensim/models/doc2vec.py", line 607, in __init__
null_word=dm_concat, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'pretrained_emb'
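
The path in the traceback points at a gensim-0.13.2 egg under dist-packages, so one thing worth checking is whether Python is importing the forked gensim or a different install. The sketch below only reports where a top-level module would be loaded from; `module_origin` is a hypothetical helper, and whether a stock install is actually the cause here is an assumption.

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module would be imported from, or None if not found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


if __name__ == "__main__":
    # Compare this path with the directory where the fork was installed.
    print(module_origin("gensim"))
```

If the printed path is not the fork's location, uninstalling the other copy or using a clean virtual environment is a reasonable next step.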

train Wikipedia's Chinese data sets for doc2vec

I have trained a word2vec model on Chinese Wikipedia. I would like to use this method to train doc2vec on the Chinese Wikipedia data set on Ubuntu. What should I do?

How does pretrained_emb parameter work?

In the newest version of Gensim (3.8.0), I was surprised to find that the "pretrained_emb=" param worked well. I've read the source code but couldn't find anything related to this param...
My question is: do pretrained embeddings work like a lookup table? When a doc is trained, words that appear in both the doc and pretrained_emb are initialized to the pretrained vectors, and other words that are not in pretrained_emb are initialized randomly (correct me if I'm wrong).
If so, the pretrained and the randomly initialized embeddings are trained together to convergence. It would definitely converge much faster, but I wonder whether, theoretically, the result is better than with word embeddings all initialized randomly from scratch (I've read your paper and it seems that yes, the results are better in practice).
Another question: if my pretrained_emb is large enough, say it covers most of the vocab of the docs I want to train on, can I just use the pretrained embeddings to infer? E.g. extract the words that are in the pretrained embeddings to represent the doc, lock these vectors, and train only the doc id to get the doc id embedding?
Thanks for your work! I would really appreciate it if you could answer!

UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte

I tried to run your train_model.py with my pretrained word2vec model and got this error. I then tried the pretrained word2vec AP News skip-gram and English Wikipedia models, but the error keeps coming.
Is there any other step I need to take to run train_model.py?

PS: I am already using your forked gensim too.

Doc2vec Attribute error while loading

First, thanks for sharing your pretrained vectors.

I downloaded the pre-trained doc2vec model.

I ran this piece of code:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
model = Doc2Vec.load("doc2vec.bin ")

I get this error:
AttributeError: 'Doc2Vec' object has no attribute 'batch_words'

even though doc2vec.bin, doc2vec.bin.syn0.npy and doc2vec.bin.syn1neg.npy
are in the same folder, at the root.

what is the time complexity of doc2vec model

What is the time complexity of the doc2vec model? It seems very efficient to compute.

legacy python?

Hey. I've been trying to use your pre-trained model using the AP corpus, but I get an error on unpickling:

> python infer_test.py     
/home/andy/anaconda3/lib/python3.5/site-packages/gensim/utils.py:1015: UserWarning: Pattern library is not installed, lemmatization won't be available.
  warnings.warn("Pattern library is not installed, lemmatization won't be available.")
Traceback (most recent call last):
  File "infer_test.py", line 15, in <module>
    m = g.Doc2Vec.load(model)
  File "/home/andy/anaconda3/lib/python3.5/site-packages/gensim/models/word2vec.py", line 1762, in load
    model = super(Word2Vec, cls).load(*args, **kwargs)
  File "/home/andy/anaconda3/lib/python3.5/site-packages/gensim/utils.py", line 248, in load
    obj = unpickle(fname)
  File "/home/andy/anaconda3/lib/python3.5/site-packages/gensim/utils.py", line 912, in unpickle
    return _pickle.loads(f.read())
UnicodeDecodeError: 'ascii' codec can't decode byte 0xfb in position 1: ordinal not in range(128)

Given your use of the codecs module, I guess you're using Python 2? Any chance you could build a Python 3 version?
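
A self-contained sketch of why this 'ascii' codec error appears when a pickle written by Python 2 is read on Python 3. The traceback above shows `_pickle.loads(f.read())` called without an `encoding` argument, while a later traceback on this page shows gensim calling `_pickle.load(f, encoding='latin1')`. `PY2_STYLE_PICKLE` is a hand-written stand-in for a Python 2 `str` holding the byte 0xfb, not data from the real model, and whether the loaded model is otherwise compatible is a separate question.

```python
import pickle

# Protocol-0 pickle of a Python 2 byte string containing the single byte 0xfb.
PY2_STYLE_PICKLE = b"S'\\xfb'\n."


def load_default(data):
    """Unpickle with Python 3 defaults (Python 2 str objects are decoded as ASCII)."""
    return pickle.loads(data)


def load_latin1(data):
    """Unpickle while decoding Python 2 str objects as latin1, which accepts any byte."""
    return pickle.loads(data, encoding="latin1")


if __name__ == "__main__":
    try:
        load_default(PY2_STYLE_PICKLE)
    except UnicodeDecodeError as err:
        print("default load failed:", err)
    print("latin1 load gave:", repr(load_latin1(PY2_STYLE_PICKLE)))
```

The same `encoding="latin1"` argument is what lets Python 3 read Python 2 pickles that contain NumPy arrays.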

AttributeError: Can't get attribute 'DocvecsArray'

Any advice on the following will be greatly appreciated.

I am attempting to load one of the pretrained models, via this code:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
model = Doc2Vec.load('doc2vec.bin')

The stack trace and error:

Traceback (most recent call last):
  File "/Users/bruceschechter/Dropbox/dev/pycharm/p056 gensim explore/main.py", line 13, in <module>
    main()
  File "/Users/bruceschechter/Dropbox/dev/pycharm/p056 gensim explore/main.py", line 8, in main
    model = Doc2Vec.load('doc2vec.bin')
  File "/opt/homebrew/lib/python3.9/site-packages/gensim/models/doc2vec.py", line 813, in load
    raise ae
  File "/opt/homebrew/lib/python3.9/site-packages/gensim/models/doc2vec.py", line 807, in load
    return super(Doc2Vec, cls).load(*args, rethrow=True, **kwargs)
  File "/opt/homebrew/lib/python3.9/site-packages/gensim/models/word2vec.py", line 1937, in load
    raise ae
  File "/opt/homebrew/lib/python3.9/site-packages/gensim/models/word2vec.py", line 1930, in load
    model = super(Word2Vec, cls).load(*args, **kwargs)
  File "/opt/homebrew/lib/python3.9/site-packages/gensim/utils.py", line 485, in load
    obj = unpickle(fname)
  File "/opt/homebrew/lib/python3.9/site-packages/gensim/utils.py", line 1460, in unpickle
    return _pickle.load(f, encoding='latin1') # needed because loading from S3 doesn't support readline()
AttributeError: Can't get attribute 'DocvecsArray' on <module 'gensim.models.doc2vec' from '/opt/homebrew/lib/python3.9/site-packages/gensim/models/doc2vec.py'>

My config...
MBPro M1, MacOS 12.3.1
Python 3.9.12. (via Homebrew)
pip 22.0.4
Gensim 4.1.2

Thanks!

how is the text preprocessing done ?

Hi, I want to extract doc2vec features for the sentences in MS COCO, but I'm not quite sure how the preprocessing is performed.

The paper says that the articles are tokenised and lowercased using Stanford CoreNLP. From the files under toy_data/ and the two py files, I guess that each article is squashed into a single line in the *_docs.txt files. But these two files are already processed.

I've now installed Stanford CoreNLP and can call it from the command line. After concatenating the 5 sentences for a COCO image (separated by spaces), treating the result as an article, and writing it into input.txt, I call it like this:

java edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit -outputFormat conll -output.columns word -file input.txt

However, the output is not lowercased. How should I modify the command to enable lowercasing?

By the way, there are other tokenization options shown here, like americanize. Did you use them when training the doc2vec model? If possible, I hope you can provide the details of your preprocessing method.

Thanks
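
If lowercasing cannot be switched on in the command itself, one workaround is to lowercase the tokenizer output afterwards. The sketch below assumes the output produced with `-outputFormat conll -output.columns word` has one token per line and blank lines between sentences, and that each article should end up on a single line as in the *_docs.txt files. `conll_words_to_line` is a hypothetical helper, and this is not necessarily how the models here were preprocessed.

```python
def conll_words_to_line(conll_text):
    """Join one-token-per-line output into a single lowercased, space-separated line."""
    tokens = [
        line.strip().lower()
        for line in conll_text.splitlines()
        if line.strip()  # skip the blank lines that separate sentences
    ]
    return " ".join(tokens)


if __name__ == "__main__":
    sample = "A\nCat\n.\n\nIt\nSits\n"
    print(conll_words_to_line(sample))
```

Each COCO "article" would then become one line of the file passed to infer_test.py.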

wiki doc2vec doctags

I was wondering whether the original doctags for the pre-trained wiki doc2vec model are available. I know the file is probably very large, but I'm hoping, on the off-chance, that you know where I could download it. Thanks!

'Doc2Vec' object has no attribute 'neg_labels'

Hello, thank you very much for sharing these models. I'm getting this error when I run the infer_test.py code even though I'm using your forked version of gensim. Do you have any suggestions, please?

Pre-processing of text

Is it preferable to perform stemming or stop word removal before feeding in the data when using the pre-trained DBOW model?

Regarding the word2vec.bin file

Can you tell me where this particular file is located? I can't seem to find the word2vec.bin file.

'Doc2Vec' object has no attribute 'batch_words'

Hi.
I'm loading the pretrained Doc2Vec model from English Wikipedia (the .bin, syn0 and syn1 files are in the same folder). But when I load the model I get the error: 'Doc2Vec' object has no attribute 'batch_words'. The problem seems to be due to different versions of gensim. Could you help me, please?

Here is the complete error message:
AttributeError                            Traceback (most recent call last)
in ()
----> 1 model = g.Doc2Vec.load(model_path)
2 test_docs = [ x.strip().split() for x in codecs.open(test_docs, "r", "utf-8").readlines() ]

/usr/local/lib/python2.7/dist-packages/gensim/models/doc2vec.pyc in load(cls, *args, **kwargs)
691 logger.info('Model saved using code from earlier Gensim Version. Re-loading old model in a compatible way.')
692 from gensim.models.deprecated.doc2vec import load_old_doc2vec
--> 693 return load_old_doc2vec(*args, **kwargs)
694
695 def estimate_memory(self, vocab_size=None, report=None):

/usr/local/lib/python2.7/dist-packages/gensim/models/deprecated/doc2vec.pyc in load_old_doc2vec(*args, **kwargs)
107 'iter': old_model.iter,
108 'sorted_vocab': old_model.sorted_vocab,
--> 109 'batch_words': old_model.batch_words,
110 'compute_loss': old_model.__dict__.get('compute_loss', None)
111 }

AttributeError: 'Doc2Vec' object has no attribute 'batch_words'

This is the code
#parameters
model_path="myfolder/test/doc2vec.bin"
test_docs="myfolder/test/test.txt"
output_file="myfolder/test/test_vectors.txt"

#inference hyper-parameters
start_alpha=0.01
infer_epoch=1000

#load model
model = g.Doc2Vec.load(model_path)

model loading

I have been using google colab for my project with python version 3.6. The pretrained model "English Wikipedia" loads perfectly and even works fine for my downstream task. But I want to know if it will load in the same way without any problem in future too.

Can the format/extension of the pre-trained word embeddings referenced in this line be '.bin'? https://github.com/jhlau/doc2vec/blob/158df84b83c1b2b3038c420df03a3f063f7a50be/train_model.py#L17

Please help me here.
Should the format/extension of the pre-trained word embeddings in the line below always be '.txt'?

doc2vec/train_model.py

Line 17 in 158df84

 pretrained_emb = "toy_data/pretrained_word_embeddings.txt" #None if use without pretrained embeddings 

I want to use the Associated Press News DBOW (0.6GB) as the pretrained_emb and further fine-tune it for my corpus. How would I do it?
I am thinking of using the doc2vec.bin file in the Associated Press News DBOW (0.6GB), will it work?

About getting document vector

Hello, I'm very new to doc2vec and have some questions about document vectors.
What I'm trying to get is the vector of a phrase like 'cat like mammal'.
So far, using the pre-trained doc2vec model, I have tried the code below:

import gensim.models as g
model = "path/pre-trained doc2vec model.bin"
m = g.Doc2Vec.load(model)
oneword = 'cat'
phrase = 'cat like mammal'
oneword_vec = m[oneword]
phrase_vec = m[phrase]  # lookup by the whole phrase; this is the step that did not work

With this code, I could get a vector for the single word 'cat', but not for 'cat like mammal'.
That is because word2vec only provides vectors for single words like 'cat', right? (If I'm wrong, please correct me.)
So I searched, found infer_vector(), and tried the code below:

phrase = phrase.lower().split(' ')
phrase_vec = m.infer_vector(phrase)

With this code I could get a vector, but I get a different value every time I run
phrase_vec = m.infer_vector(phrase) again.
I think this is because infer_vector has 'steps'.

When I set steps=0, I get always same vector.
phrase_vec = m.infer_vector(phrase, steps=0)

However, I also read that a document vector can be obtained by averaging the vectors of the words in the document.
For example, if the document is composed of the three words 'cat like mammal', you add the vectors of 'cat', 'like' and 'mammal' and average them, and that is the document vector. (If I'm wrong, please correct me.)

So... here are some questions.

Is using infer_vector() with 0 steps the right way to get the vector of a phrase?
If averaging the word vectors is a correct way to get a document vector, is there no need to use infer_vector()?
What is model.docvecs for?
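
To make the averaging idea in this question concrete, here is a minimal sketch in plain Python. It shows the word-vector-averaging baseline only; as far as I understand, infer_vector() does something different (it trains a new document vector against the model), so the two should not be expected to agree. With steps=0 no training would take place, which probably means the returned vector is just an initial value, though I have not verified this. `average_vectors` is a hypothetical helper.

```python
def average_vectors(vectors):
    """Return the element-wise mean of a non-empty list of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dimension = len(vectors[0])
    if any(len(v) != dimension for v in vectors):
        raise ValueError("all vectors must have the same length")
    # zip(*vectors) groups the values of each dimension together
    return [sum(column) / len(vectors) for column in zip(*vectors)]


if __name__ == "__main__":
    # Stand-ins for the vectors of 'cat', 'like' and 'mammal'
    print(average_vectors([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]))
```

In practice the inputs would be the looked-up vectors of the words in the phrase, skipping words that are not in the vocabulary.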

AttributeError: 'Doc2Vec' object has no attribute 'neg_labels'

When I load your pre-trained doc2vec models, i.e. English Wikipedia DBOW (1.4GB) and Associated Press News DBOW (0.6GB), I get the error AttributeError: 'Doc2Vec' object has no attribute 'neg_labels'.

I also notice that in the loaded model, negative is set to 5 but no neg_labels are provided.

About the doc2vec pretrained_model

Hi, when I use the enwiki_dbow pretrained doc2vec model, I have a problem loading it.
(I decompressed enwiki_dbow first.)

model = gensim.models.Doc2Vec.load_word2vec_format( './enwiki_dbow/doc2vec.bin', binary=True)

UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte
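
A hint about the 0x80 byte in this error: pickle files written with protocol 2 or later begin with the byte 0x80 followed by the protocol number. Other reports on this page load doc2vec.bin with `Doc2Vec.load(...)`, which goes through gensim's unpickle routine, so the file may be a pickled model rather than a word2vec-format file. That is an inference, not something confirmed here. The sketch below only peeks at the first two bytes; `looks_like_pickle` is a hypothetical helper.

```python
import pickle


def looks_like_pickle(path):
    """True if the file starts with the pickle PROTO marker (0x80) and a valid protocol number."""
    with open(path, "rb") as f:
        head = f.read(2)
    return (
        len(head) == 2
        and head[0:1] == b"\x80"
        and 2 <= head[1] <= pickle.HIGHEST_PROTOCOL
    )
```

If this returns True for doc2vec.bin, `Doc2Vec.load(path)` is more likely to be the matching loader than `load_word2vec_format`.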

pretrained_emb argument is not recognized

Hi,

I am trying to use your code and to test it with the toy data. However, the pretrained_emb argument is not recognized. This is the code:

#python example to train doc2vec model (with or without pre-trained word embeddings)

import gensim.models as g
import logging

#doc2vec parameters
vector_size = 300
window_size = 15
min_count = 1
sampling_threshold = 1e-5
negative_size = 5
train_epoch = 100
dm = 0 #0 = dbow; 1 = dmpv
worker_count = 1 #number of parallel processes

#pretrained word embeddings
pretrained_emb = "toy_data/pretrained_word_embeddings.txt" #None if use without pretrained embeddings

#input corpus
train_corpus = "toy_data/train_docs.txt"

#output model
saved_path = "toy_data/model.bin"

#enable logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

#train doc2vec model
docs = g.doc2vec.TaggedLineDocument(train_corpus)

model = g.Doc2Vec(docs, size=vector_size, window=window_size, min_count=min_count, sample=sampling_threshold, workers=worker_count, hs=0, dm=dm, negative=negative_size, pretrained_emb=pretrained_emb,dbow_words=1, dm_concat=1, iter=train_epoch)

#save model
model.save(saved_path)

And this is the error:

Traceback (most recent call last):
  File "C:/Users/12714818_Admin/Desktop/CMCRC/Boundlss_2017/May-Aug/Context_including/Conversation_clustering/src/train_model.py", line 31, in <module>
    model = g.Doc2Vec(docs, size=vector_size, window=window_size, min_count=min_count, sample=sampling_threshold, workers=worker_count, hs=0, dm=dm, negative=negative_size, pretrained_emb=pretrained_emb,dbow_words=1, dm_concat=1, iter=train_epoch)
  File "C:\ProgramData\Anaconda3\envs\py27\lib\site-packages\gensim\models\doc2vec.py", line 625, in __init__
    **kwargs)
TypeError: __init__() got an unexpected keyword argument 'pretrained_emb'

I am using python 2.7.

vector_size = 300: Can I increase the vector size while using the Associated Press News DBOW (0.6GB)?

doc2vec/train_model.py

Line 7 in 158df84

vector_size = 300

In the line above, can I increase vector_size while fine-tuning with the Associated Press News DBOW (0.6GB)?

'Word2Vec' object has no attribute 'infer_vector'

I've cloned the repo, downloaded the pre-trained Wikipedia model, and installed Gensim via pip install git+https://github.com/jhlau/gensim.

Then I pasted the downloaded model files into the toy_data directory and changed the model line in the file to: model="toy_data/word2vec.bin".

However, when I run infer_test.py I get the following error:

Traceback (most recent call last):
  File "infer_test.py", line 25, in <module>
    output.write( " ".join([str(x) for x in m.infer_vector(d, alpha=start_alpha, steps=infer_epoch)]) + "\n" )
AttributeError: 'Word2Vec' object has no attribute 'infer_vector'
