jhlau / doc2vec
Python scripts for training/testing paragraph vectors
License: Apache License 2.0
I understand that to use pretrained word embeddings when training the doc2vec models, we should install your forked version of gensim. However, I have failed to configure it properly with a C compiler (MinGW in my case; the problem could also be BLAS, etc.). I'm on Windows and am using Anaconda.
I tried installing straight from setup.py, using pip, and even created a conda package and installed using that. Each time, I did succeed in installing the forked version of gensim.
However, every time I tried the following commands in a Python shell:
>>> import gensim
>>> gensim.models.word2vec.FAST_VERSION
...the output was -1. This means that training would be very, very slow (about 70 times slower, if I recall correctly).
How do I get your version and still retain the link to the C compiler?
(If I install the currently distributed version using conda install gensim, it is linked to my MinGW.)
Line 14 in 158df84
@jhlau Hey Jey, I hope you're doing well. This is Gan. I was trying to use your forked version and load a pre-trained word vector file, wiki-news-300d-1M.vec, from https://fasttext.cc/docs/en/english-vectors.html; however, I'm getting the error TypeError: don't know how to handle uri, and I think it comes from the smart_open function. I'm training on a very small corpus, so I think it may be better to initialize with the pre-trained vectors.
Following is the code:
vector_size = 100
window_size = 10
min_count = 1
sampling_threshold = 1e-5
negative_size = 5
train_epoch = 50
dm = 0 #0 = dbow; 1 = dmpv
worker_count = 4 #number of parallel processes
wordvec = "/Users/ggao/Downloads/wiki-news-300d-1M.vec"
import io
def load_vectors(fname):
    fin = io.open(fname, 'r', encoding='utf-8', newline='\n', errors='ignore')
    n, d = map(int, fin.readline().split())
    data = {}
    for line in fin:
        tokens = line.rstrip().split(' ')
        # list() so the values are stored, not lazy map objects (Python 3)
        data[tokens[0]] = list(map(float, tokens[1:]))
    return data
pretrained_emb = load_vectors(wordvec)
pre_model = gensim.models.doc2vec.Doc2Vec(documents=train_corpus, dm=dm, size=vector_size, window=window_size, min_count=min_count, sample=sampling_threshold, negative=negative_size, workers=worker_count, pretrained_emb=pretrained_emb, iter=train_epoch)
Do you have any idea what's wrong here?
Thank you so much and look forward to your reply.
Best,
Gan
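One possible reading of the error above, offered as an assumption rather than a confirmed diagnosis: the train_model.py script quoted further down this page sets pretrained_emb to a file path, whereas the code above passes the dictionary returned by load_vectors, which smart_open may not know how to open. The vector_size of 100 also differs from the 300 dimensions suggested by the file name. Below is a minimal sketch for reading the header of a word2vec-style text file before passing its path; the helper name read_vec_header and the demo file are hypothetical.

```python
import io
import os
import tempfile

def read_vec_header(path):
    """Return (vocab_size, dim) from the 'count dim' header line."""
    with io.open(path, "r", encoding="utf-8", errors="ignore") as fin:
        vocab_size, dim = map(int, fin.readline().split())
    return vocab_size, dim

# Demo on a tiny stand-in file (not the real wiki-news-300d-1M.vec).
demo_path = os.path.join(tempfile.mkdtemp(), "demo.vec")
with io.open(demo_path, "w", encoding="utf-8") as fout:
    fout.write(u"2 3\ncat 0.1 0.2 0.3\ndog 0.4 0.5 0.6\n")

vocab_size, dim = read_vec_header(demo_path)
vector_size = 100  # value taken from the post above
if dim != vector_size:
    print("dimension mismatch: file has %d, model expects %d" % (dim, vector_size))
```

If the dimensions agree, the path itself (here demo_path) would be the value to try for pretrained_emb, assuming the fork expects a path as train_model.py suggests.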
Dear doc2vec developers,
I have been trying to use your code to build my own DBOW model based on pre-trained word embeddings from the Google News corpus. The pre-trained word embeddings were loaded successfully.
Can you give me some hints about where things could have gone wrong?
Thanks a lot in advance.
Best regards,
Susie
Hi Jhlau,
I'm Quan Van Phu, a student at Hanoi University of Science and Technology.
Thank you for sharing the Pre-Trained Word2Vec Models: English Wikipedia Skip-gram (1.4GB). I have two questions about this model; could you please help me answer them?
Can you tell me about your corpus size?
For instance: ? billion tokens, ? distinct English words.
And the model performance:
SimLex999 = ?
Google Analogy = ?
I hope you can help me answer these questions as soon as possible.
Hi, what labels should be used to reference a document vector in your pre-trained doc2vec model? Although len(m.docvecs) = 35556952, m.docvecs[0] leads to IndexError: list index out of range.
Thanks
How can I save the model in non-binary format?
Thank you.
Hi, I'm using the forked version of gensim, but when I load the model I get this error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf7 in position 0: ordinal not in range(128)
I tried to encode the name of the model like this:
model = g.Doc2Vec.load(model_path.encode('utf-8'))
But then I get this error:
File "C:\Users\fanta\Desktop\gensim-develop\gensim\utils.py", line 311, in _adapt_by_suffix
if fname.endswith('.gz') or fname.endswith('.bz2'):
TypeError: endswith first arg must be bytes or a tuple of bytes, not str
What must I do to solve this error?
Thanks.
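The second error above appears to come from passing a bytes path under Python 3 to code that compares it against str suffixes, so encoding the file name would not address the original UnicodeDecodeError. A minimal sketch reproducing that TypeError outside gensim follows; the helper has_compressed_suffix is hypothetical and only mirrors the check shown in the traceback.

```python
model_path = "doc2vec.bin"  # hypothetical file name

def has_compressed_suffix(fname):
    # Same comparison as in the traceback: str suffixes against fname.
    return fname.endswith('.gz') or fname.endswith('.bz2')

# A str path works with str suffixes.
str_result = has_compressed_suffix(model_path)

# A bytes path cannot be compared against str suffixes in Python 3.
error_message = None
try:
    has_compressed_suffix(model_path.encode('utf-8'))
except TypeError as err:
    error_message = str(err)

print(str_result)
print(error_message)
```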
Hi,
Thanks for sharing your pre-trained models. They are the only publicly available models afaik.
However, they are not easily loadable in newer gensim versions, such as the latest, 2.3.0.
Do you have a working method for this? Otherwise, could you share the parameters that you found best for creating a general pretrained model for the en-wiki corpus?
When I try to run train_model.py, I get the following error.
python train_model.py
Traceback (most recent call last):
File "train_model.py", line 30, in <module>
model = g.Doc2Vec(docs, size=vector_size, window=window_size, min_count=min_count, sample=sampling_threshold, workers=worker_count, hs=0, dm=dm, negative=negative_size, dbow_words=1, dm_concat=1, pretrained_emb=pretrained_emb, iter=train_epoch)
File "/usr/local/lib/python2.7/dist-packages/gensim-0.13.2-py2.7-linux-x86_64.egg/gensim/models/doc2vec.py", line 607, in __init__
null_word=dm_concat, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'pretrained_emb'
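The traceback above points at a gensim-0.13.2 egg in dist-packages, which may mean that a stock gensim rather than the fork is being imported. One way to check, sketched under that assumption, is to inspect whether any constructor in the class hierarchy names the keyword explicitly. The helper init_accepts and the four stand-in classes below are hypothetical and do not reproduce real gensim signatures; with gensim installed, the same helper could be pointed at the imported Doc2Vec class.

```python
import inspect

def init_accepts(cls, name):
    """Return True if any __init__ in the class hierarchy names `name`."""
    for klass in cls.__mro__:
        init = klass.__dict__.get("__init__")
        if init is None:
            continue
        try:
            params = inspect.signature(init).parameters
        except (TypeError, ValueError):
            continue  # some built-in callables expose no signature
        if name in params:
            return True
    return False

# Stand-ins: a "stock" hierarchy that forwards **kwargs to a base class
# without the keyword, and a "fork" hierarchy whose base class accepts it.
class StockWord2Vec(object):
    def __init__(self, size=100, window=5):
        pass

class StockDoc2Vec(StockWord2Vec):
    def __init__(self, documents=None, dm=1, **kwargs):
        super().__init__(**kwargs)

class ForkWord2Vec(object):
    def __init__(self, size=100, window=5, pretrained_emb=None):
        pass

class ForkDoc2Vec(ForkWord2Vec):
    def __init__(self, documents=None, dm=1, **kwargs):
        super().__init__(**kwargs)

print(init_accepts(StockDoc2Vec, "pretrained_emb"))
print(init_accepts(ForkDoc2Vec, "pretrained_emb"))
```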
I have trained a word2vec model on Chinese Wikipedia. I would like to use this method to train on the Chinese Wikipedia dataset on Ubuntu. What should I do?
In the newest version of Gensim (3.8.0), I was surprised to find that the "pretrained_emb=" parameter worked well. I've read the source code but couldn't find anything related to this parameter...
My question is: do pretrained embeddings work like a lookup table? When a document is trained, the words that appear both in the document and in pretrained_emb would be initialized with the pretrained vectors, while words that are not in pretrained_emb are initialized randomly (correct me if I'm wrong).
If so, the pretrained and the randomly initialized embeddings are then trained together to convergence. That would certainly converge much faster, but I wonder whether, theoretically, the result is better than with word embeddings all initialized randomly from scratch (I've read your paper, and it seems that the results do get better in practice).
Another question: if my pretrained_emb is large enough, say it covers most of the vocabulary of the documents I want to train on, can I just use the pretrained embeddings for inference? E.g., extract the words that are in the pretrained embeddings to represent the document, lock these vectors, and train only the document ID to get the document ID embedding?
Thanks for your work! I would really appreciate it if you could answer!
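The lookup-table scheme described in the question above can be sketched as follows. This only illustrates the questioner's description and is not confirmed to match what the forked gensim does; the function name init_word_vectors, the toy vocabulary, and the small uniform range used for random initialization are assumptions.

```python
import random

def init_word_vectors(vocab, pretrained, dim, seed=0):
    """Initialize each vocabulary word from `pretrained` when available,
    otherwise with small uniform random values."""
    rng = random.Random(seed)
    vectors = {}
    for word in vocab:
        if word in pretrained:
            vectors[word] = list(pretrained[word])  # copy the pretrained vector
        else:
            vectors[word] = [(rng.random() - 0.5) / dim for _ in range(dim)]
    return vectors

pretrained = {"cat": [0.1, 0.2, 0.3], "dog": [0.4, 0.5, 0.6]}
vocab = ["cat", "zebra"]
vectors = init_word_vectors(vocab, pretrained, dim=3)
print(vectors["cat"])    # taken from the pretrained table
print(vectors["zebra"])  # random, since "zebra" is not in the table
```

After this step, all vectors would be updated together during training; freezing the pretrained ones, as the last question proposes, would be a separate design choice.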
I tried to run your train_model.py with my pretrained word2vec model and got this error. I then tried the pretrained word2vec AP News skip-gram and English Wikipedia models, but the error keeps occurring.
Are there any more steps I need to take to run train_model.py?
P.S. I am already using your forked gensim too.
First, thanks for sharing your pretrained vectors.
I downloaded the pre-trained doc2vec model
and ran this piece of code:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
model = Doc2Vec.load("doc2vec.bin ")
I get this error:
AttributeError: 'Doc2Vec' object has no attribute 'batch_words'
even though doc2vec.bin, doc2vec.bin.syn0.npy, and doc2vec.bin.syn1neg.npy
are in the same folder at the root.
What is the time complexity of the doc2vec model? It is very computationally efficient.
Hey. I've been trying to use your pre-trained model using the AP corpus, but I get an error on unpickling:
> python infer_test.py
/home/andy/anaconda3/lib/python3.5/site-packages/gensim/utils.py:1015: UserWarning: Pattern library is not installed, lemmatization won't be available.
warnings.warn("Pattern library is not installed, lemmatization won't be available.")
Traceback (most recent call last):
File "infer_test.py", line 15, in <module>
m = g.Doc2Vec.load(model)
File "/home/andy/anaconda3/lib/python3.5/site-packages/gensim/models/word2vec.py", line 1762, in load
model = super(Word2Vec, cls).load(*args, **kwargs)
File "/home/andy/anaconda3/lib/python3.5/site-packages/gensim/utils.py", line 248, in load
obj = unpickle(fname)
File "/home/andy/anaconda3/lib/python3.5/site-packages/gensim/utils.py", line 912, in unpickle
return _pickle.loads(f.read())
UnicodeDecodeError: 'ascii' codec can't decode byte 0xfb in position 1: ordinal not in range(128)
Given your use of the codecs package, I guess you're using Python 2? Any chance you could build a Python 3 version?
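The UnicodeDecodeError above is consistent with Python 3 unpickling byte strings that were pickled under Python 2, since pickle decodes such strings as ASCII by default. A later traceback on this page shows gensim calling _pickle.load with encoding='latin1', which is the usual workaround. Here is a minimal sketch of the effect using a hand-written pickle rather than the real model file:

```python
import pickle

# Hand-written protocol-0 pickle of a Python 2 byte string containing 0xfb.
py2_pickle = b"S'\\xfb'\np0\n."

# Default behaviour: the byte string is decoded as ASCII and fails.
try:
    pickle.loads(py2_pickle)
    default_failed = False
except UnicodeDecodeError:
    default_failed = True

# With latin1, every byte maps to a code point, so decoding succeeds.
decoded = pickle.loads(py2_pickle, encoding="latin1")

print(default_failed)
print(repr(decoded))
```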
Any advice on the following will be greatly appreciated.
I am attempting to load one of the pretrained models, via this code:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
model = Doc2Vec.load('doc2vec.bin')
The stack trace and error:
Traceback (most recent call last):
File "/Users/bruceschechter/Dropbox/dev/pycharm/p056 gensim explore/main.py", line 13, in <module>
main()
File "/Users/bruceschechter/Dropbox/dev/pycharm/p056 gensim explore/main.py", line 8, in main
model = Doc2Vec.load('doc2vec.bin')
File "/opt/homebrew/lib/python3.9/site-packages/gensim/models/doc2vec.py", line 813, in load
raise ae
File "/opt/homebrew/lib/python3.9/site-packages/gensim/models/doc2vec.py", line 807, in load
return super(Doc2Vec, cls).load(*args, rethrow=True, **kwargs)
File "/opt/homebrew/lib/python3.9/site-packages/gensim/models/word2vec.py", line 1937, in load
raise ae
File "/opt/homebrew/lib/python3.9/site-packages/gensim/models/word2vec.py", line 1930, in load
model = super(Word2Vec, cls).load(*args, **kwargs)
File "/opt/homebrew/lib/python3.9/site-packages/gensim/utils.py", line 485, in load
obj = unpickle(fname)
File "/opt/homebrew/lib/python3.9/site-packages/gensim/utils.py", line 1460, in unpickle
return _pickle.load(f, encoding='latin1') # needed because loading from S3 doesn't support readline()
AttributeError: Can't get attribute 'DocvecsArray' on <module 'gensim.models.doc2vec' from '/opt/homebrew/lib/python3.9/site-packages/gensim/models/doc2vec.py'>
My config...
MBPro M1, MacOS 12.3.1
Python 3.9.12. (via Homebrew)
pip 22.0.4
Gensim 4.1.2
Thanks!
Hi, I want to extract doc2vec features for the sentences in MS COCO, but I'm not quite sure how the preprocessing is performed.
The paper says that the articles are tokenised and lowercased using Stanford CoreNLP. From the files under toy_data/ and the two py files, I guess that each article is squashed into a single line in the *_docs.txt files. But these two files are already processed.
I've now installed Stanford CoreNLP and can call it from the command line. After concatenating the 5 sentences for a COCO image (separated by a space), treating the result as an article, and writing it into input.txt, I call it like this:
java edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit -outputFormat conll -output.columns word -file input.txt
However, the output is not lowercased. How should I modify the command to enable lowercasing?
By the way, there are other tokenization options shown here, like americanize. Did you use them when training the doc2vec model? If possible, I hope you can provide the details of your preprocessing method.
Thanks
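One workaround for the lowercasing question above is to post-process the tokenizer output instead of changing the CoreNLP command. The sketch below assumes that the conll output with a single word column has one token per line and a blank line between sentences; the function name conll_to_line and the sample text are hypothetical, and this is not confirmed to be the preprocessing used for the released models.

```python
def conll_to_line(conll_text):
    """Join one-token-per-line text into a single lowercased line."""
    tokens = [line.strip().lower()
              for line in conll_text.splitlines()
              if line.strip()]  # skip blank sentence separators
    return " ".join(tokens)

# Stand-in for the tokenizer output of two short sentences.
sample = "A\ncat\nsits\n.\n\nDogs\nbark\n.\n"
doc_line = conll_to_line(sample)
print(doc_line)
```

Each resulting line would then be one document in a *_docs.txt-style file.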
I was wondering whether the original doctags exist for the pre-trained wiki doc2vec model. I know that the file is probably very large, but I'm hoping on the off-chance that you know where I could download it. Thanks!
Hello, thank you very much for sharing these models. I'm getting this error when I run the infer_test.py code, even though I'm using your forked version of gensim. Do you have any suggestions, please?
Is it preferable to perform stemming or stop word removal before feeding in the data when using the pre-trained DBOW model?
Can you tell me where this particular file is located? I can't seem to find the word2vec.bin file.
Hi.
I'm loading the pretrained Doc2Vec model from English Wikipedia (I have the .bin, syn0, and syn1 files in the same folder). But when I load the model I get the error: 'Doc2Vec' object has no attribute 'batch_words'. The problem seems to be due to different versions of gensim. Could you help me, please?
Here is the complete error message:
AttributeError Traceback (most recent call last)
in ()
----> 1 model = g.Doc2Vec.load(model_path)
2 test_docs = [ x.strip().split() for x in codecs.open(test_docs, "r", "utf-8").readlines() ]
/usr/local/lib/python2.7/dist-packages/gensim/models/doc2vec.pyc in load(cls, *args, **kwargs)
691 logger.info('Model saved using code from earlier Gensim Version. Re-loading old model in a compatible way.')
692 from gensim.models.deprecated.doc2vec import load_old_doc2vec
--> 693 return load_old_doc2vec(*args, **kwargs)
694
695 def estimate_memory(self, vocab_size=None, report=None):
/usr/local/lib/python2.7/dist-packages/gensim/models/deprecated/doc2vec.pyc in load_old_doc2vec(*args, **kwargs)
107 'iter': old_model.iter,
108 'sorted_vocab': old_model.sorted_vocab,
--> 109 'batch_words': old_model.batch_words,
110 'compute_loss': old_model.__dict__.get('compute_loss', None)
111 }
AttributeError: 'Doc2Vec' object has no attribute 'batch_words'
This is the code
#parameters
model_path="myfolder/test/doc2vec.bin"
test_docs="myfolder/test/test.txt"
output_file="myfolder/test/test_vectors.txt"
#inference hyper-parameters
start_alpha=0.01
infer_epoch=1000
#load model
model = g.Doc2Vec.load(model_path)
I have been using Google Colab for my project with Python 3.6. The pretrained "English Wikipedia" model loads perfectly and even works fine for my downstream task. But I want to know whether it will continue to load the same way, without any problems, in the future.
Please help me here.
Should the format/extension of the pre-trained word embeddings in the line below always be '.txt'?
Line 17 in 158df84
Hello, I'm very new to doc2vec and have some questions about document vectors.
What I'm trying to get is a vector for a phrase like 'cat like mammal'.
What I've tried so far, using the pre-trained doc2vec model, is the code below:
import gensim.models as g
model = "path/pre-trained doc2vec model.bin"
m = g.Doc2Vec.load(model)
oneword = 'cat'
phrase = 'cat like mammal'
oneword_vec = m[oneword]
phrase_vec = m[phrase]
When I tried this code, I could get a vector for the single word 'cat', but not for 'cat like mammal'.
That is because word2vec only provides vectors for single words like 'cat', right? (If I'm wrong, please correct me.)
So I searched, found infer_vector(), and tried the code below:
phrase = phrase.lower().split(' ')
phrase_vec = m.infer_vector(phrase)
When I tried this code, I could get a vector, but I get a different value every time I run
phrase_vec = m.infer_vector(phrase) again.
I think this is because infer_vector has 'steps'.
When I set steps=0, I always get the same vector.
phrase_vec = m.infer_vector(phrase, steps=0)
However, I also found that a document vector can be obtained by averaging the vectors of the words in the document.
For example, if the document consists of the three words 'cat like mammal', you add the vectors of 'cat', 'like', and 'mammal' and then average them, and that would be the document vector. (If I'm wrong, please correct me.)
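The averaging described above can be sketched as follows. It is a common baseline for phrase vectors and is distinct from infer_vector, which trains a new vector over several steps; the toy two-dimensional word vectors and the helper name average_vectors are assumptions made for illustration.

```python
def average_vectors(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

# Toy word vectors standing in for m['cat'], m['like'], m['mammal'].
word_vectors = {
    "cat": [1.0, 0.0],
    "like": [0.0, 1.0],
    "mammal": [2.0, 2.0],
}

phrase = "cat like mammal".split()
phrase_vec = average_vectors([word_vectors[w] for w in phrase])
print(phrase_vec)
```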
So... here are some questions.
When I load your pre-trained doc2vec models, namely English Wikipedia DBOW (1.4GB) and Associated Press News DBOW (0.6GB), the error AttributeError: 'Doc2Vec' object has no attribute 'neg_labels' occurs.
I also notice that in the loaded model, negative is set to 5 but no neg_labels are provided.
Hi, when I use the enwiki_dbow pretrained doc2vec model, I have a problem loading the model.
(I decompressed enwiki_dbow first.)
model = gensim.models.Doc2Vec.load_word2vec_format( './enwiki_dbow/doc2vec.bin', binary=True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte
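The byte 0x80 at position 0 in the error above matches the marker that opens a pickle of protocol 2 or later. That may mean doc2vec.bin is a pickled model meant for Doc2Vec.load, as infer_test.py does, rather than a word2vec-format file for load_word2vec_format. Below is a minimal sketch of telling the two apart from the first bytes; both helper names are hypothetical.

```python
import pickle

def looks_like_pickle(first_bytes):
    """Pickle protocol 2 and later start with the byte 0x80."""
    return first_bytes[:1] == b"\x80"

def looks_like_word2vec_header(first_line):
    """word2vec-format files start with an ASCII 'vocab_size dim' line."""
    parts = first_line.split()
    return len(parts) == 2 and all(p.isdigit() for p in parts)

# Stand-ins for the first bytes of each kind of file.
pickled = pickle.dumps({"a": 1}, protocol=2)
w2v_header = b"3000000 300\n"

print(looks_like_pickle(pickled), looks_like_word2vec_header(pickled[:12]))
print(looks_like_pickle(w2v_header), looks_like_word2vec_header(w2v_header))
```

On a real file, the first line could be read with open(path, "rb").readline() and passed to both helpers.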
Hi,
I am trying to use your code and to test it with the toy data. However, the pretrained_emb argument is not recognized. This is the code:
#python example to train doc2vec model (with or without pre-trained word embeddings)
import gensim.models as g
import logging
#doc2vec parameters
vector_size = 300
window_size = 15
min_count = 1
sampling_threshold = 1e-5
negative_size = 5
train_epoch = 100
dm = 0 #0 = dbow; 1 = dmpv
worker_count = 1 #number of parallel processes
#pretrained word embeddings
pretrained_emb = "toy_data/pretrained_word_embeddings.txt" #None if use without pretrained embeddings
#input corpus
train_corpus = "toy_data/train_docs.txt"
#output model
saved_path = "toy_data/model.bin"
#enable logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
#train doc2vec model
docs = g.doc2vec.TaggedLineDocument(train_corpus)
model = g.Doc2Vec(docs, size=vector_size, window=window_size, min_count=min_count, sample=sampling_threshold, workers=worker_count, hs=0, dm=dm, negative=negative_size, pretrained_emb=pretrained_emb,dbow_words=1, dm_concat=1, iter=train_epoch)
#save model
model.save(saved_path)
And this is the error:
Traceback (most recent call last):
File "C:/Users/12714818_Admin/Desktop/CMCRC/Boundlss_2017/May-Aug/Context_including/Conversation_clustering/src/train_model.py", line 31, in <module>
model = g.Doc2Vec(docs, size=vector_size, window=window_size, min_count=min_count, sample=sampling_threshold, workers=worker_count, hs=0, dm=dm, negative=negative_size, pretrained_emb=pretrained_emb,dbow_words=1, dm_concat=1, iter=train_epoch)
File "C:\ProgramData\Anaconda3\envs\py27\lib\site-packages\gensim\models\doc2vec.py", line 625, in __init__
**kwargs)
TypeError: __init__() got an unexpected keyword argument 'pretrained_emb'
I am using python 2.7.
Line 7 in 158df84
I've cloned the repo, downloaded the pre-trained Wikipedia model, and installed Gensim via pip install git+https://github.com/jhlau/gensim.
Then I pasted the downloaded model files into the toy_data directory and changed the model line in the file to: model="toy_data/word2vec.bin".
However, when I run infer_test.py I get the following error:
Traceback (most recent call last):
File "infer_test.py", line 25, in <module>
output.write( " ".join([str(x) for x in m.infer_vector(d, alpha=start_alpha, steps=infer_epoch)]) + "\n" )
AttributeError: 'Word2Vec' object has no attribute 'infer_vector'
Thank you for this excellent work!
Are you guys planning on releasing a model trained on tweets?
Does this Wikipedia skip-gram model support the Mac?
Hi,
I want to use pretrained word embeddings with a larger BOW size. How can I get them? I found your answer to a Stack Overflow question saying that the .txt file should be in the text format of the C word2vec tool. Can you say more about how to get the C word2vec format?
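For reference, the text format written by the C word2vec tool is a header line with the vocabulary size and the dimension, followed by one line per word holding the word and its space-separated values. A minimal sketch of writing that layout from an in-memory dictionary follows; the function name write_word2vec_text and the toy vectors are hypothetical.

```python
import io
import os
import tempfile

def write_word2vec_text(path, vectors):
    """Write {word: [floats]} in word2vec text format:
    a 'vocab_size dim' header, then 'word v1 v2 ...' per line."""
    dim = len(next(iter(vectors.values())))
    with io.open(path, "w", encoding="utf-8") as fout:
        fout.write(u"%d %d\n" % (len(vectors), dim))
        for word, vec in vectors.items():
            values = " ".join("%.6f" % x for x in vec)
            fout.write(u"%s %s\n" % (word, values))

toy_vectors = {"cat": [0.1, 0.2, 0.3], "dog": [0.4, 0.5, 0.6]}
out_path = os.path.join(tempfile.mkdtemp(), "pretrained_word_embeddings.txt")
write_word2vec_text(out_path, toy_vectors)

with io.open(out_path, "r", encoding="utf-8") as fin:
    lines = fin.read().splitlines()
print(lines)
```

For vectors already held in gensim, save_word2vec_format with binary=False should produce the same layout, so a hand-written exporter like this is mainly useful for embeddings that come from other tools.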