Comments (2)
@gojomo Thanks for the very detailed reply. Now I feel completely stupid: I skimmed the function arguments up to `max_vocab_size`, was happy that I had found it, and never took a look at the argument right below it. Ugh. In any case, I think it would have made sense to keep the semantic hyper-parameters (i.e. `max_final_vocab`, `epochs`, etc.) separate from implementation details (`max_vocab_size` and the like), but no use crying over spilt milk, I guess.
I have also since realized why `min_count` is applied at the end, so that part indeed works as it should.
Thank you for the additional heads-up about worker count and the loss; I was aware of the first one, but not the second.
`max_vocab_size` is an awfully-named parameter with unintuitive effects during its mid-count pruning; in general it should not be used unless there's no other way to complete the vocabulary scan within your RAM. And even then, you'll want to set `max_vocab_size` to as large a value as your RAM allows, to minimize the count-fouling effects of mid-count pruning, rather than anything near the final vocabulary size you want.
As you note, the escalating floor simply means the next pruning will automatically discard all knowledge of tokens with fewer occurrences, at that pruning. Still, at the very end of the vocabulary survey, there could be any number of tokens, noted since the last prune, with tallied occurrences below that threshold. Then only those with frequency less than the `effective_min_count` (here just your specified `min_count`) will be dropped from the final surviving vocabulary.
So many tokens with interim counts up to that rising threshold will have been pruned during the many in-count prunes. Further, tokens with true counts far higher than that threshold, but below it at one or more of the prunings, will have artificially lower counts (because earlier interim tallies were discarded) and may wind up being ignored entirely (if their final interim count is below `min_count`).
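For illustration, here is a small hypothetical sketch of this escalating-floor pruning (not Gensim's actual implementation, just the mechanism described above), showing how a token's surviving count can be fouled:

```python
from collections import Counter

def scan_with_pruning(tokens, max_vocab_size):
    # Hypothetical sketch of word2vec-style mid-scan pruning, NOT Gensim's
    # actual code: whenever the interim vocabulary grows past max_vocab_size,
    # drop every token whose tally is below an escalating floor (min_reduce).
    counts = Counter()
    min_reduce = 1
    for tok in tokens:
        counts[tok] += 1
        if len(counts) > max_vocab_size:
            for t in [t for t, c in counts.items() if c < min_reduce]:
                del counts[t]
            min_reduce += 1
    return counts

# "rare" truly occurs 3 times, but its first occurrence is discarded by a
# mid-scan prune, so its surviving interim count is only 2 -- a min_count
# of 3 would then drop it from the final vocabulary entirely.
stream = ["rare"] + ["filler%d" % i for i in range(10)] + ["rare", "rare"]
pruned = scan_with_pruning(stream, max_vocab_size=5)
true_counts = Counter(stream)
```

Here `pruned["rare"]` comes out as 2 while `true_counts["rare"]` is 3, which is exactly the undercounting described above.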
Why is such a confusing parameter available? It matches the name, and behavior, of the same parameter in Google's original `word2vec.c` code release, upon which Gensim's implementation was originally closely based.
If you'd prefer a precise vocabulary count (and that won't exhaust your RAM), look into the `max_final_vocab` parameter instead. It only applies at the end of the vocabulary survey, by choosing an `effective_min_count` that's just large enough to come in just-under your chosen `max_final_vocab` (instead of arbitrarily far lower, as is common with the `max_vocab_size` parameter's pruning). Still, if you specified an explicit `min_count` higher than the `effective_min_count` that `max_final_vocab` required, your higher explicit `min_count` will be applied.
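A rough sketch of how such an `effective_min_count` could be derived from final tallies (again a hypothetical illustration of the rule described above, not Gensim's actual code):

```python
def pick_effective_min_count(counts, max_final_vocab, min_count):
    # Hypothetical sketch (not Gensim's actual code): choose the smallest
    # count floor that leaves at most max_final_vocab survivors, then
    # honor a higher explicit min_count if one was given.
    sorted_counts = sorted(counts.values(), reverse=True)
    floor = 1
    if len(sorted_counts) > max_final_vocab:
        # the floor must exceed the count of the first token past the limit
        floor = sorted_counts[max_final_vocab] + 1
    return max(floor, min_count)

counts = {"the": 100, "cat": 40, "sat": 40, "mat": 7, "zyx": 2}
emc = pick_effective_min_count(counts, max_final_vocab=3, min_count=5)
survivors = [w for w, c in counts.items() if c >= emc]
# emc is 8 here, so only "the", "cat", "sat" survive
```

Note that ties at the boundary can push the surviving vocabulary *under* `max_final_vocab` (never over), which is the "just-under" behavior described above.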
Two other unrelated things to watch out for, given your shown parameters:
- Even if you have more CPU cores, `workers` values higher than about 6-12 usually yield lower throughput than some value in that range, due to Python GIL contention and limitations of Gensim's iterable-corpus-mode "master reader thread that fans batches out to many worker threads" approach. Finding the exact number of threads that achieves optimal training throughput is a matter of trial and error, as it's affected by other parameters as well. But, as long as your corpus's token patterns are uniform throughout, the rates logged in the first few minutes of a run should reflect the consistent rate of the full run.
- The losses tracked by `compute_loss` have a bunch of caveats; see #2617 for an overview of what's amiss.
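The trial-and-error in the first point can be automated with a small timing harness. This is a generic sketch: `train_fn` is a stand-in for whatever builds and trains your model at a given `workers` count (the `Word2Vec` call in the comment is a hypothetical example of what you might pass in):

```python
import time

def best_workers(train_fn, candidates=(1, 2, 4, 6, 8, 12)):
    # Time a short training run at each candidate worker count and
    # return the fastest count, plus all timings for inspection.
    # train_fn is any callable taking a workers count, e.g.
    #   lambda n: Word2Vec(corpus_sample, workers=n, epochs=1)
    timings = {}
    for n in candidates:
        start = time.perf_counter()
        train_fn(n)
        timings[n] = time.perf_counter() - start
    return min(timings, key=timings.get), timings
```

Benchmark on a corpus sample large enough to run a few minutes per candidate, since (per the point above) the early logged rate is representative of the full run only when token patterns are uniform throughout.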