
Comments (2)

DavidNemeskey commented on May 30, 2024

@gojomo Thanks for the very detailed reply. Now I feel completely stupid: I skimmed the function arguments up to max_vocab_size, was happy that I had found it, and never looked at the argument right below it. Ungh. In any case, I think it would have made sense to separate the semantic hyperparameters (e.g. max_final_vocab, epochs) from implementation details (max_vocab_size and the like), but no use crying over spilt milk, I guess.

I have also since realized why min_count is applied only at the end, so that part indeed works as it should.

Thank you for the additional heads-up about the worker count and the loss; I was aware of the first one, but not the second.


gojomo commented on May 30, 2024

max_vocab_size is an awfully-named parameter with unintuitive effects during its mid-count pruning; in general it should not be used unless there's no other way to complete the vocabulary scan within your RAM. And even then, you'll want to set max_vocab_size to as large a value as your RAM allows, to minimize the count-fouling effects of mid-count pruning, rather than anything near the final vocabulary size you want.
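For example, a minimal sketch of that advice (the corpus and the 20M cap are stand-ins; only max_vocab_size and min_count are the real Gensim parameters under discussion):

```python
from gensim.models import Word2Vec

# Toy stand-in corpus; substitute your real iterable of token lists.
corpus = [["hello", "world", "gensim"]] * 100

model = Word2Vec(
    corpus,
    max_vocab_size=20_000_000,  # stand-in value: as large as RAM permits, NOT the desired final size
    min_count=5,                # the only intentional frequency floor
)
```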

As you note, the escalating floor simply means the next pruning will automatically discard all knowledge of tokens with fewer occurrences, at that pruning. Still, at the very end of the vocabulary survey, there could be any number of tokens, noted since the last prune, with tallied occurrences below that threshold. Of those, only the ones with frequency less than the effective_min_count (here just your specified min_count) will be dropped from the final surviving vocabulary.

So many tokens with interim counts up to that rising threshold will have been pruned during the many mid-count prunes. Further, tokens with true counts far higher than that threshold, but below it at one or more of the prunings, will have artificially lower counts (because earlier interim tallies were discarded) and may wind up being ignored entirely (if their final interim count is below min_count).
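A toy simulation of that mechanism (not Gensim's actual code, just the same escalating-floor idea) shows how a token spread thinly across a corpus can end up with an interim count far below its true count:

```python
from collections import defaultdict

def scan_with_pruning(tokens, max_vocab_size):
    """Tally tokens, pruning rare ones whenever the vocab outgrows the cap."""
    vocab = defaultdict(int)
    min_reduce = 1  # escalating floor: raised after every prune
    for token in tokens:
        vocab[token] += 1
        if len(vocab) > max_vocab_size:
            for t in [t for t, n in vocab.items() if n < min_reduce]:
                del vocab[t]  # all knowledge of t's tally so far is lost
            min_reduce += 1
    return vocab

# "spread" truly occurs 10 times, but its interim tally is repeatedly discarded.
block = ["common"] * 50 + ["rare_%d" % i for i in range(20)] + ["spread"]
counts = scan_with_pruning(block * 10, max_vocab_size=10)
print(counts.get("spread", 0), "vs true count", (block * 10).count("spread"))
```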

Why is such a confusing parameter available at all? It matches the name and behavior of the same parameter in Google's original word2vec.c code release, upon which Gensim's implementation was originally closely based.

If you'd prefer a precise vocabulary count (and that won't exhaust your RAM), look into the max_final_vocab parameter instead. It only applies at the end of the vocabulary survey, by choosing an effective_min_count large enough to come in just under your chosen max_final_vocab (instead of arbitrarily far lower, as is common with max_vocab_size's pruning). Still, if you specified an explicit min_count higher than the effective_min_count that max_final_vocab required, your higher explicit min_count will be applied.
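A minimal sketch of that alternative (the corpus and cap value are stand-ins; max_final_vocab and min_count are the real Gensim parameters):

```python
from gensim.models import Word2Vec

corpus = [["hello", "world"]] * 10  # stand-in corpus

model = Word2Vec(
    corpus,
    min_count=5,              # explicit floor, honored if higher than the implied one
    max_final_vocab=100_000,  # cap applied only at the end of the survey
)
print(len(model.wv))  # surviving vocabulary size
```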

Two other unrelated things to watch out for, given your shown parameters:

  • even if you have more CPU cores, worker values higher than about 6-12 usually give lower throughput than some value in that range, due to Python GIL contention & limitations of Gensim's iterable-corpus-mode "master-reader-thread that fans batches out to many worker threads" approach. Finding the exact number of threads that achieves optimal training throughput is a matter of trial-and-error (see the sketch after this list), as it's affected by other parameters as well. But, as long as your corpus's token patterns are uniform throughout, the logged rates in the first few minutes of a run should reflect the consistent rate of the full run.
  • the losses tracked by compute_loss have a bunch of caveats; see #2617 for an overview of what's amiss.
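A hedged sketch of that trial-and-error workers tuning (corpus, sizes, and candidate counts are all stand-ins; timings on your real corpus and hardware are what matter):

```python
import time
from gensim.models import Word2Vec

# Stand-in corpus: 5,000 sentences of 100 tokens over a 1,000-token vocabulary.
corpus = [["token%d" % ((s * 100 + i) % 1000) for i in range(100)] for s in range(5000)]

for workers in (2, 4, 6, 8, 12):
    t0 = time.time()
    Word2Vec(corpus, vector_size=100, epochs=1, min_count=1, workers=workers)
    print("workers=%d: %.1fs" % (workers, time.time() - t0))
```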

