
Comments (2)

DavidNemeskey commented on May 30, 2024

@gojomo Thanks for the very detailed reply. Now I feel completely stupid: I skimmed the function arguments up to max_vocab_size, was happy that I had found it, and never looked at the argument right below it. Ungh. In any case, I think it would have made sense to separate the semantic hyperparameters (e.g. max_final_vocab, epochs) from implementation details (max_vocab_size and the like), but no use crying over spilt milk, I guess.

I have also since realized why min_count is applied only at the end, so that part indeed works as it should.

Thank you for the additional heads-up about the worker count and the loss; I was aware of the first one, but not the second.


gojomo commented on May 30, 2024

max_vocab_size is an awfully-named parameter with unintuitive effects during its mid-count pruning; in general it should not be used unless there's no other way to complete the vocabulary scan within your RAM. And even then, you'll want to set max_vocab_size to as large a value as your RAM allows, to minimize the count-fouling effects of mid-count pruning, rather than anything near the final vocabulary size you want.
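For example, a minimal sketch of that advice (the corpus and the 20M cap are stand-ins; only max_vocab_size and min_count are the real Gensim parameters under discussion):

```python
from gensim.models import Word2Vec

# Toy stand-in corpus; substitute your real iterable of token lists.
corpus = [["hello", "world", "gensim"]] * 100

model = Word2Vec(
    corpus,
    max_vocab_size=20_000_000,  # stand-in value: as large as RAM permits, NOT the desired final size
    min_count=5,                # the only intentional frequency floor
)
```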

As you note, the escalating floor simply means the next pruning will automatically discard all knowledge of tokens with fewer occurrences, at that pruning. Still, at the very end of the vocabulary survey, there could be any number of tokens, noted since the last prune, with tallied occurrences below that threshold. Of those, only the ones with frequency less than the effective_min_count (here just your specified min_count) will be dropped from the final surviving vocabulary.

So many tokens with interim counts up to that rising threshold will have been pruned during the many mid-count prunes. Further, tokens with true counts far higher than that threshold, but below it at one or more of the prunings, will have artificially lower counts (because earlier interim tallies were discarded) and may wind up being ignored entirely (if their final interim count is below min_count).
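A toy simulation of that mechanism (not Gensim's actual code, just the same escalating-floor idea) shows how a token spread thinly across a corpus can end up with an interim count far below its true count:

```python
from collections import defaultdict

def scan_with_pruning(tokens, max_vocab_size):
    """Tally tokens, pruning rare ones whenever the vocab outgrows the cap."""
    vocab = defaultdict(int)
    min_reduce = 1  # escalating floor: raised after every prune
    for token in tokens:
        vocab[token] += 1
        if len(vocab) > max_vocab_size:
            for t in [t for t, n in vocab.items() if n < min_reduce]:
                del vocab[t]  # all knowledge of t's tally so far is lost
            min_reduce += 1
    return vocab

# "spread" truly occurs 10 times, but its interim tally is repeatedly discarded.
block = ["common"] * 50 + ["rare_%d" % i for i in range(20)] + ["spread"]
counts = scan_with_pruning(block * 10, max_vocab_size=10)
print(counts.get("spread", 0), "vs true count", (block * 10).count("spread"))
```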

Why is such a confusing parameter available at all? It matches the name and behavior of the same parameter in Google's original word2vec.c code release, upon which Gensim's implementation was originally closely based.

If you'd prefer a precise vocabulary count (and that won't exhaust your RAM), look into the max_final_vocab parameter instead. It only applies at the end of the vocabulary survey, by choosing an effective_min_count large enough to come in just under your chosen max_final_vocab (instead of arbitrarily far lower, as is common with max_vocab_size's pruning). Still, if you specified an explicit min_count higher than the effective_min_count that max_final_vocab required, your higher explicit min_count will be applied.
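A minimal sketch of that alternative (the corpus and cap value are stand-ins; max_final_vocab and min_count are the real Gensim parameters):

```python
from gensim.models import Word2Vec

corpus = [["hello", "world"]] * 10  # stand-in corpus

model = Word2Vec(
    corpus,
    min_count=5,              # explicit floor, honored if higher than the implied one
    max_final_vocab=100_000,  # cap applied only at the end of the survey
)
print(len(model.wv))  # surviving vocabulary size
```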

Two other unrelated things to watch out for, given your shown parameters:

  • even if you have more CPU cores, worker values higher than about 6-12 usually give lower throughput than some value in that range, due to Python GIL contention & limitations of Gensim's iterable-corpus-mode "master-reader-thread that fans batches out to many worker threads" approach. Finding the exact number of threads that achieves optimal training throughput is a matter of trial-and-error (see the sketch after this list), as it's affected by other parameters as well. But, as long as your corpus's token patterns are uniform throughout, the logged rates in the first few minutes of a run should reflect the consistent rate of the full run.
  • the losses tracked by compute_loss have a bunch of caveats; see #2617 for an overview of what's amiss.
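A hedged sketch of that trial-and-error workers tuning (corpus, sizes, and candidate counts are all stand-ins; timings on your real corpus and hardware are what matter):

```python
import time
from gensim.models import Word2Vec

# Stand-in corpus: 5,000 sentences of 100 tokens over a 1,000-token vocabulary.
corpus = [["token%d" % ((s * 100 + i) % 1000) for i in range(100)] for s in range(5000)]

for workers in (2, 4, 6, 8, 12):
    t0 = time.time()
    Word2Vec(corpus, vector_size=100, epochs=1, min_count=1, workers=workers)
    print("workers=%d: %.1fs" % (workers, time.time() - t0))
```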

