Comments (4)
Ok, I added the option to initialize TfidfModel via an existing Dictionary object, commit c65b3ff.
This can help if a) we have the dictionary in the first place (the corpus was constructed through dictionary.doc2bow) and b) corpus iteration is slow, so the one extra pass that we save this way matters.
I also PEP8-fied the tfidf code while I was at it.
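To illustrate the idea, here is a minimal sketch in plain Python. It is not the code from the commit: the function names, the toy corpus, and the `log2(num_docs / df)` idf formula are assumptions made for the example. It shows that the idf weights only depend on the document frequencies and the document count, so if a dictionary already tracked those while building the corpus, the counting pass can be skipped.

```python
import math
from collections import Counter

# Toy bag-of-words corpus: each document is a list of (term_id, count) pairs,
# the shape that dictionary.doc2bow produces (each term id at most once per doc).
corpus = [
    [(0, 1), (1, 2)],
    [(1, 1), (2, 1)],
    [(0, 3), (2, 1), (3, 1)],
]


def dfs_from_corpus(corpus):
    """Count document frequencies with one full pass over the corpus."""
    dfs = Counter()
    num_docs = 0
    for doc in corpus:
        num_docs += 1
        for term_id, _count in doc:
            dfs[term_id] += 1
    return dict(dfs), num_docs


def idfs_from_dfs(dfs, num_docs):
    """Turn document frequencies into idf weights (assumed form: log2(N / df))."""
    return {term_id: math.log2(num_docs / df) for term_id, df in dfs.items()}


# Route 1: pay for an extra corpus pass to collect the counts.
corpus_dfs, corpus_num_docs = dfs_from_corpus(corpus)
idfs_via_corpus = idfs_from_dfs(corpus_dfs, corpus_num_docs)

# Route 2: reuse statistics a dictionary already collected.
# These two values are hypothetical stand-ins for such a dictionary's
# document-frequency mapping and document count.
dictionary_dfs = {0: 2, 1: 2, 2: 2, 3: 1}
dictionary_num_docs = 3
idfs_via_dictionary = idfs_from_dfs(dictionary_dfs, dictionary_num_docs)
```

Both routes yield the same weights; the second one never touches the documents, which is where the saving comes from when iterating the corpus is expensive.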
from gensim.
I find it interesting that your commit message implies this change is only beneficial "if your corpus is super slow".
I would think building your idfs from the dict dfs will almost always be faster than iterating over the corpus. Even if iterating over the corpus is not I/O constrained (reading from disk, or over the network, as you suggest), the corpus approach has some overhead (it retrieves full documents, which may need to be converted to the BOW model, it needs to do more math, etc.), whereas the dict approach has the numbers it needs right away.
Either way, I just did one test run, and I'm not seeing a noticeable speedup yet, which surprises me...
edit: after a few more test runs, I actually think I see a speedup :) about 20-50% faster (hard to tell precisely because there is a big spread in my timings)
edit2: oops, my previous test runs were not executed correctly. I now see about a 100x speedup in tfidf model building
Yeah, it will always be faster, no question about it.
That comment meant that in a broader perspective (computing similarities, LSI, LDA, ...), the one extra pass that increments a few numbers is negligible, unless the pass (corpus iteration) itself is very expensive; then it matters. In some cases it might matter a lot, which in my opinion outweighs the added complexity of the code; that's why I accepted this feature.
Aha. I see. Thanks.