Comments (5)
Usually len() is needed earlier than iter(), so caching the length in iter wouldn't help.
But I'll add length caching to IndexedCorpus (see our Google groups discussion), so it doesn't matter anyway :) Killing two flies at once...
from gensim.
> Usually len() is needed earlier than iter(), so caching the length in iter wouldn't help.
Not in my case :)
> But I'll add length caching to IndexedCorpus (see our Google groups discussion), so it doesn't matter anyway :)
It does. Your codebase explicitly supports "the old way" of just having the streaming corpus without an index.
AFAICT, in the case where the user does not need corpus[123456]-style document retrieval (only streaming), and where the user iterates over the corpus first and calls len() afterwards, there are two options for a fast len():
A) tell user to use an index (for the sole purpose of speeding up len())
B) add the code I suggested
I think A is quite expensive (building and storing the index structure but only using it for len()), so I would do B. But of course, it's your decision.
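For illustration, option B could look roughly like the sketch below. This is not the actual gensim patch: the class name `LineCorpus`, the one-document-per-line file format, and the `length` attribute are all assumptions made for the example. The idea is only that a full pass through `__iter__` records how many documents it yielded, so a later `len()` costs nothing extra.

```python
class LineCorpus:
    """Hypothetical streaming corpus: one whitespace-tokenized document per line."""

    def __init__(self, path):
        self.path = path
        self.length = None  # unknown until one full pass has completed

    def __iter__(self):
        count = 0
        with open(self.path, encoding="utf-8") as fin:
            for line in fin:
                count += 1
                yield line.split()
        # Only reached when the pass ran to the end of the file,
        # so an abandoned (partial) iteration never caches a wrong length.
        self.length = count

    def __len__(self):
        if self.length is None:
            # No full pass has happened yet: do one dedicated scan.
            # __iter__ also sets self.length as a side effect.
            self.length = sum(1 for _ in self)
        return self.length
```

With this shape, iterating first and calling `len()` afterwards needs only a single pass over the data, while calling `len()` first still works at the cost of one extra scan.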
Ok. I still think determining your input data length belongs conceptually elsewhere (i.e., not in gensim at all), but on the other hand, it's just 3 lines of code and I finally want to see how pull requests work on GitHub :) Can you please initiate a pull request?
EDIT: (to the develop branch)
#4
there you go.
I went over all the corpus classes and found 2 of them that benefit from this tweak. So it's 6 lines ;)
Note that GitHub automatically generates an issue for a pull request.
In this case that's issue 4:
#4