Hi, why not do in every corpus, something like:

corpus: why not update self.length after iterating all (gensim issue, closed)

piskvorky commented on May 15, 2024

from gensim.

Comments (5)

piskvorky commented on May 15, 2024

Usually len() is needed earlier than iter(), so caching the length in iter wouldn't help.

But I'll add length caching to IndexedCorpus (see our Google groups discussion), so it doesn't matter anyway :) Killing two flies at once...

Dieterbe commented on May 15, 2024

> Usually len() is needed earlier than iter(), so caching the length in iter wouldn't help.

not in my case :)

> But I'll add length caching to IndexedCorpus (see our Google groups discussion), so it doesn't matter anyway :)

It does. Your codebase explicitly supports "the old way" of just having the streaming corpus without an index.
AFAICT, in the case where the user does not need corpus[123456]-style document retrieval (only streaming), iterates over the corpus first, and calls len() afterwards, there are two options for a fast len():
A) tell user to use an index (for the sole purpose of speeding up len())
B) add the code I suggested

I think A is quite expensive (building and storing the index structure but only using it for len()), so I would do B. But of course, it's your decision.
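
To make option B concrete, here is a minimal sketch. It is not the snippet from the original post and not gensim's actual code: the class name `StreamingCorpus`, its one-document-per-line file format, and the fallback counting pass in `__len__` are all assumptions made for illustration.

```python
import os
import tempfile


class StreamingCorpus:
    """Hypothetical streaming corpus: one whitespace-tokenized document per line."""

    def __init__(self, path):
        self.path = path
        self.length = None  # unknown until a full pass has completed

    def __iter__(self):
        count = 0
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                count += 1
                yield line.split()
        # Reached only if the pass ran to completion, so the count is exact.
        self.length = count

    def __len__(self):
        if self.length is None:
            # Fallback (an assumption of this sketch): a dedicated counting pass.
            with open(self.path, encoding="utf-8") as handle:
                self.length = sum(1 for _ in handle)
        return self.length


# Usage: iterate first, then ask for the length.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "corpus.txt")
    with open(path, "w", encoding="utf-8") as out:
        out.write("human interface computer\nsurvey user computer\ngraph trees\n")

    corpus = StreamingCorpus(path)
    for doc in corpus:
        pass
    cached = corpus.length  # set as a side effect of the completed pass
    n_docs = len(corpus)    # answered from the cache, without re-reading the file
```

If the loop is abandoned early (for example with `break`), the assignment after the loop never runs, so a partial count is never cached.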

piskvorky commented on May 15, 2024

Ok. I still think determining your input data length belongs conceptually elsewhere (i.e., not in gensim at all), but on the other hand, it's just 3 lines of code and I finally want to see how pull requests work on GitHub :) Can you please initiate a pull request?

EDIT: (to develop branch)

Dieterbe commented on May 15, 2024

#4
there you go.
I went over all the corpus classes and found 2 of them that benefit from this tweak. So it's 6 lines ;)

Dieterbe commented on May 15, 2024

Note that GitHub automatically generates an issue for a pull request.
In this case that's issue 4:
#4
