Suggested on project discussion list (<a href="https://groups.google.com/g/gensim/c/Cs

add functions to reproduce preprocessing matching `GoogleNews`, `GLoVe`, etc pretrained word-vectors

gojomo commented on May 30, 2024

add functions to reproduce preprocessing matching `GoogleNews`, `GLoVe`, etc pretrained word-vectors


Comments (1)

gojomo commented on May 30, 2024

My thoughts:

A desire for help here has come up a lot – & at times I've shared my observations about what can be deduced from the limited statements, & observable contents, of pre-trained vector sets like the 'GoogleNews' release.

However, without disclosures (or better yet code) from the original researchers who prepared such pretrained vectors, all such efforts will only ever be gradually-approximating their practices, with lingering exceptions & caveats generating more questions.

Also: it often seems to be beginner & small-data projects that are most-eager to re-use pretrained vectors from elsewhere, under the assumption those must be the "right" thing, or better than what they'd achieve. But: many times that's not the case.

For example, GoogleNews was trained on an internal Google corpus of news articles 11+ years ago. It used a statistical model for creating multiword-tokens whose exact parameters/word-frequencies/multigram-frequencies have never been disclosed. For many current projects, word-vectors trained on more-recent domain-specific data via understood & consciously-chosen preprocessing – even much less data! – will likely generate better vocabulary & relevant-word-sense coverage than Google's old work.

So while I'd see some value in a "best guess" function to mimic the tokenizing choices of those commonly-used pretrained sets – as a research effort, or contribution – I'd also prefer it prominently-disclaimered as non-official, & not-necessarily-an-endorsement of preferring those vectors, and that tokenization, for anyone's particular purpose.
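To make the "best guess" idea concrete, here is a rough, unofficial sketch of what such a helper might look like. Everything in it is an assumption for illustration: the function names `best_guess_tokenize` and `vocab_coverage`, the digit-masking rule, and the underscore-joining of caller-supplied phrase pairs are hypothetical, and none of them is a disclosed practice of any pretrained-vector release. The coverage check is included because the only real way to judge a guess is against the observable vocabulary of the vector set itself.

```python
import re

# NOTE: every rule below is a hypothetical guess for illustration only;
# it is NOT the documented preprocessing of GoogleNews, GloVe, or any other set.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")  # runs of word chars, or single punctuation marks
DIGIT_RE = re.compile(r"\d")


def best_guess_tokenize(text, phrases=frozenset(), mask_digits=True):
    """Unofficial best-guess tokenizer sketch.

    phrases: set of (first, second) token pairs to join with '_'
             (the caller must supply these; the original phrase model is unknown).
    mask_digits: if True, replace every digit with '#' (an assumed rule).
    """
    tokens = TOKEN_RE.findall(text)
    if mask_digits:
        tokens = [DIGIT_RE.sub("#", tok) for tok in tokens]

    joined = []
    i = 0
    while i < len(tokens):
        # Greedily merge an adjacent pair if it is a known phrase.
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
            joined.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            joined.append(tokens[i])
            i += 1
    return joined


def vocab_coverage(tokens, vocab):
    """Fraction of tokens present in a pretrained set's vocabulary."""
    if not tokens:
        return 0.0
    return sum(tok in vocab for tok in tokens) / len(tokens)


if __name__ == "__main__":
    toks = best_guess_tokenize("New York had 42 visitors.", phrases={("New", "York")})
    print(toks)
    # A toy stand-in for a real vocabulary (e.g. the keys of a loaded vector set).
    print(vocab_coverage(toks, {"New_York", "had", "##"}))
```

In practice the `vocab` argument would be the key set of whatever pretrained vectors are being matched, and a low coverage score would be a hint that the guessed rules differ from whatever the original researchers did.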

At this point, devising such helpers would be a sort of software-archeology/mystery project, and I'd not see it as any sort of urgent priority. But, it might make a good new-contributor, student, or hackathon project – especially if eventual integration includes good surrounding docs/discussion/demos of the limits/considerations involved in reusing another project's vectors/preprocessing choices.
