Giter VIP home page Giter VIP logo

tx-word2vec-small's Introduction

TX-WORD2VEC

腾讯开源的word2vec模型。

原版15个G,一般爱好者很难玩出来。

所以做了一些小的。方便大家使用。

5000-small.txt 这个有5000词,可以下下来玩玩

45000-small.txt 这个有4.5w的词,已经能解决很多问题了

70000-small.txt 7w词 133MB https://pan.baidu.com/s/1DprHD8HwEqkWRBG0ss2y1A

100000-small.txt 10w词 190MB https://pan.baidu.com/s/1KqPOwfrw3KoLJqTsCUdriA

500000-small.txt 50w词 953MB https://pan.baidu.com/s/1SGwxpGW8HjYw8HdKQUB8Gw

1000000-small.txt 100w词 1.9GB https://pan.baidu.com/s/1ObstPl7R8o1L98Ag9owGiw

2000000-small.txt 200w词 3.8GB https://pan.baidu.com/s/1hmCiMandgyedjmP520_Aog

再大就自己去下载吧

https://ai.tencent.com/ailab/nlp/en/data/Tencent_AILab_ChineseEmbedding.tar.gz

如何使用

读取模型

from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format("50-small.txt")

把玩模型

model.most_similar(positive=['女', '国王'], negative=['男'], topn=1)

model.doesnt_match("上海 成都 广州 北京".split(" "))

model.similarity('女人', '男人')

model.most_similar('特朗普',topn=10)

深度模式示例

使用LSTM模型,根据豆瓣评论,预测打分。

首先下载豆瓣的数据

豆瓣评论数据149M https://pan.baidu.com/s/1WbqoCKsmrnpf6n5ZTV-fKA

然后下载库对应的分词包。 https://pan.baidu.com/s/19busyY1yysbOgdYWxIaIQA

加载70000字典前 Image text

加载70000字典后 Image text

代码文件见 Use Tencent Word Embeddings with douban datasets.ipynb

抛转引玉,如果有哪位朋友有功夫更新其他的模型或者数据库,还可以request pull

有问题也可以开issue

都看到这里了,加入QQ群跟老司机们一起讨论下吧:

python自然语言处理 644612230

image

tx-word2vec-small's People

Contributors

cliuxinxin avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.