
Ngram2vec

The Ngram2vec toolkit was originally developed to reproduce the results of the paper Ngram2vec: Learning Improved Word Representations from Ngram Co-occurrence Statistics, and aims at learning high-quality word and ngram embeddings.

Thanks to its well-designed architecture (discussed below), the ngram2vec toolkit provides a general and powerful framework that can reproduce the work of a large number of papers and popular toolkits such as word2vec. It allows researchers to learn representations from co-occurrence statistics with little effort, and it can generate embeddings at different granularities (beyond word embeddings). For example, ngram2vec can be used to learn text embeddings; text embeddings trained with ngram2vec are very competitive, outperforming many deep and complex neural networks and achieving state-of-the-art results on a range of datasets. More details will be released later.

Ngram2vec has been successfully applied in many projects. For example, Chinese-Word-Vectors provides over 100 Chinese word embeddings with different properties, all of which are trained with the ngram2vec toolkit.

The original version (v0.0.0) of ngram2vec can be downloaded from the GitHub releases page; Python 2 is recommended for it. Use v0.0.0 if you want to reproduce the results of the paper.

Features

Ngram2vec features a decoupled architecture: the process from raw corpus to final embeddings is split into multiple modules. This brings several advantages over other toolkits.

  • Well-organized: The ngram2vec toolkit is easy to read and understand.
  • Extensibility: One can add co-occurrence statistics and embedding models with little effort.
  • Intermediate results reuse: Intermediate results are written to disk and reused later, which greatly improves efficiency in both time and space.
  • Comprehensive: Ngram2vec covers a large body of work related to word embeddings.
  • Embeddings of different linguistic units: Ngram2vec can learn embeddings of different linguistic units. For example, it can produce high-quality text embeddings that achieve state-of-the-art results on a range of datasets.

Requirements

  • Python (both Python 2 and Python 3 are supported)
  • numpy
  • scipy
  • sparsesvd

Example use cases

First, run the following commands to make the scripts executable and compile the C code.
chmod +x *.sh
chmod +x scripts/clean_corpus.sh
python scripts/compile_c.py

A corpus should also be prepared. We recommend fetching the one at
http://nlp.stanford.edu/data/WestburyLab.wikicorp.201004.txt.bz2 , a wiki corpus without XML tags. scripts/clean_corpus.sh is used for cleaning English corpora, for example:
scripts/clean_corpus.sh WestburyLab.wikicorp.201004.txt > wiki2010.clean
A pre-processed (word-segmented) Chinese wiki corpus is available at https://pan.baidu.com/s/1kURV0rl and can be used directly as input to this toolkit.

Run ./word_example.sh to reproduce the word embedding baselines.
Run ./ngram_example.sh to introduce ngrams into recent word representation methods, inspired by the traditional language modeling problem.

Workflow
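
The workflow diagram from the original README is not reproduced here. As a rough sketch of the decoupled pipeline, reconstructed from the example scripts and the module names mentioned on this page (the exact arguments are in word_example.sh and ngram_example.sh; paths below are illustrative), the stages look roughly like this:

python ngram2vec/corpus2vocab.py --corpus_file corpus.txt --vocab_file outputs/vocab ...
python ngram2vec/corpus2pairs.py --corpus_file corpus.txt --vocab_file outputs/vocab --pairs_file outputs/pairs ...
python ngram2vec/pairs2vocab.py ...           # vocabularies built from the generated pairs
python ngram2vec/pairs2sgns.py ...            # SGNS path: train directly on the pairs
python ngram2vec/pairs2counts.py ...          # count-based path: build co-occurrence counts
python ngram2vec/counts2glove.py ...          # e.g. GloVe trained on the counts
python ngram2vec/similarity_eval.py ...       # evaluation on similarity datasets
python ngram2vec/analogy_eval.py ...          # evaluation on analogy datasets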

Testsets

Besides English word analogy and similarity datasets, we provide several Chinese analogy datasets that contain comprehensive analogy questions. Some of them were constructed by directly translating English analogy datasets; others are unique to Chinese. We hope they will become useful resources for evaluating Chinese word embeddings.

References

@inproceedings{DBLP:conf/emnlp/ZhaoLLLD17,
     author = {Zhe Zhao and Tao Liu and Shen Li and Bofang Li and Xiaoyong Du},
     title = {Ngram2vec: Learning Improved Word Representations from Ngram Co-occurrence Statistics},   
     booktitle = {Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, {EMNLP} 2017, Copenhagen, Denmark, September 9-11, 2017},      
     year = {2017}
 }

Acknowledgments

This toolkit is inspired by Omer Levy's work at http://bitbucket.org/omerlevy/hyperwords .
We reuse part of his code in this toolkit and thank him for his kind suggestions.
We also received help from Bofang Li, Prof. Ju Fan, and Jianwei Cui at Xiaomi.
My advisors are Tao Liu and Xiaoyong Du.

Contact us

We look forward to your questions and suggestions about this toolkit and will reply as soon as possible. We will continue to improve it.

Zhe Zhao, [email protected], from DBIIR lab
Shen Li, [email protected]
Renfen Hu, [email protected]


ngram2vec's Issues

Divide by zero error with 'seen' variable

I'm getting this divide by zero error during the evaluation of every embedding when doing ngram_ngram.

Counts2glove finished
seen/total: 1/203
testsets/similarity/ws353_similarity.txt: nan
Traceback (most recent call last):
  File "/home/mpickard/Projects/ngram2vec/ngram2vec/ngram2vec/analogy_eval.py", line 102, in <module>
    main()
  File "/home/mpickard/Projects/ngram2vec/ngram2vec/ngram2vec/analogy_eval.py", line 75, in main
    accuracy_add = float(correct_add) / seen
ZeroDivisionError: float division by zero

I'm using Python 3. Any ideas on why seen is zero in the analogy_eval.py code? And why do sim_actual and sim_expected end up with a correlation of zero in the similarity_eval.py code? I was trying the code on a small corpus.
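
For context, the seen/total output above suggests that most evaluation items are skipped because their words are out of vocabulary on such a small corpus; if no analogy question is fully covered by the vocabulary, seen stays at 0 and the division fails. A minimal defensive guard (an illustrative sketch only, not a patch from the repository; correct_add and seen are the names from the traceback):

# Avoid dividing by zero when every analogy question contains an OOV word.
if seen == 0:
    print("No analogy question was fully covered by the vocabulary; use a larger corpus.")
    accuracy_add = float("nan")
else:
    accuracy_add = float(correct_add) / seen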

Adding Chinese character features to the context in SGNS

Hi, thanks for open-sourcing the ngram2vec toolkit.

I learned from the CA8 paper that adding ngram+char+word embeddings to the context works very well on Chinese corpora.
I want to train SGNS on my own corpus with an ngram+char+word context. The ngram2vec toolkit already supports adding ngrams to the context; what work is needed to add character features to the context?

Looking forward to your reply.

What is an input vector file? And workflow diagram confusion

Hello, I am using your paper for my own thesis research. Looking at the workflow.jpg diagram, the arrows are a bit confusing. I am trying to use the Skip-Gram ngram-ngram method. From what I understand, it seems that I have to go through the steps corpus2vocab -> corpus2pairs -> pairs2sgns. But pairs2sgns requires an "--input_vector_file" argument. I don't know what that is, and the steps didn't generate one. I assume it's the resulting word embedding vectors in a file, but if I had that, I wouldn't be using the tool. Do I have to run the original word2vec SG method, save a .vec model, and use it here? I read the research paper and didn't find an answer to this either. I also tried pairs2vocab, but it also doesn't generate the input vector file.

A separate issue is with corpus2pairs: it generates 4 different .txt files (pairs.txt_0, pairs.txt_1, pairs.txt_2, pairs.txt_3) when I give the argument "--pairs_file ./pairs.txt". Do I then have to run pairs2sgns for all pairs files? Do I generate different output vector files for each? Do the vector files get overwritten or appended to?
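
For what it's worth, one pairs.txt_<i> file appears to be produced per worker process (--processes_num). Assuming the per-process files simply partition the same set of pairs (an assumption; the README does not state this), they can be merged into a single file before the downstream steps, e.g.:

cat pairs.txt_0 pairs.txt_1 pairs.txt_2 pairs.txt_3 > pairs.txt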

Is there a problem with overlap verification in line2pairs?

In line2pairs, lines 53 and 56 check whether the input ngram and the output ngram overlap (completely or partially).
For complete overlap, I reckon you have to check whether the first token of each ngram is the same and whether both ngrams have the same length. However, this last check is performed with the input_order and output_order variables, which do not represent those particular ngrams' lengths but the maximum ngram length to search for in the line. For example, if you have input_order = 1, output_order = 2 and overlap = True, you will never pass the input_order == output_order check, and therefore you will eventually get ngrams paired with themselves.
The same thing happens with the partial overlap check.

Shouldn't line 53 be
if i == l and j == k:
instead of
if i == l and input_order == output_order:

And line 56
if len(set(range(i, i + j)) & set(range(l, l + k))) > 0:
instead of
if len(set(range(i, i + input_order)) & set(range(l, l + output_order))) > 0:
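
To make the intended check concrete, here is a small self-contained sketch of the overlap test using each ngram's actual length, following the fix proposed above (i and l are the start positions, j and k the lengths of the input and output ngram):

def overlaps(i, j, l, k):
    # Complete overlap: same start position and same length.
    complete = (i == l and j == k)
    # Partial overlap: the two token spans share at least one position.
    partial = len(set(range(i, i + j)) & set(range(l, l + k))) > 0
    return complete, partial

print(overlaps(3, 1, 3, 2))  # (False, True): a unigram at position 3 vs. a bigram over positions 3-4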

Implementation of bigrams

Thanks to the developers for sharing! I want to implement bigrams; in which folder is the source code for the bigram implementation?

Chinese corpus link no longer works

Hello, what format should the Chinese corpus be in? The Baidu Pan link can no longer be downloaded.
Training on my own corpus raises errors. Could you describe the format of the Chinese corpus? Thanks!

About input_vector_file?

Hello, first of all thank you for open-sourcing this. I would like to ask what the input_vector_file used in the final vector-generation step is. The vocab and counts files are generated in advance, but this file is unclear to me. Many thanks!

Question

How should I install representations.matrix_serializer? My pip cannot find a matching version.

How to train Chinese word+character+ngram context features

Hi, I have been using the ngram2vec toolkit recently and am a bit confused. To obtain word+character+ngram context features, how should my corpus be processed: word segmentation or character segmentation?
If the corpus is word-segmented, what arguments should I pass to the scripts to obtain character features? I could not find this part in the code.

Error when running ngram2vec/pairs2sgns.py on Linux

The later steps cannot run either.
Pairs2sgns
Traceback (most recent call last):
  File "ngram2vec/pairs2sgns.py", line 53, in <module>
    main()
  File "ngram2vec/pairs2sgns.py", line 47, in main
    return_code = subprocess.call(command)
  File "/home/anaconda3/lib/python3.6/subprocess.py", line 267, in call
    with Popen(*popenargs, **kwargs) as p:
  File "/home/anaconda3/lib/python3.6/subprocess.py", line 709, in __init__
    restore_signals, start_new_session)
  File "/home/anaconda3/lib/python3.6/subprocess.py", line 1344, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: './word2vec/word2vec': './word2vec/word2vec'
Traceback (most recent call last):
  File "ngram2vec/similarity_eval.py", line 71, in <module>
    main()
  File "ngram2vec/similarity_eval.py", line 41, in main
    matrix, vocab, _ = load_dense(args.input_vector_file)
  File "/home/yongzhuo/ngram2vec/ngram2vec/utils/matrix.py", line 23, in load_dense
    with codecs.open(path, "r", "utf-8") as f:
  File "/home/anaconda3/lib/python3.6/codecs.py", line 897, in open
    file = builtins.open(filename, mode, buffering)
FileNotFoundError: [Errno 2] No such file or directory: 'outputs/wikipedia/word_word/sgns/sgns.input'
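
The missing './word2vec/word2vec' in the first traceback is the compiled C binary that pairs2sgns.py launches via subprocess.call; the second error follows from the first, since the training output was never produced. If the C code has not been compiled yet, a reasonable first thing to try is the compile step from the setup section (assuming it builds the word2vec binary, as its name suggests):

python scripts/compile_c.py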

Some training parameters are not exposed in the new version?

First of all, thanks for open-sourcing this. I would like to ask why the new version of pairs2sgns.py and counts2glove does not expose some training parameters, such as window and min_count. Are the defaults always used? If I want to adjust them, do I need to modify the C code, or run the C version directly instead of Python? Thanks again.

question about subsampler

subsampler = dict([(word, 1 - sqrt(subsample / count)) for word, count in six.iteritems(vocab) if count > subsample]) #subsampling technique

I am confused about the sub-sampler in corpus2pairs. I think 1 - sqrt(subsample / count) should be replaced with 1 - sqrt(subsample / (count / total_word_count_in_vocab)).

P.S. I might misunderstand your implementation; in the actual implementation of the original word2vec.c, the subsample probability equals 1 - (sqrt(subsample / (count / total_word_count_in_vocab)) + subsample / (count / total_word_count_in_vocab)).
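
To make the difference concrete, here is a hedged sketch comparing the expression quoted from corpus2pairs above with the word2vec.c-style expression the reporter describes (total stands for the total token count of the vocabulary; the numbers are only an illustration):

from math import sqrt

def discard_prob_quoted(count, subsample):
    # Expression quoted from corpus2pairs above: uses the raw count.
    return 1 - sqrt(subsample / float(count))

def discard_prob_word2vec_style(count, total, subsample):
    # word2vec.c-style expression described by the reporter: uses the relative frequency.
    freq = float(count) / total
    return 1 - (sqrt(subsample / freq) + subsample / freq)

# A word seen 1,000 times in a 1,000,000-token corpus, with subsample = 1e-3:
print(discard_prob_quoted(1000, 1e-3))                    # 0.999 -> almost always discarded
print(discard_prob_word2vec_style(1000, 1000000, 1e-3))   # -1.0  -> never discarded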

Process 6-grams

Can someone verify that I'm thinking correctly about how to generate 6-grams? In the ngram_example.sh script, I simply change the "order" and "output_order" values to 6, correct?

...
python ngram2vec/corpus2vocab.py --corpus_file ${corpus} --vocab_file ${output_path}/vocab --memory_size ${memory_size} --feature ngram --order 6
python ngram2vec/corpus2pairs.py --corpus_file ${corpus} --pairs_file ${output_path}/pairs --vocab_file ${output_path}/vocab --processes_num ${cpus_num} --cooccur ngram_ngram --input_order 1 --output_order 6
...

error: ./word2vecf/word2vecf: cannot execute binary file

Hi, first of all thank you for open-sourcing the ngram2vec embedding toolkit. I ran into an error while using it and would like to ask about it.

Running uni-bi.sh fails with: ./word2vecf/word2vecf: cannot execute binary file

Mac OS 10.13.6

What I found suggests that an executable built for Linux cannot be run on macOS:
https://superuser.com/questions/724301/how-to-solve-bash-cannot-execute-binary-file

According to your file output, this program is for GNU/Linux. I know this because:

The file b1 is in the ELF (Extensible and Linkable Format) format, while Mac OS X uses the Mach-O format for binaries;

file recognizes this file is for GNU/Linux 2.6.18, meaning it'll work on most modern Linux distributions.

To solve your problem, you must either run this program within a Linux distribution, recompile the program, or get the Mac OS X version of this program.

I would like to ask: is the word2vecf/word2vecf used by uni_bi.sh to compute SGNS the binary produced by compiling with word2vec/makefile? Can it be recompiled on macOS?

Also, why does demo_simplified.sh not need to execute the word2vecf binary? Running word2vecf.py can also compute SGNS. What is the relationship between the word2vecf binary executable and word2vecf.py?

type error

I'm trying to run uni_bi.sh on a Chinese/UTF-8 word-segmented file, but I always get the following error.
Any ideas?

==========
Traceback (most recent call last):
  File "ngram2vec/pairs2counts.py", line 109, in <module>
    main()
  File "ngram2vec/pairs2counts.py", line 88, in main
    counts_file.write(str(old[0]) + " " + str(w) + " " + str(old[1][w]) + "\n")
TypeError: write() argument 1 must be unicode, not str
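
This is the classic Python 2 situation where a file object opened in text/Unicode mode (e.g. via codecs.open) only accepts unicode, while str(...) produces byte strings. A hedged sketch of one way to write the line so it works on both Python 2 and 3 (illustrative only, not the toolkit's actual fix; the file name and values are made up):

import io

def write_count(counts_file, word, context, count):
    # Build the line as unicode so write() accepts it on Python 2 as well as Python 3.
    counts_file.write(u"{} {} {}\n".format(word, context, count))

# Usage sketch: open the counts file with an explicit encoding.
with io.open("counts.txt", "w", encoding="utf-8") as f:
    write_count(f, u"word", u"context", 42)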
