thunlp / se-wrl

Improved Word Representation Learning with Sememes

License: MIT License

C 94.77% Shell 4.36% Makefile 0.37% Python 0.51%

se-wrl's Introduction

SE-WRL

This is the lab code for the ACL 2017 paper Improved Word Representation Learning with Sememes. Sememes are minimum semantic units of word meanings, and the meaning of each word sense is typically composed of several sememes. Since sememes are not explicit for each word, people manually annotate word sememes and form linguistic common-sense knowledge bases. In this paper, we show that word sememe information can improve word representation learning (WRL), which maps words into a low-dimensional semantic space and serves as a fundamental step for many NLP tasks. The key idea is to use word sememes to capture the exact meaning of a word within a specific context. More specifically, we follow the Skip-gram framework and present three sememe-encoded models to learn representations of sememes, senses and words, where we apply the attention scheme to detect word senses in various contexts. We conduct experiments on two tasks, word similarity and word analogy, and our models significantly outperform the baselines. The results indicate that WRL can benefit from sememes via the attention scheme, and also confirm that our models are capable of correctly modeling sememe information.

New Version

A new version of SAT has been released: https://github.com/thunlp/SE-WRL-SAT

Other methods are coming soon.

How to Run

Use the following commands to train word-sense-sememe embeddings.

cp SSA.c[SSA.c/MST.c/SAC.c/SAT.c] word2vec/word2vec.c
cd word2vec
make
./word2vec -train TrainFile -output vectors.bin -cbow 0 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 30 -binary 1 -iter 3 -read-vocab VocabFile -read-meaning SememeFile -read-sense Word_Sense_Sememe_File -min-count 50 -alpha 0.025

TrainFile is the training data set. The other three files can be found in the datasets directory: VocabFile is the word vocabulary file, SememeFile is the sememe vocabulary file, and Word_Sense_Sememe_File records the word-sense-sememe grouping information.

Before training, you should replace word2vec/word2vec.c with one of the four files SSA.c/MST.c/SAC.c/SAT.c.

Data Set

HowNet.txt is a Chinese knowledge base with annotated word-sense-sememe information.

Sogou-T(sample).txt is a sample dataset extracted from Sogou-T.

The complete training dataset, Clean-SogouT, is available at https://pan.baidu.com/s/1kXgkyJ9 (password: f2ul).

Evaluation Set

wordsim-240.txt and wordsim-297.txt in this directory are used to evaluate the quality of word representations.
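
Word-similarity benchmarks of this kind are commonly scored by the Spearman correlation between human ratings and the cosine similarity of the learned vectors. The sketch below illustrates that procedure under stated assumptions: the evaluation pairs are already loaded as (word1, word2, score) triples, the vectors are a plain dictionary, and all function names and toy values are hypothetical. It is not the evaluation script used for the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def evaluate_similarity(pairs, vectors):
    """Return (spearman, pairs_used); pairs with unknown words are skipped."""
    gold, pred = [], []
    for w1, w2, score in pairs:
        if w1 in vectors and w2 in vectors:
            gold.append(score)
            pred.append(cosine(vectors[w1], vectors[w2]))
    return spearman(gold, pred), len(gold)


# Toy data (hypothetical, for illustration only).
toy_vectors = {
    "cat": [1.0, 0.0],
    "dog": [0.9, 0.1],
    "car": [0.0, 1.0],
    "bus": [0.2, 0.9],
}
toy_pairs = [
    ("cat", "dog", 9.0),
    ("car", "bus", 8.0),
    ("cat", "car", 1.0),
    ("dog", "unknown", 5.0),  # skipped: "unknown" has no vector
]
rho, used = evaluate_similarity(toy_pairs, toy_vectors)
print(f"Spearman: {rho:.3f} on {used} pairs")
```

Skipping out-of-vocabulary pairs is one design choice among several; counting them as zero similarity would give different numbers, so the choice should be reported alongside any score.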

analogy.txt in this directory is used to evaluate the models' capability of word analogy inference.
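
Word analogy questions of the form "a is to b as c is to ?" are often answered with the vector-offset method: pick the word whose vector is closest to b - a + c. The sketch below shows that method and an accuracy count. It assumes the questions are already loaded as 4-tuples and the vectors are a plain dictionary; the function names and toy vectors are hypothetical, and the format of analogy.txt is not assumed here.

```python
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def solve_analogy(a, b, c, vectors):
    """Return the word d maximizing cos(d, b - a + c), excluding a, b and c."""
    unit = {w: normalize(v) for w, v in vectors.items()}
    target = normalize([y - x + z for x, y, z in zip(unit[a], unit[b], unit[c])])
    best_word, best_score = None, -float("inf")
    for word, vec in unit.items():
        if word in (a, b, c):
            continue
        score = sum(p * q for p, q in zip(target, vec))
        if score > best_score:
            best_word, best_score = word, score
    return best_word


def analogy_accuracy(questions, vectors):
    """Fraction of answerable questions whose top prediction equals d."""
    answered = correct = 0
    for a, b, c, d in questions:
        if all(w in vectors for w in (a, b, c, d)):
            answered += 1
            if solve_analogy(a, b, c, vectors) == d:
                correct += 1
    return correct / answered if answered else 0.0


# Toy data (hypothetical, for illustration only).
toy_vectors = {
    "man": [1.0, 0.0, 0.0],
    "woman": [0.0, 1.0, 0.0],
    "king": [1.0, 0.0, 1.0],
    "queen": [0.0, 1.0, 1.0],
    "apple": [1.0, 1.0, 0.0],
}
toy_questions = [
    ("man", "woman", "king", "queen"),
    ("man", "woman", "king", "empress"),  # skipped: "empress" has no vector
]
print(solve_analogy("man", "woman", "king", toy_vectors))
print(analogy_accuracy(toy_questions, toy_vectors))
```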

Annotation Information

Code comments are provided for the four files SSA.c/MST.c/SAC.c/SAT.c. Comments on the code shared by all four files are included only in SSA.c.

Errata

We are sorry that we found some bugs in our algorithm implementation; they have been fixed in the GitHub version. The new experimental results are listed below, and we have also updated the paper (http://thunlp.org/~lzy/publications/acl2017_sememe.pdf). The new results still confirm the general idea and conclusions of our paper.

Word Similarity

Model	Wordsim-240	Wordsim-297
CBOW	57.7	61.1
GloVe	59.8	58.7
Skip-gram	58.5	63.3
SSA	58.9	64.0
MST	59.2	62.8
SAC	59.1	61.0
SAT	61.2	63.3

Word Analogy

Model	Capital	City	Relationship	All
CBOW	49.8	85.7	86.0	64.2
GloVe	57.3	74.3	81.6	65.8
Skip-gram	66.8	93.7	76.8	73.4
SSA	62.3	93.7	81.6	71.9
MST	65.7	95.4	82.7	74.5
SAC	79.2	97.7	75.0	81.0
SAT	82.6	98.9	80.1	84.5

se-wrl's Issues

data process

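Hello!
I ran into some problems while following your work.
After reproducing all of your work, I wanted to train word vectors on a different corpus. With the corpus and the vocab file prepared, I got stuck when generating my own Word_Sense_Sememe_File.
1. I found the data process code in the code you provide, but running it produces an empty file. After reading it carefully, I think the code may be incomplete; my guess is that the sense-level processing and the overall loop are missing. Is that right? If it is indeed incomplete, please let me know and I will code this part myself. If you could also provide the complete code for this part to help my further research, I would be very grateful!
2. Does the Word_Sense_Sememe_File you provide contain all the words in HowNet? Can I use the provided Word_Sense_Sememe_File directly for alignment?
Looking forward to your reply, thank you!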

How can I get the sememe ids of words that have only one sense?

In Word_Sense_Sememe_File, words with only one sense are not annotated with their sememes. Can these be obtained?

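How is it determined which senses and sememes a word contains?

After studying the paper, I could not find how it is determined which senses and sememes a word contains. The logic seems to be that this information is given in advance, and the corresponding word, sense and sememe embeddings are then looked up and learned. Is my understanding correct?
Great paper, thank you very much.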

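Unable to reproduce the SSA results

Hello authors, I greatly admire Tsinghua's work on word embeddings. Ever since a 2016 paper by Prof. Sun Maosong in the Journal of Chinese Information Processing (《中文信息学报》), we have been following the related work of Prof. Sun Maosong and Prof. Liu Zhiyuan.

However, I ran into the following problem: after cloning this repo and without modifying the code in any way, I ran the SSA code with the parameters suggested in the repo, and could not obtain better word representation evaluation results than CBOW (run with the same parameters).

The run parameters are as follows: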

./word2vec \
        -train ~/corpus/Clean-SogouT.txt \
        -output ~/ZhExpRet/sogouT/SE-WRL/SSA/vec$1.txt \
        -read-vocab ../datasets/VocabFile \
        -read-meaning ../datasets/SememeFile \
        -read-sense ../datasets/Word_Sense_Sememe_File \
        -cbow 0 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 16 \
        -binary 0 -iter 3 -min-count 50 -alpha 0.025

For the wordsim evaluation I used the gensim.evaluate_word_pairs method.

The results, averaged over four independent runs, are as follows:

	CBOW	SSA
word similarity 240	59.18	58.68
word similarity 240	62.42	61.34

VocabFile problem

Hi, when I add a word such as '蛤蛤 50' to VocabFile and then train the model, I get an error:

dy@ubun:~/SE-WRL-master/datasets$ ./word2vec -train Clean-Sogou-ALL.txt -output vectors.bin -cbow 0 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 30 -binary 1 -iter 3 -read-vocab VocabFile -read-meaning SememeFile -read-sense Word_Sense_Sememe_File -min-count 50 -alpha 0.025
Starting training using file Clean-Sogou-ALL.txt
Vocab size: 462667
Words in train file: 2655924607
462667
1983
Alpha: 0.024999 Progress: 0.00% Words/thread/sec: 1.83k
Segmentation fault

How can I add words to VocabFile?

how to get the "HowNet.txt"

How can HowNet.txt, SememeFile and Word_Sense_Sememe_File be extracted from the original HowNet data?

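Are all words in HowNet?

Each word has multiple senses S, and each sense has multiple sememes X.
HowNet is fairly limited and does not annotate C(X) for every word.
So how are the unannotated words handled?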

two questions about the paper

In the description of your dataset, the paper says "The average senses for each word are about 2.4", but in your Word_Sense_Sememe_File only 4652 of the 462667 words have more than one sense. How was the figure 2.4 computed?
In Section 4.5.2 of your paper, the word 'Havana' has one sense, and that sense consists of 4 sememes; the paper shows that different context words put different attention on different sememes. But in your algorithm the attention is over senses, not sememes (a sense is simply represented by the average of its sememes, and the sense embedding is used to represent the word if the word has only one sense). Am I misunderstanding this case?
Thanks for your reply!
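
The scheme this question describes (each sense represented by the average of its sememe embeddings, with attention computed over senses rather than sememes) can be sketched as follows. This is a minimal illustration written from the description above, not the repository's C code; the function names, the dot-product scoring and the toy vectors are all assumptions.

```python
import math


def mean_vector(vectors):
    """Element-wise average of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def softmax(scores):
    """Numerically stable softmax."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attend_over_senses(sense_sememes, sememe_vecs, context_vec):
    """Return (attention weights over senses, attention-weighted word vector).

    sense_sememes: one list of sememe ids per sense of the word.
    sememe_vecs:   mapping from sememe id to embedding.
    context_vec:   embedding summarizing the context (assumed given).
    """
    # Each sense is the plain average of its sememe embeddings.
    sense_vecs = [
        mean_vector([sememe_vecs[s] for s in ids]) for ids in sense_sememes
    ]
    # Attention is over senses: one score per sense, not per sememe.
    weights = softmax([dot(context_vec, sv) for sv in sense_vecs])
    dim = len(context_vec)
    word_vec = [
        sum(w * sv[i] for w, sv in zip(weights, sense_vecs)) for i in range(dim)
    ]
    return weights, word_vec


# Toy data (hypothetical): a word with two senses, then one with a single sense.
sememe_vecs = {
    "fruit": [1.0, 0.0],
    "computer": [0.0, 1.0],
    "brand": [0.2, 0.8],
}
two_senses = [["fruit"], ["computer", "brand"]]
weights, word_vec = attend_over_senses(two_senses, sememe_vecs, [0.0, 2.0])
single_weights, single_vec = attend_over_senses(
    [["computer", "brand"]], sememe_vecs, [0.0, 2.0]
)
print(weights, word_vec)
print(single_weights, single_vec)
```

With a single sense the softmax has one entry, so the weight is 1 and the word vector equals the averaged sememe vector regardless of context. Under this scheme the context cannot redistribute weight among the sememes of a one-sense word, which is the point the question raises.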

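Corpus

Hello, I would like to train your model on my own corpus. What does the model require of the corpus? Is simple word segmentation enough? In the train_sample.txt you provide, I see: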
据新华社伊斯兰堡 1月电（记者许）巴基斯坦外交部今天宣布，由于美国方面至今既不交付巴方已订购的，又不退还，巴政府计划就此事上诉美国法院。
随着各国人民之间交往的增多，他们的文化、风俗、习惯等等也都在相互传播，相互吸收，这应该说是件好事
This is not simply word-segmented text. What do the markers here each stand for?

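Data questions

Hello,
The code works now. The spaces and line breaks in the source data files had been altered, which caused the files to be parsed incorrectly. I have two questions for you.
I would like to use your work to train word vectors and then compare their effectiveness in intelligent question answering and synonym mining.

Are the two files you provide, Word_Sense_Sememe_File and SememeFile, complete?
Do you plan to develop an interface for incremental training?
Best wishes!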

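Data preprocessing files & other languages & efficiency

Could you share the data preprocessing files, for example the code that generates the three files meaning, sememe and vocabulary?
I want to run this model on another corpus, and found that this code is very strict about the input format; after rebuilding the input files I ran into many problems. Some concern the input format, and some concern parameters in the code itself that need to be changed. So I would like the data preprocessing files in order to rule out problems with the input.

The model cannot be used directly on other Chinese corpora because the vocab differs. Simply aligning, as in the earlier issue, would presumably lose information, so I want to rebuild the three input files.

Also, have you run similar experiments on English?

And does training become much slower after adding sememes?

Could you release the trained word embedding files?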

Runtime error: Segmentation fault (core dumped)

Hello, when I run the SAT.c code I get the error Segmentation fault (core dumped). Have you encountered this before?

how to interpret the output

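After training (binary=1) I get a .bin file. How should the output be interpreted?

Description
Each line of the .bin file contains one Chinese character, followed by an integer i, then embedding_dim * i float numbers, and finally a series of integers.

My guess:
Each Chinese character in the .bin file stands for its corresponding sense, the integer i that follows is the number of sememe embeddings for that sense, and the float numbers after that are the actual sememe embeddings. Then what are the integers at the end?

For example:
There are two entries with word = “来”. The first “来” is the first sense of 来; it is followed by the integer i = 8, meaning that 8 sememes make up this sense, and its embedding is the floating-point numbers that follow. The second “来” has no integer after it and is followed directly by the embedding of the single sememe that makes it up.

I could not find any documentation explaining the program's output. I would be very grateful if someone could answer this question!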

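Unable to reproduce the results

Training results

Hello, I would like to use your trained word vectors for downstream Chinese processing tasks to see how they perform. Could you provide them?

What was the original motivation for writing this model in C?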

problems about evaluation

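Hello, I ran into a problem while reproducing the results.
The SAT model uses the attention mechanism, and its output contains an embedding for every sense. How should the per-sense embeddings be used in the similarity and analogy tasks? For now I simply sum all sense embeddings to get the word embedding, but the results do not seem good. My guess is that the context and attention are still needed, but the paper does not seem to address this. Could you explain how this is actually implemented?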

Code question

Hello, I am running your code on Windows:
Starting training using file E:\C++Code\Word2Vector\datasets\Sougo-T(sample).txt
Vocab size: 462667
Words in train file: 2655924606
462667
1983
success load data
success InitNet
success InitUnigramTable
start train

During training, at
// BP
g /= total;
total can be equal to 0. The parameters are those set in the README. Could you help explain this? Thank you.

MST.c segmentation fault at runtime

Thread 2 "mst" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff9dd0700 (LWP 560)]
0x0000000008004d41 in TrainModelThread (id=0x0) at mst.c:792
792						g = (label - expTable[(int)((f + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2))]) * alpha;

Running

p syn1neg@10

shows that some of the memory addresses in the array point to 0x0.

Running

p attention@10

shows that some of the memory addresses in this array also point to 0x0.

How can this be solved?
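
For readers unfamiliar with the expression on the failing line: the index into expTable is word2vec's precomputed sigmoid lookup. The sketch below reproduces that lookup in Python with explicit range guards, purely to show what the expression computes. It does not diagnose the crash reported above. The constants 1000 and 6 are the values defined in the original word2vec.c and are assumed to be unchanged in this repository; the function name and the NaN guard are additions for illustration.

```python
import math

EXP_TABLE_SIZE = 1000  # value in the original word2vec.c (assumed unchanged)
MAX_EXP = 6            # value in the original word2vec.c (assumed unchanged)

# exp_table[i] holds sigmoid(x) for x spread evenly over [-MAX_EXP, MAX_EXP).
exp_table = []
for i in range(EXP_TABLE_SIZE):
    e = math.exp((i / EXP_TABLE_SIZE * 2 - 1) * MAX_EXP)
    exp_table.append(e / (e + 1))


def gradient(f, label, alpha):
    """Scaled error (label - sigmoid(f)) * alpha using the lookup table."""
    if math.isnan(f):
        # Added guard: a NaN fails both comparisons below, so without this
        # check it would reach the index computation.
        return 0.0
    if f > MAX_EXP:
        return (label - 1) * alpha  # sigmoid saturates at 1
    if f < -MAX_EXP:
        return (label - 0) * alpha  # sigmoid saturates at 0
    # Same intent as the C index expression, computed in floating point
    # (the C macros are integers, so the C scale factor is truncated).
    index = int((f + MAX_EXP) / (2 * MAX_EXP) * EXP_TABLE_SIZE)
    index = min(index, EXP_TABLE_SIZE - 1)  # f == MAX_EXP would give 1000
    return (label - exp_table[index]) * alpha


print(gradient(0.0, 1, 0.025))
print(gradient(10.0, 1, 0.025))
print(gradient(-10.0, 1, 0.025))
```

The table lookup is only valid when the index stays inside the table, which is why the two saturation checks come first. A value of f that is not a finite number is the one case those comparisons do not catch.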

What do these training files look like?

Hi, I am a tyro in this area, and I am sorry that I cannot follow the README clearly. Could you provide some examples of your training files? I cannot figure out what these 4 training files look like. Thank you~

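Word sense disambiguation

Hello, the paper mentions word sense disambiguation. What exactly is the disambiguation procedure, and could you provide a demo?

Hello, a question about the size of the pretrained vocabulary.

Hello, the pretrained vocabulary (https://cloud.tsinghua.edu.cn/d/76ab4a71efa541bd8eb3/) has 475500 Chinese words, while HowNet seems to have only about 210000. How are Chinese words that are not in HowNet handled (that is, how do you get 400K+ word vectors from HowNet's 200K+ vocabulary)? Also, is there a pretrained English vocabulary? Looking forward to your reply, thank you very much.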

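multi-embedding for one context word

Hello, while reproducing the work I suddenly noticed a problem. If skip-gram is used to incorporate sememe information, could the SAC model end up with multiple embedding representations for a single context_word?
For example, take a sentence with the six words a b c d e f, and a context window of 2.
When building the training set, the target word "c" yields the pairs [c,a],[c,b],[c,d],[c,e]. When the target word moves to d, the pairs are [d,b],[d,c],[d,e],[d,f]. So the context words b and e get different embeddings for different target words. How is this case handled?
Likewise, for the SAT model a single target_word may also end up with multiple embedding representations.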
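
The pair enumeration in this example can be reproduced with a few lines of Python. This is only a sketch of fixed-window skip-gram pair generation; the function name is hypothetical, and the random per-target window shrinking that word2vec.c applies is deliberately left out so that the output matches the example.

```python
def skipgram_pairs(tokens, window):
    """Return (target, context) pairs for a fixed symmetric window."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


sentence = ["a", "b", "c", "d", "e", "f"]
pairs = skipgram_pairs(sentence, window=2)

pairs_for_c = [p for p in pairs if p[0] == "c"]
pairs_for_d = [p for p in pairs if p[0] == "d"]
# Every target word that sees "b" as a context word.
targets_of_b = sorted({target for target, context in pairs if context == "b"})

print(pairs_for_c)
print(pairs_for_d)
print(targets_of_b)
```

The last list makes the concern concrete: "b" serves as a context word for several different targets, so any representation of "b" that is computed from the target word is recomputed for each pair.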

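Training and evaluation questions

Hello! I am new to word vectors, so I have two questions for you:
1. In the commands
make
./word2vec -train TrainFile -output vectors.bin -cbow 0 -size 200 -window 8 ...
does TrainFile refer to HowNet.txt or Sogou-T(sample).txt in datasets?
2. How is the trained vectors.bin evaluated on word analogy and word similarity? Is code provided, or where can I find these materials?
Thank you very much!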

vectors.bin

I have generated this file. How do I load it? I cannot load these trained word vectors with a standard gensim model.

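The paper cannot be opened; it reports a missing font

The updated paper (http://thunlp.org/~lzy/publications/acl2017_sememe.pdf) cannot be opened; the viewer reports a missing font, “kanerc+cmmi8”, and this font cannot be downloaded anywhere online. Could you republish a version of the paper that is compatible with Windows?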

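The role of word_vec

Hello, after reading your code I would like to ask: is the only role of each word's vec to provide the context vector?
Or is word_vec used elsewhere as well?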

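Model applications

Two questions about applying the model:

1. Once the model is trained, can it be used for WSD? That is, plain text in, word-sense-disambiguated text out?
2. Once the model is trained, can it be used to predict sememes, that is, plain text in, a sequence of sememes out?

Thank you very much!

Hello, I have now obtained results for SAT. They look good on similarity and on the mean rank metric for analogy, but the accuracy metric for analogy is far from the results in the paper. My parameter settings are as follows: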

./word2vec -train train.txt -output SAT_complete_iter2.bin -size 200 -window 8 -alpha 0.025 -cbow 0 -negative 25 -min-count 1 -hs 0 -binary 1 -iter 2 -sample 1e-4 -read-vocab VocabFile -read-meaning SememeFile -read-sense Word_Sense_Sememe_File -threads 10

The parameters all follow the paper; only iter is not mentioned there, so I set it to 2. Is there anything wrong with these settings?

Originally posted by @immrz in #8 (comment)

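help!

Hello authors!
After reading the code of SAT.c, there is one place I do not quite understand, and I hope you can explain it. Thank you!
Line 752 reads attention[q] += _exp[p] * syn0[target].mult_sense_value[p * layer1_size + q]; Why is this statement needed? This part is meant to get the "attention" result on senses, but as I understand it, when a word has multiple senses the preceding code uses only meaning_syn to obtain each sense and then applies the attention mechanism to get the word representation, which has nothing to do with the values in mult_sense_value of syn0. So I do not understand this statement. Am I misunderstanding something? I would be very grateful if you could point it out!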