
se-wrl's Introduction


This is the lab code for the ACL 2017 paper Improved Word Representation Learning with Sememes. Sememes are minimum semantic units of word meanings, and the meaning of each word sense is typically composed of several sememes. Since sememes are not explicit for each word, people manually annotate word sememes and form linguistic common-sense knowledge bases. In this paper, we show that word sememe information can improve word representation learning (WRL), which maps words into a low-dimensional semantic space and serves as a fundamental step for many NLP tasks. The key idea is to use word sememes to accurately capture the exact meaning of a word within a specific context. More specifically, we follow the framework of Skip-gram and present three sememe-encoded models to learn representations of sememes, senses and words, where we apply the attention scheme to detect word senses in various contexts. We conduct experiments on two tasks, word similarity and word analogy, and our models significantly outperform baselines. The results indicate that WRL can benefit from sememes via the attention scheme, and also confirm that our models are capable of correctly modeling sememe information.

New Version

New version of SAT is released:

Other methods are coming soon.

How to Run

Use the following commands to train word-sense-sememe embeddings.

cp SSA.c[SSA.c/MST.c/SAC.c/SAT.c] word2vec/word2vec.c
cd word2vec
./word2vec -train TrainFile -output vectors.bin -cbow 0 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 30 -binary 1 -iter 3 -read-vocab VocabFile -read-meaning SememeFile -read-sense Word_Sense_Sememe_File -min-count 50 -alpha 0.025

TrainFile is the training data set. The other three files can be found in the datasets directory: VocabFile is the word vocabulary file, SememeFile is the sememe vocabulary file, and Word_Sense_Sememe_File records the word-sense-sememe grouping information.

Before training, you should replace word2vec/word2vec.c with one of the four files SSA.c/MST.c/SAC.c/SAT.c.

Data Set

HowNet.txt is a Chinese knowledge base with annotated word-sense-sememe information.

Sogou-T(sample).txt is a sample dataset extracted from Sogou-T.

Complete training dataset Clean-SogouT is released in f2ul).

Evaluation Set

wordsim-240.txt and wordsim-297.txt are used to evaluate the quality of word representations.
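Word similarity sets of this kind are commonly scored with Spearman's rank correlation between human ratings and model similarities. Assuming that is the metric used here, a minimal sketch follows. The function name spearman_no_ties is hypothetical, the model scores are assumed to be computed elsewhere, and the shortcut formula is only valid when neither list contains tied values.

```c
#include <assert.h>
#include <stddef.h>

/* Rank of v[i] among v[0..n-1], starting at 1; assumes all values are distinct. */
static double rank_of(const double *v, size_t n, size_t i) {
    double r = 1.0;
    for (size_t k = 0; k < n; k++)
        if (v[k] < v[i]) r += 1.0;
    return r;
}

/*
 * Spearman's rho between human ratings and model scores, using
 * rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), where d is the rank difference
 * of each pair. This shortcut is only valid when there are no ties.
 */
double spearman_no_ties(const double *human, const double *model, size_t n) {
    assert(n >= 2);
    double sum_d2 = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = rank_of(human, n, i) - rank_of(model, n, i);
        sum_d2 += d * d;
    }
    double dn = (double)n;
    return 1.0 - 6.0 * sum_d2 / (dn * (dn * dn - 1.0));
}
```

Identical orderings give 1 and fully reversed orderings give -1. Real rating files usually contain ties, in which case a tie-aware implementation is needed instead.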

analogy.txt is used to evaluate the models' capability for word analogy inference.
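Analogy questions of the form "a is to b as c is to ?" are usually answered by searching for the word vector closest to b - a + c. The sketch below shows that search under stated assumptions: the function name analogy_answer and the flat vocab layout are hypothetical, and every row is assumed to be normalized to unit length already, so that ranking by dot product is equivalent to ranking by cosine similarity. It is not taken from this repository's evaluation code.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return the index of the vocabulary row closest to (b - a + c).
 * vocab holds n_words * dim values, one unit-length vector per row
 * (assumption). The three query words are excluded from the candidates.
 */
size_t analogy_answer(const double *vocab, size_t n_words, size_t dim,
                      size_t a, size_t b, size_t c) {
    assert(n_words > 3 && a < n_words && b < n_words && c < n_words);
    size_t best = n_words;          /* sentinel: no candidate seen yet */
    double best_score = 0.0;
    for (size_t w = 0; w < n_words; w++) {
        if (w == a || w == b || w == c) continue;
        /* Dot product between the candidate and the offset vector b - a + c. */
        double score = 0.0;
        for (size_t i = 0; i < dim; i++) {
            double target = vocab[b * dim + i] - vocab[a * dim + i] + vocab[c * dim + i];
            score += target * vocab[w * dim + i];
        }
        if (best == n_words || score > best_score) {
            best = w;
            best_score = score;
        }
    }
    return best;
}
```

Accuracy is then the fraction of questions for which the returned index matches the expected word.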

Annotation Information

The annotation information is for the four files SSA.c/MST.c/SAC.c/SAT.c. Annotation of the common code is only included in file SSA.c.


We are sorry to report that we found some bugs in our algorithm implementation; they have been fixed in the GitHub version. The new experiment results are released on GitHub as follows, and we have also updated the paper. The new results still confirm the general idea and conclusion of our paper.

Word Similarity

Model      Wordsim-240  Wordsim-297
CBOW       57.7         61.1
GloVe      59.8         58.7
Skip-gram  58.5         63.3
SSA        58.9         64.0
MST        59.2         62.8
SAC        59.1         61.0
SAT        61.2         63.3

Word Analogy

Model      Capital  City  Relationship  All
CBOW       49.8     85.7  86.0          64.2
GloVe      57.3     74.3  81.6          65.8
Skip-gram  66.8     93.7  76.8          73.4
SSA        62.3     93.7  81.6          71.9
MST        65.7     95.4  82.7          74.5
SAC        79.2     97.7  75.0          81.0
SAT        82.6     98.9  80.1          84.5

se-wrl's People


clpl, heylinsir, tsingularity, zibuyu

se-wrl's Issues

data process

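1. I found the data-processing code in the code you provided, but running it produces an empty file. After reading it closely, I think the code may be incomplete: the sense-level processing and the main loop seem to be missing. Is that right? If it is indeed incomplete, please let me know and I will write that part myself. If you could also provide the complete code for this part to support my further research, I would be very grateful!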


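A question: after studying the paper, I could not find how you determine which senses and sememes a word contains. The logic seems to be that this information is given in advance, and the corresponding word, sense and sememe embeddings are then looked up and learned. Is my understanding correct?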

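Unable to reproduce the SSA results

Hello authors, I greatly admire Tsinghua's work on word embeddings. Ever since a paper by Prof. Sun Maosong published in the Journal of Chinese Information Processing (《中文信息学报》) in 2016, we have been following the related work of Prof. Sun Maosong and Prof. Liu Zhiyuan.

However, I ran into the following problem: after cloning this repo and without modifying the code in any way, I ran the SSA code with the parameters given in the repo, but could not obtain better word-representation evaluation results than CBOW (trained with the same parameters).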


./word2vec \
        -train ~/corpus/Clean-SogouT.txt \
        -output ~/ZhExpRet/sogouT/SE-WRL/SSA/vec$1.txt \
        -read-vocab ../datasets/VocabFile \
        -read-meaning ../datasets/SememeFile \
        -read-sense ../datasets/Word_Sense_Sememe_File \
        -cbow 0 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 16 \
        -binary 0 -iter 3 -min-count 50 -alpha 0.025

For the wordsim evaluation I used the gensim.evaluate_word_pairs method.


word similarity 240 59.18 58.68
word similarity 240 62.42 61.34

VocabFile problem

Hi, when I add a word such as '蛤蛤 50' to VocabFile and then train the model, I get an error:

dy@ubun:~/SE-WRL-master/datasets$ ./word2vec -train Clean-Sogou-ALL.txt -output vectors.bin -cbow 0 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 30 -binary 1 -iter 3 -read-vocab VocabFile -read-meaning SememeFile -read-sense Word_Sense_Sememe_File -min-count 50 -alpha 0.025
Starting training using file Clean-Sogou-ALL.txt
Vocab size: 462667
Words in train file: 2655924607
462667 1983
Alpha: 0.024999 Progress: 0.00% Words/thread/sec: 1.83k
Segmentation fault

How can I add words to VocabFile?



two questions about the paper

In the description of your dataset, the paper states that "the average senses for each word are about 2.4", but in your Word_Sense_Sememe_File only 4652 of the 462667 words have more than one sense. How was the figure 2.4 computed?
In Section 4.5.2 of your paper, the word 'Havana' has one sense, and that sense consists of 4 sememes; the paper shows that different context words put different attention on different sememes. But in your algorithm the attention is over senses, not sememes (a sense is simply represented by the average of its sememes, and if a word has only one sense, the sense embedding represents the word). Am I misunderstanding this case?
Thanks for your reply!


据 新华社 伊斯兰堡 1月 电 ( 记者 许 ) 巴基斯坦 外交部 今天 宣布 , 由于 美国 方面 至今 既 不 交付 巴 方 已 订购 的 , 又 不 退还 , 巴 政府 计划 就 此 事 上诉 美国 法院 。
随着 各国 人民 之间 交往 的 增多 , 他们的 文化 、 风俗 、 习惯 等等 也 都 在 相互 传播 , 相互 吸收 , 这 应该 说 是 件 好事



  1. I would like to know whether the Word_Sense_Sememe_File and SememeFile you provide are complete.
  2. Do you plan to develop an interface for incremental training?

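Data preprocessing files & other languages & efficiency

Could you share the data preprocessing files, for example the code that generates the three files for meanings, sememes and the vocabulary?
I would like to run this model on other corpora, but I found that the code is very strict about the input format, and many problems appeared after I rebuilt the input files. Some are caused by the input format, and some require changing parameters in the code itself. So I would like the data preprocessing files in order to eliminate the input problems.

The model cannot be used directly on other Chinese corpora because the vocabulary differs. Simply aligning vocabularies, as suggested in an earlier issue, would probably lose information, so I would like to rebuild the three input files.


Also, does adding sememes slow training down significantly?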

how to interpret the output

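After training (binary=1) I get a .bin file. How should I interpret the output?

Each line of the .bin file contains a Chinese character, followed by an integer i, then embedding_dim * i float numbers, and finally a series of integers.

In the .bin file, each Chinese character is the sense it corresponds to; the integer i that follows is the number of sememe embeddings for that sense, and the float numbers after it are the actual sememe embeddings. What, then, are the integers at the end?

There are two entries with word = "来". The first "来" is the first sense of 来; the integer i = 8 that follows means that 8 sememes make up this sense, and their embeddings are the floating-point numbers that follow. The second "来" has no integer after it and is followed directly by the embedding of the single sememe it consists of.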




problems about evaluation

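The SAT model uses an attention mechanism, and its output contains an embedding for each sense. How should the per-sense embeddings be used in the similarity and analogy tasks? At the moment I simply sum all the sense embeddings to obtain the word embedding, but the results do not seem good. My guess is that the context and attention are still needed, but the paper does not seem to cover this. Could you explain how this is actually implemented?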


Hello, I am running your code on Windows and get the following output:
Starting training using file E:\C++Code\Word2Vector\datasets\Sougo-T(sample).txt
Vocab size: 462667
Words in train file: 2655924606
success load data
success InitNet
success InitUnigramTable
start train

During training, at
// BP
g /= total;
total can be equal to 0. The parameters are the ones set in the README. Could you please help me with this? Thank you.
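One defensive option for the division reported above is to skip the averaging when nothing was accumulated. The sketch below only illustrates that guard: average_gradient is a hypothetical helper, not a function in this repository, and whether returning 0 is the right behaviour depends on what total counts in the actual training loop, which the issue does not state.

```c
#include <assert.h>

/*
 * Average an accumulated gradient over `total` contributions.
 * Returns 0 when nothing was accumulated, so the caller can skip the update
 * instead of dividing by zero (which would produce inf or NaN).
 */
float average_gradient(float g, int total) {
    assert(total >= 0);
    if (total == 0) return 0.0f;
    return g / (float)total;
}
```

A guard like this hides the symptom only; it would still be worth checking why total ends up at 0 for the given input files.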

Segmentation fault when running MST.c

Thread 2 "mst" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff9dd0700 (LWP 560)]
0x0000000008004d41 in TrainModelThread (id=0x0) at mst.c:792
792						g = (label - expTable[(int)((f + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2))]) * alpha;


p syn1neg@10

I found that some of the addresses in the array point to 0x0.


p attention@10

Here too, some of the addresses in the array point to 0x0.
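The crashing statement at mst.c:792 indexes expTable with a value derived from f. The original word2vec.c clamps f to the range [-MAX_EXP, MAX_EXP] before that lookup, and the sketch below reproduces that clamping in a standalone helper. The helper name gradient_scale is hypothetical, and the NULL and NaN guards are additions of this sketch rather than part of word2vec.c. Since the debugger output above shows addresses of 0x0, the root cause may well be an allocation or initialization problem instead, which this guard would not fix.

```c
#include <assert.h>
#include <stddef.h>

#define EXP_TABLE_SIZE 1000
#define MAX_EXP 6

/*
 * Gradient scale g for one (target, sample) pair, following the clamping
 * used in the original word2vec.c. exp_table must hold EXP_TABLE_SIZE
 * precomputed sigmoid values. Returns 0 without touching the table when
 * exp_table is NULL or f is NaN (both guards are additions of this sketch).
 */
float gradient_scale(const float *exp_table, float f, int label, float alpha) {
    if (exp_table == NULL || f != f) return 0.0f;   /* f != f is true only for NaN */
    if (f > MAX_EXP) return (label - 1) * alpha;    /* sigmoid saturates at 1 */
    if (f < -MAX_EXP) return (label - 0) * alpha;   /* sigmoid saturates at 0 */
    int idx = (int)((f + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2));
    assert(idx >= 0 && idx < EXP_TABLE_SIZE);
    return (label - exp_table[idx]) * alpha;
}
```

If line 792 in MST.c is reached without the two range checks, a large or NaN value of f would index far outside the table, which is one thing worth ruling out.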


What do these training files look like?

Hi, I am new to this area, and I am sorry that I can't understand the README clearly. Could you share some examples of your training files? I can't figure out what these 4 training files look like. Thank you~



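Multiple embeddings for one context word

Suppose a sentence contains six words, a b c d e f, and the context window is set to 2.
Then, when building the training set, the target word "c" yields the training pairs [c,a], [c,b], [c,d], [c,e]. When the target word moves to d, it yields [d,b], [d,c], [d,e], [d,f]. So the context words b and e would have different embeddings for different target words. How is this handled?
Similarly, for the SAT model, one target word may also end up with multiple embedding representations.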
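The pair construction described in this question can be sketched as follows. The helper context_positions is hypothetical and uses a fixed symmetric window, whereas the original word2vec.c shrinks the window randomly for each target. In the standard Skip-gram setup the pairs only refer to words by vocabulary index, so b in [c,b] and b in [d,b] look up the same parameter vector, which is updated by both pairs; whether the sememe-encoded models in this repository differ on that point is not stated here.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Collect the context positions for the target at `center` in a sentence of
 * `len` tokens, using a fixed symmetric window. Writes positions into `out`
 * (capacity at least 2 * window) and returns how many were written.
 */
size_t context_positions(size_t len, size_t center, size_t window, size_t *out) {
    assert(center < len);
    size_t count = 0;
    size_t start = center >= window ? center - window : 0;
    size_t end = center + window < len ? center + window : len - 1;
    for (size_t pos = start; pos <= end; pos++) {
        if (pos == center) continue;   /* the target is not its own context */
        out[count++] = pos;
    }
    return count;
}
```

For the six-word sentence above with window 2, the target at position 2 (c) yields positions 0, 1, 3, 4 (a, b, d, e), and the target at position 3 (d) yields positions 1, 2, 4, 5 (b, c, e, f).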


1. In ./word2vec -train TrainFile -output vectors.bin -cbow 0 -size 200 -window 8 ..., does TrainFile refer to Hownet.txt or sougou-T(sample).txt in datasets?
2. How is the trained vector.bin evaluated on word analogy and word similarity? Is code provided, or where can I find these materials?









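Hello, I have obtained results for SAT. Performance is good on similarity and on the mean rank metric for analogy, but on the accuracy metric for analogy alone my results are far from those in the paper. My parameter settings are as follows: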

./word2vec -train train.txt -output SAT_complete_iter2.bin -size 200 -window 8 -alpha 0.025 -cbow 0 -negative 25 -min-count 1 -hs 0 -binary 1 -iter 2 -sample 1e-4 -read-vocab VocabFile -read-meaning SememeFile -read-sense Word_Sense_Sememe_File -threads 10


Originally posted by @immrz in #8 (comment)

help !

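The code on line 752 is attention[q] += _exp[p] * syn0[target].mult_sense_value[p * layer1_size + q]; why is this statement needed? This part is meant to get the "attention" result on senses, but as I understand it, for words with multiple senses the preceding code only uses meaning_syn to obtain each sense and then applies the attention mechanism to obtain the representation of a word, which has nothing to do with the values in mult_sense_value of syn0. So I do not understand this statement. Am I misunderstanding something? I would be very grateful if you could point it out!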


