thunlp / NRE
Neural Relation Extraction, including CNN, PCNN, CNN+ATT, PCNN+ATT
License: MIT License
mldl@mldlUB1604:~/ub16_prj/NRE/CNN+ATT$ ./train
Init Begin.
wordTotal= 114042
Word dimension= 50
Segmentation fault (core dumped)
mldl@mldlUB1604:~/ub16_prj/NRE/CNN+ATT$ ll ../data
total 215124
drwxr-xr-x  2 mldl mldl      4096 May  2 03:29 ./
drwxrwxr-x 10 mldl mldl      4096 May  2 03:29 ../
-rw-r--r--  1 mldl mldl    584570 Jul 16  2016 entity2id.txt
-rw-r--r--  1 mldl mldl      1851 Jul 16  2016 relation2id.txt
-rw-r--r--  1 mldl mldl  48268627 Jul 16  2016 test.txt
-rw-r--r--  1 mldl mldl 147456013 Jul 16  2016 train.txt
-rw-r--r--  1 mldl mldl  23955231 Jul 16  2016 vec.bin
mldl@mldlUB1604:~/ub16_prj/NRE/CNN+ATT$
Hi,
I'm trying to reproduce the PR curves in the paper. However, I find that the pr.txt files in the repository do not match the curves reported in the paper (PCNN+ATT, for example).
Are these files generated by models that are not fully trained?
Can you provide pr.txt files that can reproduce curves in the paper?
Much appreciated.
For the code at: https://github.com/thunlp/NRE/blob/master/CNN%2BATT/train.cpp
I can't follow how the gradients are calculated and the parameters are updated, roughly from line 193 to line 238.
Could anyone explain, please?
@Mrlyk423
The data description says train.txt is the training file, format (fb_mid_e1, fb_mid_e2, e1_name, e2_name, relation, sentence).
What exactly are fb_mid_e1 and fb_mid_e2 — how are they defined?
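Under the format quoted above, one line of train.txt can be split like this (a minimal sketch, assuming the first five fields are whitespace-separated and the rest of the line is the sentence; the example values below are made up):

```cpp
#include <cassert>
#include <sstream>
#include <string>

// One training instance, fields named after the README's format string.
struct Instance {
    std::string mid_e1, mid_e2, e1_name, e2_name, relation, sentence;
};

// Split one line of train.txt: five whitespace-separated fields,
// then the remainder of the line as the sentence.
Instance parseLine(const std::string &line) {
    std::istringstream ss(line);
    Instance ins;
    ss >> ins.mid_e1 >> ins.mid_e2 >> ins.e1_name >> ins.e2_name >> ins.relation;
    std::getline(ss, ins.sentence);
    if (!ins.sentence.empty() && ins.sentence[0] == ' ')
        ins.sentence.erase(0, 1);  // drop the separator space
    return ins;
}
```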
Hi,
Thanks for the great work. I'm trying to reproduce the results but am having some trouble.
First, I used the pr.txt file in the NRE/PCNN+ATT/out/ directory and plotted the PR curve.
Then, I re-ran the test part of the program (without retraining) and plotted the PR curve from the newly generated pr.txt file.
Somehow, the two curves are different from each other, and neither matches Figure 3 in the paper. Could you please elaborate on the possible reasons? Thanks.
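For reference, precision/recall pairs like those in a pr.txt can be derived as follows (a sketch of the usual construction, not necessarily how this repo's test.cpp writes the file): sort predictions by confidence descending and emit cumulative precision and recall after each one.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <utility>
#include <vector>

// preds: (confidence, is-correct) pairs; totalPositives: number of gold facts.
// Returns one (precision, recall) point per prediction, best-scored first.
std::vector<std::pair<double, double>> prPoints(
        std::vector<std::pair<double, bool>> preds, int totalPositives) {
    std::sort(preds.begin(), preds.end(),
              [](const std::pair<double, bool> &a,
                 const std::pair<double, bool> &b) { return a.first > b.first; });
    std::vector<std::pair<double, double>> pts;
    int correct = 0;
    for (size_t i = 0; i < preds.size(); i++) {
        if (preds[i].second) correct++;
        pts.push_back({(double)correct / (i + 1),           // precision
                       (double)correct / totalPositives});  // recall
    }
    return pts;
}
```

Differences between two pr.txt files then come down to differences in the underlying scores, which is why a retested model can diverge from the shipped file.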
Hi,
[s CNN+ATT]$ ls
init.h log.txt makefile out test.cpp test.h train.cpp
[s CNN+ATT]$ make
g++ train.cpp -o train -O2 -lpthread
g++ test.cpp -o test -O2 -lpthread
[sharmistha@momo CNN+ATT]$ ./train
Init Begin.
All the training files are in the correct directory. Please let me know how to resolve this issue.
Thanks
Hello. Most models nowadays split the input sentence into three segments at the entity positions, and each segment is then padded or truncated to a fixed length. If I use truncation, how should the segment between the two entities be truncated?
For example, in "XXX Obama XXXXXXXXXXXXX USA XXX", the part to the left of Obama and the part to the right of USA can be shortened by dropping the words farthest from the entities, but how should the part between Obama and USA be handled? Thanks!
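One common policy for the question above (an illustrative assumption, not something this repo implements) is to truncate the middle segment from its center, keeping the tokens closest to either entity:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Truncate the tokens between the two entities to maxLen by dropping tokens
// from the center, so that words adjacent to either entity survive.
std::vector<std::string> truncateMiddle(const std::vector<std::string> &mid,
                                        size_t maxLen) {
    if (mid.size() <= maxLen) return mid;
    size_t left = (maxLen + 1) / 2;      // tokens kept next to the first entity
    size_t right = maxLen - left;        // tokens kept next to the second entity
    std::vector<std::string> out(mid.begin(), mid.begin() + left);
    out.insert(out.end(), mid.end() - right, mid.end());
    return out;
}
```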
On Ubuntu 16.04, after compiling, running ./test produces a segmentation fault.
I tried to reproduce CNN/PCNN with TensorFlow. In the test phase, I treat each sentence as a one-sentence bag and draw the precision-recall curve. Surprisingly, a plain CNN/PCNN gets better performance than CNN/PCNN+ONE/ATT.
Another question: in my runs the performance of the various models looks very similar, while in the papers it looks very different. Why?
Your paper states that the training set has 522,611 sentences and the test set 172,448. In the released data.zip, the test set does have 172,448 lines, but only 61,707 after deduplicating sentences; the training set has 570,088 lines and 368,099 unique sentences, and even deduplicating jointly on sentence + entity pair + relation gives 510,415, not 522,611.
Where does the discrepancy come from? What does "number of sentences" refer to in your paper?
Hi, something about the attention confuses me.
r is the query vector associated with relation r (i.e. the relation representation).
During training, is r the embedding of the target relation label? If so, at test time, which r should be chosen to compute the attention weights for the instances in a bag?
Am I misunderstanding something in the paper?
Thanks.
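To make the question concrete, here is a sketch of selective attention under the simplifying assumption that the query is just the relation embedding r (bilinear matrix omitted): each sentence encoding s_i gets weight softmax(s_i · r), and the bag representation is their weighted sum. At test time the gold relation is unknown, so one common strategy is to score the bag once per candidate relation, using that relation's own r each time.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Selective attention over one bag: sents holds the sentence encodings,
// r is the query (relation embedding). Returns the attended bag vector.
std::vector<double> attendBag(const std::vector<std::vector<double>> &sents,
                              const std::vector<double> &r) {
    size_t n = sents.size(), d = r.size();
    std::vector<double> e(n, 0.0);
    for (size_t i = 0; i < n; i++)                 // e_i = s_i . r
        for (size_t k = 0; k < d; k++) e[i] += sents[i][k] * r[k];
    double z = 0.0;
    for (size_t i = 0; i < n; i++) z += std::exp(e[i]);
    std::vector<double> bag(d, 0.0);
    for (size_t i = 0; i < n; i++) {
        double a = std::exp(e[i]) / z;             // attention weight alpha_i
        for (size_t k = 0; k < d; k++) bag[k] += a * sents[i][k];
    }
    return bag;
}
```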
What does P@N mean? What does N stand for?
In line 12 of PCNN+ATT/test.h, the code is
`for (int i = 0; i < 3 * dimensionC; i++) {`
However, I think it should be
`for (int i = 0; i < dimensionC; i++) {`
which is similar to line 44 of train.cpp.
The original code in test.h effectively assumes there are 3 * dimensionC convolution kernels, while there are actually only dimensionC kernels.
I tested the modified test code with the released trained parameters, and the PR curve matches the curve in the paper. It also runs much faster than the original code.
My file looks like this:
"V70 drive alarm F30001 overcurrent"
"Motor power cable interference"
Hello:
I used your code following the README exactly, and found that the p/r values printed on screen for the last training pass are identical to those printed by test. I have only just started learning NLP and don't understand why this happens, so I'm asking for your help.
./train
…
tot:1950
persicon:1 recall :0.00512821
persicon:0.782178 recall :0.0405128
…
./test
tot:1950
persicon:1 recall :0.00512821
persicon:0.782178 recall :0.0405128
…
Hi.
When I read the code, I found that PCNN+ATT doesn't seem to contain any attention part, while CNN+ATT does.
Concretely, init.h in the CNN+ATT directory defines the attention-related matrices att_W and att_W_Dao, which are used and trained in the training step of train.cpp, but I haven't found the corresponding parts in PCNN+ATT.
Could you check this?
When I use the parameters in the out folder to run test.cpp, I can't get the same result as in your chart. Could you provide the parameters used to generate the charts? I would be very grateful.
Would you mind sharing how the word-embedding file was created, i.e. what procedure was used? Also, if I want to apply this algorithm to my own dataset, how should I create a word-embedding file for it?
When loading train.txt, I found relation types that do not appear in relation2id.txt, e.g. /people/ethnicity/includes_groups. Reading the NRE code, it seems these are treated as the NA relation.
Were these relations left out of relation2id.txt because they occur too rarely? If so, is it acceptable to either map them to NA or ignore them entirely?
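The fallback described above can be sketched as a simple table lookup (assuming, as in the released relation2id.txt, that NA maps to id 0):

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Map a relation string to its id, falling back to NA (assumed id 0)
// for any relation that relation2id.txt does not list.
int relationId(const std::unordered_map<std::string, int> &rel2id,
               const std::string &rel) {
    auto it = rel2id.find(rel);
    return it == rel2id.end() ? 0 : it->second;  // 0 = NA by assumption
}
```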
Hi, in /CNN+ATT/test.h, line 114, it seems that the function `vector<double> test(int *sentence, int *testPositionE1, int *testPositionE2, int len, float *r)` is never used, since the `vector<double> score` it returns is never used in the following code.
Therefore `*r`, which should hold the sentence encoding computed by that function, is effectively a random vector in `void* testMode(void *id)` and is then pushed into `r_tmp`.
Did I miss something? Thanks!
In the test.txt file of the data archive, almost all relation types are NA; only a small portion of the data has a real relation type. Is that expected?
From this line of code, it seems that the first match between a head (or tail) entity and a word is treated as the correct entity mention in the sentence. But when a sentence contains several mentions of the entity, this is not necessarily true.
head: brooklyn
tail: eastern parkway
sentence: brooklyn museum , 200 eastern parkway , brooklyn , (718) 638-5000 .
The original dataset contains the necessary index information, but it seems that the preprocessed data in this repo doesn't include it.
Please correct me if I am wrong.
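To make the failure mode concrete (a toy sketch, not the repo's actual preprocessing code): a first-match scan over the tokens of the example sentence always returns the earliest occurrence of the entity string.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Return the index of the first token equal to the entity name, or -1.
int firstMatch(const std::vector<std::string> &tokens,
               const std::string &entity) {
    for (size_t i = 0; i < tokens.size(); i++)
        if (tokens[i] == entity) return static_cast<int>(i);
    return -1;
}
```

On the tokens of "brooklyn museum , 200 eastern parkway , brooklyn , …", firstMatch(tokens, "brooklyn") returns 0 (the "brooklyn" of "brooklyn museum"), even when the intended mention is the later standalone "brooklyn".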
Hello,
I'd like to ask how the P@N evaluation is done.
Is the ground-truth label in this step taken from the distant-supervision result, or judged manually?
I believe it should be judged manually, but in the code at https://github.com/thunlp/TensorFlow-NRE , the P@N evaluation uses the distant-supervision labels of the test set as the ground truth.
Please let me know, thanks!
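For context, P@N itself is just the fraction of correct predictions among the top N by confidence; whether "correct" comes from manual judgment or distant-supervision labels is exactly the question above. A minimal sketch of the metric:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <utility>
#include <vector>

// preds: (confidence, is-correct) pairs. Returns precision among the
// N highest-confidence predictions.
double precisionAtN(std::vector<std::pair<double, bool>> preds, size_t n) {
    std::sort(preds.begin(), preds.end(),
              [](const std::pair<double, bool> &a,
                 const std::pair<double, bool> &b) { return a.first > b.first; });
    n = std::min(n, preds.size());
    size_t correct = 0;
    for (size_t i = 0; i < n; i++)
        if (preds[i].second) correct++;
    return n ? (double)correct / n : 0.0;
}
```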
Hello,
Is the NA type included when computing precision? In the training corpus, NA accounts for nearly 80% of the data, so many instances are very likely to be predicted as NA, and in the test data NA is nearly 90%. So if NA is included, the precision is indeed very high, but excluding NA, the precision is very low.
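The two precision variants discussed above differ only in whether NA predictions are counted (a sketch, assuming NA is relation id 0 as in relation2id.txt):

```cpp
#include <cassert>
#include <vector>

struct Pred { int gold; int predicted; };

// Precision over all predictions, or only over non-NA predictions
// (predicted != 0) when includeNA is false.
double precision(const std::vector<Pred> &ps, bool includeNA) {
    int correct = 0, total = 0;
    for (const Pred &p : ps) {
        if (!includeNA && p.predicted == 0) continue;  // skip NA predictions
        total++;
        if (p.predicted == p.gold) correct++;
    }
    return total ? (double)correct / total : 0.0;
}
```

With NA dominating both corpora, the includeNA variant is inflated by the many easy (NA, NA) pairs, which matches the observation above.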
The entity IDs I parse directly from http://iesl.cs.umass.edu/riedel/ecml/ all start with /guid/, while the entity IDs in NRE's processed data start with m.xxx. How was the mapping between these two kinds of IDs obtained?
When I ran the test program in CNN+ATT, a SEGV signal occurred in init.h:71
=================================================================
==25628==ERROR: AddressSanitizer: SEGV on unknown address 0x0000000000c0 (pc 0x7fa794dc2908 bp 0x7ffca4461380 sp 0x7ffca4460ca0 T0)
#0 0x7fa794dc2907 in _IO_vfscanf (/lib/x86_64-linux-gnu/libc.so.6+0x5b907)
#1 0x7fa795c415d0 in vfscanf (/usr/lib/x86_64-linux-gnu/libasan.so.2+0x525d0)
#2 0x7fa795c41749 in __interceptor_fscanf (/usr/lib/x86_64-linux-gnu/libasan.so.2+0x52749)
#3 0x402c1e in init() /home/mfc_fuzz/NRE/CNN+ATT/init.h:71
#4 0x40e400 in main /home/mfc_fuzz/NRE/CNN+ATT/test.cpp:99
#5 0x7fa794d8782f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)
#6 0x4028e8 in _start (/home/mfc_fuzz/NRE/CNN+ATT/test+0x4028e8)
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV ??:0 _IO_vfscanf
==25628==ABORTING
How did you (or Riedel) construct the dataset with regard to the order of entities in a sentence?
In the training set, there are instances for both an entity pair (e1, e2) and the reversed pair (e2, e1).
Moreover, not only do those two entity pairs not share sentence instances (relation mentions), but within each pair the entities also appear in no fixed order: for the (e1, e2) pair, there are both sentences in which e1 appears before e2 and sentences in which e2 appears before e1.
I think the order of entities is important, since PCNN uses position embeddings.
If there is no triple for the (e1, e2) entity pair in Freebase, which sentences are assigned as training instances for (e1, e2)-None, and which for (e2, e1)-None?
Thank you :)
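Since position embeddings came up: a minimal sketch of the relative-position feature that PCNN-style models use (the clipping range maxDist and the shift are illustrative choices, not this repo's exact values). Each token gets its signed distance to an entity, clipped and shifted to a non-negative index into an embedding table; swapping e1 and e2 changes these features, which is why the entity order matters.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// For each token position i, compute (i - entityPos) clipped to
// [-maxDist, maxDist], then shift into [0, 2 * maxDist] so it can
// index a position-embedding table.
std::vector<int> relativePositions(int sentLen, int entityPos, int maxDist) {
    std::vector<int> pos(sentLen);
    for (int i = 0; i < sentLen; i++) {
        int d = std::max(-maxDist, std::min(maxDist, i - entityPos));
        pos[i] = d + maxDist;
    }
    return pos;
}
```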