buppt / chinesener
Chinese named entity recognition, entity extraction, tensorflow, pytorch, BiLSTM+CRF
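Where is the relation extraction part of this project?
I have downloaded your project, but as a beginner I don't know where to start. Do you have any advice on the steps needed to get the project running?
Error output: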
train len: 10721
test len: 3351
valid len 2681
Traceback (most recent call last):
File "train.py", line 141, in <module>
calculate(x_test,y_test,epoch)
File "train.py", line 63, in calculate
if j<len(y) and id2tag[y[j]][0]=='B':
IndexError: string index out of range
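One possible cause, offered as an assumption rather than a confirmed diagnosis: indexing `tag[0]` raises this IndexError only when the tag is an empty string, which could happen if the padding id maps to an empty tag in id2tag. A minimal sketch of a check that tolerates empty tags follows. The dict, the tag names such as B_nr, and the helper is_begin are hypothetical and only for illustration.

```python
# Hypothetical mapping: id 0 stands for padding and maps to an empty tag.
id2tag = {0: "", 1: "O", 2: "B_nr", 3: "M_nr", 4: "E_nr"}

def is_begin(tag_id, id2tag):
    """Return True if the tag starts an entity; empty or unknown tags return False."""
    tag = id2tag.get(tag_id, "")
    # str.startswith never raises on an empty string, unlike tag[0].
    return tag.startswith("B")

y = [2, 3, 4, 1, 0, 0]  # a padded tag-id sequence
begins = [j for j, tag_id in enumerate(y) if is_begin(tag_id, id2tag)]
print(begins)
```

If the empty tag really is the culprit, replacing `id2tag[y[j]][0]=='B'` with a startswith check like this would avoid the crash without changing which positions count as entity starts.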
I reproduced the code in Python here:
https://gitee.com/chashaozgr/noteLibrary/tree/master/nlp_trial/ner/src/bilstm_crf
Data source: People's Daily
Environment: python3.6, tensorflow==1.12
The accuracy I measured matches the figure in the README, but the confusion matrix shows that the model does not perform as well as that number suggests. Because prediction runs on padded sequences, samples of class 0 (the padding positions) far outnumber every other class, so the labels are heavily imbalanced. The accuracy is therefore not credible: most of the 85%+ comes from class 0 being predicted as class 0. Moreover, if the padding length is shortened, precision drops sharply.
Does anyone have a countermeasure or new ideas?
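One common countermeasure is to leave padding positions out of the metric. The sketch below is a toy illustration, not code from the repository: it assumes tag id 0 is used only for padding, and the function name accuracy and the sample sequences are made up.

```python
PAD_ID = 0  # assumption: tag id 0 is used only for padding

def accuracy(y_true, y_pred, ignore_id=None):
    """Token-level accuracy, optionally skipping positions whose gold tag is ignore_id."""
    correct = total = 0
    for gold_seq, pred_seq in zip(y_true, y_pred):
        for gold, pred in zip(gold_seq, pred_seq):
            if gold == ignore_id:
                continue
            total += 1
            correct += int(gold == pred)
    return correct / total if total else 0.0

# Two padded sentences; the model gets the padding right but most real tags wrong.
y_true = [[1, 2, 3, 0, 0, 0, 0, 0], [2, 3, 1, 1, 0, 0, 0, 0]]
y_pred = [[1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 1, 1, 0, 0, 0, 0]]

raw = accuracy(y_true, y_pred)
masked = accuracy(y_true, y_pred, ignore_id=PAD_ID)
print(raw, masked)
```

On this toy data the raw accuracy is 0.75 while the masked accuracy is 3/7, which mirrors the effect described above. Entity-level precision, recall and F1 are usually a better fit for NER than token accuracy in any case.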
Traceback (most recent call last):
File "train.py", line 87, in <module>
test_input(model,sess,word2id,id2tag,batch_size)
File "/mnt/ChineseNER-master/tensorflow/utils.py", line 169, in test_input
entity = get_entity(text,pre[0],id2tag)
File "/mnt/ChineseNER-master/tensorflow/utils.py", line 40, in get_entity
entity=id2tag[y[i][j]][1:]+':'+x[i][j]
File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 0: unexpected end of data
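A likely explanation, stated as an assumption: under Python 2.7 the input text is a byte string, so `x[i][j]` picks out a single byte of a multi-byte UTF-8 character, and 0xe9 is the first byte of many three-byte Chinese characters. The sketch below reproduces the same message in Python 3 syntax with a made-up string and shows that decoding before indexing avoids it.

```python
text_bytes = "陈平".encode("utf-8")  # 6 bytes: each character takes 3 bytes in UTF-8

# Slicing bytes cuts a character in the middle, which cannot be decoded.
try:
    text_bytes[0:1].decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = exc.reason

# Decoding first gives a text string that can be indexed per character.
text = text_bytes.decode("utf-8")
first_char = text[0]
print(error, first_char)
```

In Python 2 the equivalent step would be to call `.decode('utf-8')` on the input text before passing it to get_entity, or to run the code under Python 3 where str is already Unicode.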
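Dataset issue with ../data/Bosondata.pkl: hello, could you send me this dataset?
Why is the MSRA test corpus unlabeled?
This is my first time working with neural networks. Could someone outline the general steps for running the project? Thank you.
Hello, I got the following error when running the pytorch version of the code on a GPU. How can I fix it?
Traceback (most recent call last):
File "train.py", line 59, in <module>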
loss = model.neg_log_likelihood(sentence, tags)
File "/home/sjwang/data111/ChineseNER-master/pytorch/BiLSTM_CRF.py", line 154, in neg_log_likelihood
feats = self._get_lstm_features(sentence)
File "/home/sjwang/data111/ChineseNER-master/pytorch/BiLSTM_CRF.py", line 94, in _get_lstm_features
lstm_out, self.hidden = self.lstm(embeds, self.hidden)
File "/home/sjwang/py/python3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/home/sjwang/py/python3/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 179, in forward
self.dropout, self.training, self.bidirectional, self.batch_first)
RuntimeError: Input and hidden tensors are not at the same device, found input tensor at cuda:0 and hidden tensor at cpu
train len: 36064
test len: 5010
word2id len 4026
Creating the data generator ...
Finished creating the data generator.
begin to train...
Traceback (most recent call last):
File "train.py", line 108, in <module>
model = Model(config,embedding_pre,dropout_keep=0.5)
File "/home/homework/proj/tensorflow/ChineseNER-master/tensorflow/bilstm_crf.py", line 20, in __init__
self._build_net()
File "/home/homework/proj/tensorflow/ChineseNER-master/tensorflow/bilstm_crf.py", line 57, in _build_net
self.viterbi_sequence, viterbi_score = tf.contrib.crf.crf_decode(bilstm_out, self.transition_params,tf.tile(np.array([self.sen_len]),np.array([self.batch_size])))
File "/home/homework/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/crf/python/ops/crf.py", line 537, in crf_decode
false_fn=_multi_seq_fn)
File "/home/homework/anaconda2/lib/python2.7/site-packages/tensorflow/python/layers/utils.py", line 206, in smart_cond
pred, true_fn=true_fn, false_fn=false_fn, name=name)
File "/home/homework/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/smart_cond.py", line 56, in smart_cond
return false_fn()
File "/home/homework/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/crf/python/ops/crf.py", line 501, in _multi_seq_fn
sequence_length_less_one = math_ops.maximum(0, sequence_length - 1)
File "/home/homework/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/gen_math_ops.py", line 4602, in maximum
"Maximum", x=x, y=y, name=name)
File "/home/homework/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 546, in _apply_op_helper
inferred_from[input_arg.type_attr]))
TypeError: Input 'y' of 'Maximum' Op has type int64 that does not match type int32 of argument 'x'.
Googling did not turn up a solution either. Where should I add a dtype conversion?
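One place the int64 may come from, as an assumption: `np.array([self.sen_len])` built from a Python int takes the platform default integer type, which is int64 on most 64-bit Linux builds, while the constant 0 inside crf.py is int32. The NumPy-only sketch below shows how to pin the dtype. The variables sen_len and batch_size are stand-ins for the attributes in the traceback.

```python
import numpy as np

sen_len, batch_size = 60, 32  # stand-ins for self.sen_len and self.batch_size

default_lengths = np.array([sen_len])                # platform default integer type
int32_lengths = np.array([sen_len], dtype=np.int32)  # explicit 32-bit integers

# np.tile plays the role of tf.tile in the original line and keeps the dtype.
sequence_lengths = np.tile(int32_lengths, batch_size)
print(default_lengths.dtype, sequence_lengths.dtype, sequence_lengths.shape)
```

The corresponding change in bilstm_crf.py would be to pass `np.array([self.sen_len], dtype=np.int32)` as the first argument of tf.tile. This has not been tested against the TensorFlow version in the traceback.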
def get_entity(x,y,id2tag):
    entity=""
    res=[]
    for i in range(len(x)): #for every sen
        for j in range(len(x[0])): #for every word
            ...
            ...
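The first of these two for loops is fine, because the batch size is fixed, but the second one is a problem:
len(x[0]) can be greater than 60 because no truncation is applied here, so j can run past 60 and y[i][j] goes out of bounds. I hope the author can change this by adding a check:
for i in range(len(x)): #for every sen
    if len(x[0]) > 60:
        num = 60
    else:
        num = len(x[0])
    for j in range(num): #for every word
This caps the inner loop at 60, so y[i][j] no longer goes out of bounds.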
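An alternative that avoids hard-coding 60 is to iterate over characters and tags together, so the loop stops at the shorter of the two. The sketch below is a simplified stand-in for get_entity, not the repository's code. It assumes tags of the form B_type, M_type, E_type, and the function name get_entities is made up.

```python
def get_entities(x, y, id2tag):
    """Collect 'type:text' entities, never indexing past the shorter of x[i] and y[i]."""
    res = []
    for chars, tag_ids in zip(x, y):              # one sentence at a time
        entity_type, entity_chars = None, []
        for ch, tag_id in zip(chars, tag_ids):    # zip stops at the shorter sequence
            tag = id2tag.get(tag_id, "")
            prefix, _, label = tag.partition("_")
            if prefix == "B":                     # start a new entity
                entity_type, entity_chars = label, [ch]
            elif prefix == "M" and entity_type == label:
                entity_chars.append(ch)           # continue the current entity
            elif prefix == "E" and entity_type == label:
                entity_chars.append(ch)           # close and record the entity
                res.append(entity_type + ":" + "".join(entity_chars))
                entity_type, entity_chars = None, []
            else:                                 # O, padding, or an inconsistent tag
                entity_type, entity_chars = None, []
    return res

id2tag = {0: "", 1: "O", 2: "B_ns", 3: "M_ns", 4: "E_ns"}  # hypothetical tag set
x = [list("我在北京市工作")]   # 7 characters
y = [[1, 1, 2, 3, 4]]          # only 5 tags, shorter than the sentence
print(get_entities(x, y, id2tag))
```

Using zip keeps the bound tied to the data itself, so the loop stays safe if the maximum sentence length is changed later.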
An error occurs when training on Bosondata.pkl with TensorFlow.
File "E:/ChineseNER-master/tensorflow/train.py", line 121, in <module>
entityall = calculate(x_batch,y_batch,id2word,id2tag,entityall)
File "E:\ChineseNER-master\tensorflow\resultCal.py", line 9, in calculate
if id2tag[y[i][j]][0]=='B':
IndexError: string index out of range
Hello, when the input text contains English at test time, I get the error below. How can I fix it? Many thanks.
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[0,0] = 4026 is not in [0, 4026)
[[{{node bilstm_crf/embedding_lookup}} = GatherV2[Taxis=DT_INT32, Tindices=DT_INT32, Tparams=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](bilstm_crf/Assign, _arg_input_data/input_data_0_0, bilstm_crf/embedding_lookup/axis)]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
Caused by op 'bilstm_crf/embedding_lookup', defined at:
File "/ddhome/usr/PythonProjects/NLP/ChineseNER/tf_version/train.py", line 94, in <module>
model = Model(config, embedding_pre, dropout_keep=1)
File "/ddhome/usr/PythonProjects/NLP/ChineseNER/tf_version/bilstm_crf.py", line 32, in __init__
self._build_net()
File "/ddhome/usr/PythonProjects/NLP/ChineseNER/tf_version/bilstm_crf.py", line 40, in _build_net
input_embedded = tf.nn.embedding_lookup(word_embeddings, self.input_data)
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/ops/embedding_ops.py", line 313, in embedding_lookup
transform_fn=None)
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/ops/embedding_ops.py", line 133, in _embedding_lookup_and_transform
result = _clip(array_ops.gather(params[0], ids, name=name),
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py", line 2675, in gather
return gen_array_ops.gather_v2(params, indices, axis, name=name)
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 3332, in gather_v2
"GatherV2", params=params, indices=indices, axis=axis, name=name)
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
op_def=op_def)
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
self._traceback = tf_stack.extract_stack()
InvalidArgumentError (see above for traceback): indices[0,0] = 4026 is not in [0, 4026)
[[node bilstm_crf/embedding_lookup (defined at /ddhome/usr/PythonProjects/NLP/ChineseNER/tf_version/bilstm_crf.py:40) = GatherV2[Taxis=DT_INT32, Tindices=DT_INT32, Tparams=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](bilstm_crf/Assign, _arg_input_data/input_data_0_0, bilstm_crf/embedding_lookup/axis)]]
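The message says that id 4026 was looked up in an embedding table whose valid ids are 0 to 4025. One plausible reading, stated as an assumption, is that characters missing from word2id (such as English letters) are assigned an id equal to the vocabulary size. A minimal sketch of mapping unseen characters to an in-range id follows. UNK_ID, encode and the toy vocabulary are hypothetical.

```python
UNK_ID = 0  # assumption: an in-range id reserved for unknown characters

def encode(text, word2id, unk_id=UNK_ID):
    """Map each character to its id, falling back to unk_id for unseen characters."""
    return [word2id.get(ch, unk_id) for ch in text]

word2id = {"北": 1, "京": 2, "欢": 3, "迎": 4}  # toy vocabulary
vocab_size = len(word2id) + 1                   # +1 for the reserved id 0

ids = encode("北京欢迎you", word2id)
in_range = all(0 <= i < vocab_size for i in ids)  # every id is valid for the embedding table
print(ids, in_range)
```

Reusing id 0 for unknown characters is a simplification. A dedicated unknown token added to the vocabulary before training would be cleaner, since the model could then learn a vector for it.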
Why can the trained model only handle three categories of data?
ValueError: Cannot feed value of shape (32, 60) for Tensor 'input_data:0', which has shape '(32, 50)'
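The error says the input placeholder was built for sequences of length 50 while the batch being fed was padded to 60. This suggests, as an assumption, that the sentence length used to build the model and the one used in preprocessing disagree. A minimal sketch of padding every sentence to one shared constant follows. SEN_LEN and pad_or_truncate are made-up names.

```python
SEN_LEN = 50  # assumption: the same value is used to build the model's input placeholder

def pad_or_truncate(ids, max_len=SEN_LEN, pad_id=0):
    """Return exactly max_len ids: truncated if too long, right-padded if too short."""
    return ids[:max_len] + [pad_id] * max(0, max_len - len(ids))

batch = [list(range(1, 61)), [5, 6, 7]]  # one 60-id sentence and one 3-id sentence
padded = [pad_or_truncate(s) for s in batch]
print([len(s) for s in padded])
```

Whichever length is chosen, the point is that the model config and the data pipeline read it from the same place.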
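Hello, if we want to add stock news data to the training set and introduce a new stock-name label (in addition to existing labels such as loc, name and org), how should we go about it? Many thanks!
How should this be modified? Could you give a hint?
Thanks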
Hello, the article link does not open. Could you post the link again, or give the title of your article?
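x_padding truncates sentences longer than 60. Isn't that somewhat unreasonable?
I downloaded the dataset recently released by Tencent, which is more than ten GB. How can I add it to your project to make the training set more accurate?
Does anyone have code for testing the pytorch version of the model? I have run train.py and saved the model, but as a complete beginner I don't know how to test it.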
'ascii' codec can't decode byte 0x80 in position 1016: ordinal not in range(128)
What causes this? Is there a fix?
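Without the full traceback this is a guess, but this message often appears when a pickle written by Python 2 is loaded in Python 3, because Python 3 decodes Python 2 byte strings as ASCII by default. The sketch below reproduces the failure with hand-written pickle bytes, which are an assumption standing in for the real .pkl file, and shows the encoding argument of pickle.loads.

```python
import pickle

# Bytes that Python 2 would write for the str '\xe9\x99\x88' with pickle protocol 2
# (hand-written here so the sketch runs without a Python 2 interpreter).
py2_pickle = b"\x80\x02U\x03\xe9\x99\x88q\x00."

try:
    pickle.loads(py2_pickle)  # default encoding is ASCII
    failed = False
except UnicodeDecodeError:
    failed = True

as_bytes = pickle.loads(py2_pickle, encoding="bytes")  # keep the raw bytes
as_text = pickle.loads(py2_pickle, encoding="utf-8")   # decode as UTF-8 text
print(failed, as_bytes, as_text)
```

For a file on disk the same argument is accepted by pickle.load. The Python documentation notes that encoding='latin1' is required for unpickling NumPy arrays pickled by Python 2, so that value may be the one to try if the .pkl file contains arrays.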
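Has the author ever seen predictions where the per-character accuracy is very high, yet not a single named entity is extracted?
How can this be solved? Does anyone know?
In ChineseNER/data/renMinRiBao/data_renmin_word.py, lines 85 to 88, why is linedata kept only when numNotO!=0?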
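In the README you mention that the pytorch version uses the BiLSTM_CRF model from the official pytorch tutorial directly. In the official model, however, EMBEDDING_DIM is set to the length of tag_to_index, that is 5, and HIDDEN_DIM is 4.
One blog post I read explains the two parameters this way: there are five tags in total (B, I, O, START, STOP), so EMBEDDING_DIM is 5; HIDDEN_DIM is 4, the number of hidden features of the BiLSTM, which is doubled because the LSTM is bidirectional (2 per direction).
You set EMBEDDING_DIM to 100 and HIDDEN_DIM to 200.
I ran both settings on the People's Daily dataset, with EPOCH set to 30 in both cases, and the F1 scores differ greatly:
(official example) EMBEDDING_DIM = len(tag_to_ix), HIDDEN_DIM = 4: F1 is 55%~60%;
(your version) EMBEDDING_DIM = 100, HIDDEN_DIM = 200: F1 is about 80%.
Are there any rules that must be followed when setting EMBEDDING_DIM and HIDDEN_DIM (for example, that EMBEDDING_DIM must equal the number of tags)?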
with open(label_path, "w") as fw:
    line = []
    for sent_result in label_predict:
        for char, tag, tag_ in sent_result:
            tag = '0' if tag == 'O' else tag
            char = char.encode("utf-8")
            line.append("{} {} {}\n".format(char, tag, tag_))
        line.append("\n")
    fw.writelines(line)
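Has anyone implemented batching on top of the official pytorch BiLSTM_CRF model to speed it up?
How can the generated model be tested on new text so that it shows the corresponding entities?
The error message is:
Traceback (most recent call last):
File "train2pkl.py", line 94, in <module>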
from compiler.ast import flatten
ImportError: No module named 'compiler.ast'
According to what I found online, this is because it was removed in python3. How can I solve this?
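The compiler package was removed in Python 3, so the import has no direct replacement. One workaround is to define a small flatten helper locally. The sketch below flattens nested lists and tuples recursively, which matches how the old function is usually described. Whether it covers every input that train2pkl.py passes in is an assumption.

```python
def flatten(seq):
    """Recursively flatten nested lists and tuples into a flat list."""
    flat = []
    for item in seq:
        if isinstance(item, (list, tuple)):
            flat.extend(flatten(item))
        else:
            flat.append(item)  # strings and other objects are kept whole
    return flat

result = flatten([1, [2, (3, 4)], [[5]], "ab"])
print(result)
```

With this helper defined near the top of the script, the line `from compiler.ast import flatten` can be deleted.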
In ChineseNER/data/renMinRiBao/data_renmin_word.py, about this line:
sentences = re.split('[,。!?、‘’“”:]/[O]'.decode('utf-8'), texts)
How should the /[O] in the regular expression [,。!?、‘’“”:]/[O] be understood?
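As far as the regex syntax goes, [O] is a character class containing only the letter O, so /[O] matches the literal text "/O". The whole pattern therefore matches one of the listed punctuation marks followed by its /O tag, and re.split cuts the text at those points. A small sketch follows. The sample line in "character/tag" layout is an assumption about what the preprocessing output looks like.

```python
import re

# [O] is a character class containing only the letter O, so /[O] matches the literal "/O".
pattern = re.compile("[,。!?、‘’“”:]/[O]")

# Hypothetical line in a "character/tag" layout; the real preprocessing output may differ.
texts = "北/B_ns 京/E_ns 很/O 大/O ,/O 我/O 喜/O 欢/O 。/O"
sentences = [s.strip() for s in pattern.split(texts) if s.strip()]
print(sentences)
```

Only punctuation tagged as O acts as a separator here, so the split produces sentence-like chunks and drops the punctuation itself.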
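In train.py of the tensorflow version, when the pre-trained vectors are loaded, the original code looks things up through word2id and so gets the id rather than the character. The vectors in word2vec are keyed by character, so word2id should be changed to id2word. The code still runs without this change, but embedding_pre is then all 1s instead of the character vectors from vec.txt. I suggest the author fix this pitfall; it is very hard to spot.
I used this model in a project and built a web service on the flask framework. Under concurrency (for example, 100 simultaneous inputs), tested with jmeter, the response time (entity extraction only, excluding model loading time) is very slow. Throughput peaks at only 12.2/sec and the response time is about 8050 ms.
Has anyone worked on optimizing this?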
For example, if a word (北大) is not recognized as an organisation, is there a way to add it so that the model learns to recognize it?
File "E:\appsnew\codes\gitspace\ChineseNER\tensorflowVersion\utils.py", line 12, in calculate
if id2tag[y[i][j]][0]=='B':
IndexError: string index out of range
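How can this be solved?
When predicting with pytorch, after obtaining the tag for each character, do we just look for the tag sequences that can be combined into an entity?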