buppt / chinesener

Chinese named entity recognition and entity extraction with TensorFlow and PyTorch, using BiLSTM+CRF

Python 100.00%

ner named-entity-recognition chinese bilstm-crf tensorflow pytorch

chinesener's Issues

Hello

I have downloaded your project, but as a beginner I do not know where to start. Could you suggest the steps needed to get the project running?

Error output:
train len: 10721
test len: 3351
valid len 2681
Traceback (most recent call last):
  File "train.py", line 141, in <module>
    calculate(x_test,y_test,epoch)
  File "train.py", line 63, in calculate
    if j<len(y) and id2tag[y[j]][0]=='B':
IndexError: string index out of range
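
An `IndexError: string index out of range` raised by `id2tag[y[j]][0]` means the tag string itself is empty, because `[0]` only fails on `''`. One possible cause, which is an assumption here and not confirmed by the report, is an id (for example a padding id) that maps to an empty tag. A minimal sketch of a safer check uses `str.startswith`, which returns False for an empty string. The `id2tag` mapping and the helper name below are hypothetical:

```python
# Hypothetical mapping: id 0 stands for padding and maps to an empty tag.
id2tag = {0: "", 1: "O", 2: "B_ns", 3: "E_ns"}


def is_entity_start(tag_id, id2tag):
    """Return True if the tag begins an entity, without indexing into ''."""
    tag = id2tag.get(tag_id, "")
    # "".startswith("B") is simply False, whereas ""[0] raises IndexError.
    return tag.startswith("B")


y = [1, 2, 3, 0, 0]  # one padded label sequence
starts = [j for j in range(len(y)) if is_entity_start(y[j], id2tag)]
print(starts)
```

If empty tags do turn out to be the cause, it is also worth checking how `id2tag` was built, since the guard only hides the symptom.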

Prediction accuracy issue

I reproduced the code in Python here:
https://gitee.com/chashaozgr/noteLibrary/tree/master/nlp_trial/ner/src/bilstm_crf

Data source: People's Daily
Environment: python3.6, tensorflow==1.12

The accuracy I measured matches the README, but the confusion matrix shows that the result is not as good as it looks. Because predictions are made on padded sequences, samples of class 0 (the padding positions) far outnumber every other class, so the labels are heavily imbalanced and the accuracy figure is not credible: most of the 85%+ accuracy comes from class 0 being predicted as class 0. If the padding length is shortened, precision drops sharply.

Does anyone have suggestions or new ideas for addressing this?
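
One common way to make the number meaningful is to leave padding positions out of the metric. The sketch below is not the repository's evaluation code. It assumes the padding label id is 0 (`PAD_ID` is a hypothetical name) and compares accuracy with and without the padding mask on a toy example:

```python
PAD_ID = 0  # assumed id of the padding label


def token_accuracy(y_true, y_pred, ignore_id=None):
    """Accuracy over all positions, optionally skipping gold labels equal to ignore_id."""
    correct = total = 0
    for gold_seq, pred_seq in zip(y_true, y_pred):
        for gold, pred in zip(gold_seq, pred_seq):
            if ignore_id is not None and gold == ignore_id:
                continue  # padding position: do not count it
            total += 1
            correct += int(gold == pred)
    return correct / total if total else 0.0


# One sentence of 4 real tokens padded to length 10.
y_true = [[1, 2, 3, 1, 0, 0, 0, 0, 0, 0]]
y_pred = [[1, 1, 1, 1, 0, 0, 0, 0, 0, 0]]

raw = token_accuracy(y_true, y_pred)
masked = token_accuracy(y_true, y_pred, ignore_id=PAD_ID)
print(raw, masked)
```

Here the unmasked figure is inflated by the six padding positions. Entity-level precision, recall and F1 computed on the unpadded part of each sentence would be a further step in the same direction.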

Incomplete code

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    saver = tf.train.Saver()
    train(model, sess, saver, epochs, batch_size, data_train, data_test, id2word, id2tag)

Hello, I would like to ask about an encoding error at run time. I have changed the code many times following fixes found online, but it still fails.

Traceback (most recent call last):
  File "train.py", line 87, in <module>
    test_input(model,sess,word2id,id2tag,batch_size)
  File "/mnt/ChineseNER-master/tensorflow/utils.py", line 169, in test_input
    entity = get_entity(text,pre[0],id2tag)
  File "/mnt/ChineseNER-master/tensorflow/utils.py", line 40, in get_entity
    entity=id2tag[y[i][j]][1:]+':'+x[i][j]
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 0: unexpected end of data
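
In Python 2.7 a plain `str` is a byte string, so indexing it with `x[i][j]` can return a single byte from the middle of a multi-byte UTF-8 character. Decoding that lone byte then fails with "unexpected end of data". Whether the input text was an undecoded byte string in this case is an assumption. The sketch below reproduces the effect in Python 3, using `bytes` to stand in for a Python 2 `str`:

```python
text_bytes = "高校".encode("utf-8")   # what a Python 2 str would hold

first_byte = text_bytes[0:1]          # slicing bytes cuts a character apart
print(hex(first_byte[0]))             # first byte of the 3-byte encoding of "高"

try:
    first_byte.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True              # a lone lead byte is not valid UTF-8

# Decoding the whole string first and then indexing yields whole characters.
text = text_bytes.decode("utf-8")
first_char = text[0]
print(decode_failed, first_char)
```

Under that assumption, decoding the input once (for example `text.decode('utf-8')` in Python 2) before it reaches `get_entity`, or running under Python 3, would avoid splitting characters.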

../data/Bosondata.pkl problem

A question about the ../data/Bosondata.pkl dataset: hello, could you send me this dataset?

MSRA corpus

Why is the MSRA test corpus unannotated?

Could someone share the steps for running the TensorFlow version?

This is my first time working with neural networks. Could someone outline the general steps for running it? Thanks.

Running the PyTorch version on a GPU

Hello, I got the following error when running the PyTorch version on a GPU. How can I fix it?
Traceback (most recent call last):
  File "train.py", line 59, in <module>
    loss = model.neg_log_likelihood(sentence, tags)
  File "/home/sjwang/data111/ChineseNER-master/pytorch/BiLSTM_CRF.py", line 154, in neg_log_likelihood
    feats = self._get_lstm_features(sentence)
  File "/home/sjwang/data111/ChineseNER-master/pytorch/BiLSTM_CRF.py", line 94, in _get_lstm_features
    lstm_out, self.hidden = self.lstm(embeds, self.hidden)
  File "/home/sjwang/py/python3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/sjwang/py/python3/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 179, in forward
    self.dropout, self.training, self.bidirectional, self.batch_first)
RuntimeError: Input and hidden tensors are not at the same device, found input tensor at cuda:0 and hidden tensor at cpu

next_batch in Batch

train error

train len: 36064
test len: 5010
word2id len 4026
Creating the data generator ...
Finished creating the data generator.
begin to train...
Traceback (most recent call last):
  File "train.py", line 108, in <module>
    model = Model(config,embedding_pre,dropout_keep=0.5)
  File "/home/homework/proj/tensorflow/ChineseNER-master/tensorflow/bilstm_crf.py", line 20, in __init__
    self._build_net()
  File "/home/homework/proj/tensorflow/ChineseNER-master/tensorflow/bilstm_crf.py", line 57, in _build_net
    self.viterbi_sequence, viterbi_score = tf.contrib.crf.crf_decode(bilstm_out, self.transition_params,tf.tile(np.array([self.sen_len]),np.array([self.batch_size])))
  File "/home/homework/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/crf/python/ops/crf.py", line 537, in crf_decode
    false_fn=_multi_seq_fn)
  File "/home/homework/anaconda2/lib/python2.7/site-packages/tensorflow/python/layers/utils.py", line 206, in smart_cond
    pred, true_fn=true_fn, false_fn=false_fn, name=name)
  File "/home/homework/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/smart_cond.py", line 56, in smart_cond
    return false_fn()
  File "/home/homework/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/crf/python/ops/crf.py", line 501, in _multi_seq_fn
    sequence_length_less_one = math_ops.maximum(0, sequence_length - 1)
  File "/home/homework/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/gen_math_ops.py", line 4602, in maximum
    "Maximum", x=x, y=y, name=name)
  File "/home/homework/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 546, in _apply_op_helper
    inferred_from[input_arg.type_attr]))
TypeError: Input 'y' of 'Maximum' Op has type int64 that does not match type int32 of argument 'x'.

Searching Google did not turn up a solution either. Where should I add a data type conversion?

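Bug in the get_entity() function in utils.py of the TensorFlow version

def get_entity(x,y,id2tag):
    entity=""
    res=[]
    for i in range(len(x)): #for every sen
        for j in range(len(x[0])): #for every word
            ...
            ...
Of these two for loops, the first is fine because the batch size is fixed, but the second is a problem.
len(x[0]) may be greater than 60, and since no truncation is done here, j can reach 60 or more and y[i][j] goes out of bounds. I hope the author can change this by adding a check:
    for i in range(len(x)): #for every sen
        if len(x[0]) > 60:
            num = 60
        else:
            num = len(x[0])
        for j in range(num): #for every word

This caps the loop at 60, so y[i][j] can no longer go out of bounds.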
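
A variant of the same idea that does not hard-code 60 is to bound the inner loop by the shorter of the two sequences. The sketch below is not the repository's `get_entity`. The function name, the B/M/E tag scheme and the sample `id2tag` are assumptions for illustration, and the entity string keeps the `tag[1:] + ':' + char` form seen in the traceback from another report:

```python
def get_entity_safe(x, y, id2tag):
    """Collect 'TYPE:chars' strings, never reading past the shorter of x[i] and y[i]."""
    res = []
    for i in range(min(len(x), len(y))):          # for every sentence
        entity = ""
        # Bound j by both sequences instead of assuming a fixed length of 60.
        for j in range(min(len(x[i]), len(y[i]))):  # for every character
            tag = id2tag.get(y[i][j], "O")
            if tag.startswith("B"):
                entity = tag[1:] + ":" + x[i][j]
            elif tag.startswith("M") and entity:
                entity += x[i][j]
            elif tag.startswith("E") and entity:
                res.append(entity + x[i][j])
                entity = ""
            else:
                entity = ""
    return res


# Hypothetical tag set; the input sentence is longer than the prediction.
id2tag = {0: "", 1: "O", 2: "B_ns", 3: "M_ns", 4: "E_ns"}
x = [list("北京市很大")]
y = [[2, 3, 4]]
print(get_entity_safe(x, y, id2tag))
```

Using `x[i]` and `y[i]` for each sentence, instead of `x[0]`, also covers batches whose rows differ in length.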

Error

An error occurred when training on Bosondata.pkl with TensorFlow.
  File "E:/ChineseNER-master/tensorflow/train.py", line 121, in <module>
    entityall = calculate(x_batch,y_batch,id2word,id2tag,entityall)
  File "E:\ChineseNER-master\tensorflow\resultCal.py", line 9, in calculate
    if id2tag[y[i][j]][0]=='B':
IndexError: string index out of range

tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[0,0] = 4026 is not in [0, 4026)

Hello, when the input text at test time contains English, the following error occurs. How can I fix it? Many thanks.
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[0,0] = 4026 is not in [0, 4026)
[[{{node bilstm_crf/embedding_lookup}} = GatherV2[Taxis=DT_INT32, Tindices=DT_INT32, Tparams=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](bilstm_crf/Assign, _arg_input_data/input_data_0_0, bilstm_crf/embedding_lookup/axis)]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
Caused by op 'bilstm_crf/embedding_lookup', defined at:
  File "/ddhome/usr/PythonProjects/NLP/ChineseNER/tf_version/train.py", line 94, in <module>
    model = Model(config, embedding_pre, dropout_keep=1)
  File "/ddhome/usr/PythonProjects/NLP/ChineseNER/tf_version/bilstm_crf.py", line 32, in __init__
    self._build_net()
  File "/ddhome/usr/PythonProjects/NLP/ChineseNER/tf_version/bilstm_crf.py", line 40, in _build_net
    input_embedded = tf.nn.embedding_lookup(word_embeddings, self.input_data)
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/ops/embedding_ops.py", line 313, in embedding_lookup
    transform_fn=None)
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/ops/embedding_ops.py", line 133, in _embedding_lookup_and_transform
    result = _clip(array_ops.gather(params[0], ids, name=name),
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py", line 2675, in gather
    return gen_array_ops.gather_v2(params, indices, axis, name=name)
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 3332, in gather_v2
    "GatherV2", params=params, indices=indices, axis=axis, name=name)
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
    op_def=op_def)
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): indices[0,0] = 4026 is not in [0, 4026)
[[node bilstm_crf/embedding_lookup (defined at /ddhome/usr/PythonProjects/NLP/ChineseNER/tf_version/bilstm_crf.py:40) = GatherV2[Taxis=DT_INT32, Tindices=DT_INT32, Tparams=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](bilstm_crf/Assign, _arg_input_data/input_data_0_0, bilstm_crf/embedding_lookup/axis)]]
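
The message `indices[0,0] = 4026 is not in [0, 4026)` says the embedding table has 4026 rows (valid ids 0 to 4025) while the input contained id 4026. Since the error is reported for text containing English, one plausible cause, stated here as an assumption, is that characters missing from `word2id` receive an id outside the table. The sketch below maps unseen characters to a reserved in-range id instead. The function name, `UNK_ID` and the toy vocabulary are hypothetical:

```python
def text_to_ids(text, word2id, unk_id, vocab_size):
    """Map each character to an id that is valid for a table of vocab_size rows."""
    if not 0 <= unk_id < vocab_size:
        raise ValueError("unk_id must be a valid row of the embedding table")
    ids = []
    for ch in text:
        idx = word2id.get(ch, unk_id)
        # Guard against vocabulary entries that point past the table as well.
        ids.append(idx if 0 <= idx < vocab_size else unk_id)
    return ids


word2id = {"北": 1, "京": 2}   # toy vocabulary; the real one has thousands of entries
VOCAB_SIZE = 4                 # rows in the embedding table: ids 0..3
UNK_ID = 3                     # reserved row for unseen characters (assumption)

ids = text_to_ids("北京AI", word2id, UNK_ID, VOCAB_SIZE)
print(ids)
```

The alternative is to size the embedding table as `len(word2id) + 1` so that the extra id has a row of its own. Either way the ids and the table size have to agree.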

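Data handling in the TensorFlow version

Why can the trained model only handle three types of data?

Could someone please explain this error? Why is the tensor shape wrong?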

ValueError: Cannot feed value of shape (32, 60) for Tensor 'input_data:0', which has shape '(32, 50)'
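
The error says the `input_data` placeholder was built for sentences of length 50, while the batch being fed has length 60, so the two settings have to be brought into agreement. One option is to rebuild the graph with a sentence length of 60. Another is to pad or truncate every sequence to the placeholder's length, as in the sketch below, where `SEN_LEN` and `pad_or_truncate` are hypothetical names:

```python
def pad_or_truncate(ids, max_len, pad_id=0):
    """Return a list of exactly max_len ids."""
    ids = list(ids)[:max_len]                      # cut sequences that are too long
    return ids + [pad_id] * (max_len - len(ids))   # pad sequences that are too short


SEN_LEN = 50  # must match the second dimension of the 'input_data' placeholder

batch = [list(range(1, 61)), [7, 8, 9]]            # lengths 60 and 3
fixed = [pad_or_truncate(seq, SEN_LEN) for seq in batch]
print([len(seq) for seq in fixed])
```

Truncation discards the tail of long sentences, so splitting long inputs into several sentences before padding may be preferable when that matters.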

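Adding labels for entity recognition

Hello, I have a question. If we want to add stock news data to the training set and add a label for stock names (a new label in addition to loc, name, org and so on), how should we go about it? Many thanks!

Can this be used as a part-of-speech tagger?

How should it be modified? Could you give some hints?

Thank you.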

Problem running train.py in the PyTorch version

This is the problem I ran into at run time. I do not understand what causes the error.

"This article"

Hello, the "this article" link no longer opens. Could you post the link again, or give the title of the article?

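About x_padding in data_renmin_word.py

x_padding truncates sentences longer than 60. Isn't that somewhat unreasonable?

Hello

I downloaded the dataset recently released by Tencent, which is more than ten GB. How can I add it to your project to improve the training data?

Test code for the PyTorch model

Does anyone have test code for the PyTorch model? I have run train.py and saved the model, but as a beginner I do not know how to test it...

pkl encoding error in the PyTorch version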

'ascii' codec can't decode byte 0x80 in position 1016: ordinal not in range(128)

What causes this? Is there a fix?
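
An `'ascii' codec can't decode byte ...` error while loading a `.pkl` file often means the file was written by Python 2 and is being read by Python 3, whose `pickle` decodes Python 2 byte strings as ASCII by default. That this is what happened here is an assumption. The sketch below reproduces the effect with a hand-written Python 2 style pickle and shows the `encoding` argument of `pickle.loads`:

```python
import pickle

# A hand-written protocol-0 pickle of the Python 2 byte string '\xe4\xb8\xad'
# (the UTF-8 bytes of "中"), standing in for a .pkl file written by Python 2.
py2_pickle = b"S'\\xe4\\xb8\\xad'\n."

try:
    pickle.loads(py2_pickle)          # default encoding is ASCII
    default_failed = False
except UnicodeDecodeError:
    default_failed = True

as_text = pickle.loads(py2_pickle, encoding="utf-8")    # decode as UTF-8 text
as_bytes = pickle.loads(py2_pickle, encoding="bytes")   # keep the raw bytes
print(default_failed, as_text, as_bytes)
```

For a file, the same argument is accepted by `pickle.load(f, encoding=...)` with the file opened in `"rb"` mode. `encoding="latin1"` is the value the Python documentation requires for pickles containing NumPy arrays or `datetime` objects written by Python 2.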

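NER with the TensorFlow version

Has the author ever seen predictions where the per-character accuracy is very high but not a single named entity is extracted?

TensorFlow version issue, resolved

How can this problem be solved? Does anyone know?

What is this data processing method called?

Why are only lines containing named entities kept?

In ChineseNER/data/renMinRiBao/data_renmin_word.py, lines 85 to 88, why is linedata kept only when numNotO!=0?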

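Setting EMBEDDING_DIM and HIDDEN_DIM in the PyTorch BiLSTM_CRF model

I see the README says the PyTorch model is taken directly from the official PyTorch BiLSTM_CRF model, but the official model sets EMBEDDING_DIM to the length of tag_to_index, which is 5, and HIDDEN_DIM to 4.

A blog post I read explains these two parameters as follows: since there are five labels in total (B, I, O, START, STOP), EMBEDDING_DIM is 5; HIDDEN_DIM is 4, the number of hidden features of the BiLSTM, which is doubled because the LSTM is bidirectional (2 per direction).

I see that you set EMBEDDING_DIM to 100 and HIDDEN_DIM to 200.

I then ran both settings on the People's Daily dataset with EPOCH set to 30 in both cases, but the F1 scores differ a lot:
(official example) with EMBEDDING_DIM = len(tag_to_ix) and HIDDEN_DIM = 4, F1 is 55%~60%;
(your version) with EMBEDDING_DIM = 100 and HIDDEN_DIM = 200, F1 is about 80%.

Are there any rules that must be followed when setting EMBEDDING_DIM and HIDDEN_DIM (for example, that EMBEDDING_DIM must equal the number of labels)?

Hello, in eval.py there is a short line of code (tag='0' if tag=='O' else tag) that I do not quite understand. Could you explain it? If tag is being set to 0 anyway, why check whether it is 'O'?

with open(label_path, "w") as fw:
    line = []
    for sent_result in label_predict:
        for char, tag, tag_ in sent_result:
            tag = '0' if tag == 'O' else tag
            char = char.encode("utf-8")
            line.append("{} {} {}\n".format(char, tag, tag_))
        line.append("\n")
    fw.writelines(line)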

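Adding batching to the PyTorch version

Has anyone implemented batching on top of the official PyTorch BiLSTM_CRF model to speed it up?

Do lines 109~127 of tensorflow/utils.py duplicate the preceding lines?

Testing

How can the generated model be tested on new text so that the corresponding entities are displayed?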

About from compiler.ast import flatten in train2pkl.py

The error message is:
Traceback (most recent call last):
  File "train2pkl.py", line 94, in <module>
    from compiler.ast import flatten
ImportError: No module named 'compiler.ast'
From what I found online, this was removed in Python 3. How can I solve it?
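
The `compiler` package was removed in Python 3, so the import has to be replaced. A minimal stand-in, assuming the script only needs to flatten nested lists and tuples as the Python 2 helper did, could look like this:

```python
def flatten(seq):
    """Recursively flatten nested lists and tuples into one flat list."""
    flat = []
    for item in seq:
        if isinstance(item, (list, tuple)):
            flat.extend(flatten(item))   # descend into nested containers
        else:
            flat.append(item)            # strings and other items are kept whole
    return flat


nested = [["我", "们"], ["北", ["京"]], "市"]
print(flatten(nested))
```

Defining this function in the script in place of the `from compiler.ast import flatten` line should leave the remaining calls to `flatten` unchanged.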

What does this mean?

In ChineseNER/data/renMinRiBao/data_renmin_word.py, this line:

sentences = re.split('[，。！？、‘’“”:]/[O]'.decode('utf-8'), texts)

How should the /[O] in the regular expression [，。！？、‘’“”:]/[O] be understood?
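
Read literally, the pattern matches one punctuation character from the first class, then a literal `/`, then one character from the class `[O]`, which contains only `O`. It therefore splits the text at punctuation marks tagged as `O`. The sketch below is in Python 3, so the `.decode('utf-8')` call is dropped. The `char/tag` input format and the `B_ns`/`E_ns` tags are assumptions about what `texts` looks like at that point:

```python
import re

# Assumed input format: each character is followed by "/" and its tag.
texts = "今/O 天/O ，/O 北/B_ns 京/E_ns 。/O"

# [，。！？、‘’“”:] matches one punctuation character, "/" is a literal slash,
# and [O] is a one-character class equivalent to a literal "O".
pattern = "[，。！？、‘’“”:]/[O]"
sentences = re.split(pattern, texts)
print(sentences)

# Writing the tag without brackets gives the same split.
same = re.split("[，。！？、‘’“”:]/O", texts)
```

The trailing empty string in the result comes from the final `。/O` match at the end of the input.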