chinesener's People

Contributors

buppt

chinesener's Issues

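Hello

I have downloaded your project, but as a beginner I don't know where to start with it. Do you have any advice on the steps needed to get the project running?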

Error when training the model

Error output:
train len: 10721
test len: 3351
valid len 2681
Traceback (most recent call last):
File "train.py", line 141, in <module>
calculate(x_test,y_test,epoch)
File "train.py", line 63, in calculate
if j<len(y) and id2tag[y[j]][0]=='B':
IndexError: string index out of range
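
One possible reading of this traceback, offered as an assumption rather than a confirmed diagnosis: `id2tag[y[j]]` is an empty string for some id (for example a padding id), so indexing it with `[0]` fails. The sketch below uses a hypothetical `id2tag` mapping and hypothetical helper names; it only shows how `str.startswith` avoids the index error on empty tags.

```python
# Hypothetical mapping: id 0 is assumed to be padding and maps to an empty tag.
id2tag = {0: "", 1: "B_nr", 2: "M_nr", 3: "E_nr", 4: "O"}


def is_entity_start(tag_id, id2tag):
    """Return True when the tag for tag_id begins with 'B'.

    str.startswith is safe on an empty string, unlike tag[0].
    """
    tag = id2tag.get(tag_id, "")
    return tag.startswith("B")


def count_entity_starts(y, id2tag):
    """Count the positions in a tag-id sequence that open an entity."""
    return sum(is_entity_start(tag_id, id2tag) for tag_id in y)


n_starts = count_entity_starts([1, 2, 3, 0, 0, 4, 1], id2tag)
```

Whether the empty tag comes from padding or from the way the pkl file was built would still need to be checked against the data.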

Accuracy of the predictions

I reproduced the code in Python here:
https://gitee.com/chashaozgr/noteLibrary/tree/master/nlp_trial/ner/src/bilstm_crf

Data source: People's Daily
Environment: python3.6, tensorflow==1.12

The accuracy I measured matches the README, but the confusion matrix shows that the result is not as good as it looks. Because prediction runs on padded sequences, samples of class 0 (the padding) far outnumber every other class, so the labels are heavily imbalanced. The accuracy is therefore not credible: most of the 85%+ comes from class 0 being predicted as class 0. If the padding length is shortened, precision drops sharply.

Does anyone have ideas for dealing with this?
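
The effect described above can be illustrated with a toy example. This is a minimal sketch, assuming label id 0 marks padding; the function name and the data are made up for illustration and are not taken from the repository.

```python
def accuracy(y_true, y_pred, ignore_id=None):
    """Token accuracy; positions whose gold label equals ignore_id are skipped."""
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t != ignore_id]
    if not pairs:
        return 0.0
    return sum(t == p for t, p in pairs) / len(pairs)


# Toy sequence padded to length 10; id 0 is assumed to be the padding label.
y_true = [1, 2, 3, 4, 0, 0, 0, 0, 0, 0]
y_pred = [1, 4, 4, 4, 0, 0, 0, 0, 0, 0]

padded_acc = accuracy(y_true, y_pred)               # padding counted
masked_acc = accuracy(y_true, y_pred, ignore_id=0)  # padding excluded
```

Here the padded score is 0.8 while the masked score is 0.5, because six of the eight correct positions are padding. Reporting entity-level precision, recall and F1 over unpadded positions would be another way to sidestep the imbalance.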

Incomplete code

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    saver = tf.train.Saver()
    train(model, sess, saver, epochs, batch_size, data_train, data_test, id2word, id2tag)

Hello, I would like to ask about an encoding error at runtime. I have changed the code many times following methods found online, but it still fails.

Traceback (most recent call last):
File "train.py", line 87, in <module>
test_input(model,sess,word2id,id2tag,batch_size)
File "/mnt/ChineseNER-master/tensorflow/utils.py", line 169, in test_input
entity = get_entity(text,pre[0],id2tag)
File "/mnt/ChineseNER-master/tensorflow/utils.py", line 40, in get_entity
entity=id2tag[y[i][j]][1:]+':'+x[i][j]
File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 0: unexpected end of data
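
A possible cause, stated as an assumption: under Python 2.7 the input text is a UTF-8 byte string, so `x[i][j]` is a single byte of a multi-byte character and cannot be decoded on its own. The sketch below reproduces the effect in Python 3 with `bytes`; the sample text and the helper name are illustrative only.

```python
# "香" is encoded in UTF-8 as the three bytes E9 A6 99, so its first byte is 0xe9.
raw = "香港".encode("utf-8")


def decode_first_byte(data):
    """Try to decode only the first byte, as indexing a byte string would."""
    try:
        return data[:1].decode("utf-8")
    except UnicodeDecodeError as exc:
        return "error: " + exc.reason


text = raw.decode("utf-8")          # decode the whole string once, up front
by_byte = decode_first_byte(raw)    # fails: one byte is not a full character
by_char = text[0]                   # indexing the decoded text yields a character
```

Under this assumption, decoding the input to unicode before it reaches `get_entity` (rather than indexing the raw byte string) would avoid the error.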

MSRA corpus

Why is the MSRA test corpus unannotated?

Running the PyTorch version on a GPU

Hello, when I run the PyTorch version of the code on a GPU I get the following error. How can I fix it?
Traceback (most recent call last):
File "train.py", line 59, in <module>
loss = model.neg_log_likelihood(sentence, tags)
File "/home/sjwang/data111/ChineseNER-master/pytorch/BiLSTM_CRF.py", line 154, in neg_log_likelihood
feats = self._get_lstm_features(sentence)
File "/home/sjwang/data111/ChineseNER-master/pytorch/BiLSTM_CRF.py", line 94, in _get_lstm_features
lstm_out, self.hidden = self.lstm(embeds, self.hidden)
File "/home/sjwang/py/python3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/home/sjwang/py/python3/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 179, in forward
self.dropout, self.training, self.bidirectional, self.batch_first)
RuntimeError: Input and hidden tensors are not at the same device, found input tensor at cuda:0 and hidden tensor at cpu

train error

train len: 36064
test len: 5010
word2id len 4026
Creating the data generator ...
Finished creating the data generator.
begin to train...
Traceback (most recent call last):
File "train.py", line 108, in <module>
model = Model(config,embedding_pre,dropout_keep=0.5)
File "/home/homework/proj/tensorflow/ChineseNER-master/tensorflow/bilstm_crf.py", line 20, in __init__
self._build_net()
File "/home/homework/proj/tensorflow/ChineseNER-master/tensorflow/bilstm_crf.py", line 57, in _build_net
self.viterbi_sequence, viterbi_score = tf.contrib.crf.crf_decode(bilstm_out, self.transition_params,tf.tile(np.array([self.sen_len]),np.array([self.batch_size])))
File "/home/homework/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/crf/python/ops/crf.py", line 537, in crf_decode
false_fn=_multi_seq_fn)
File "/home/homework/anaconda2/lib/python2.7/site-packages/tensorflow/python/layers/utils.py", line 206, in smart_cond
pred, true_fn=true_fn, false_fn=false_fn, name=name)
File "/home/homework/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/smart_cond.py", line 56, in smart_cond
return false_fn()
File "/home/homework/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/crf/python/ops/crf.py", line 501, in _multi_seq_fn
sequence_length_less_one = math_ops.maximum(0, sequence_length - 1)
File "/home/homework/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/gen_math_ops.py", line 4602, in maximum
"Maximum", x=x, y=y, name=name)
File "/home/homework/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 546, in _apply_op_helper
inferred_from[input_arg.type_attr]))
TypeError: Input 'y' of 'Maximum' Op has type int64 that does not match type int32 of argument 'x'.

Googling did not turn up a solution either. Where should I add a dtype conversion?

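Bug in the get_entity() function in utils of the TensorFlow version

def get_entity(x,y,id2tag):
    entity=""
    res=[]
    for i in range(len(x)): #for every sen
        for j in range(len(x[0])): #for every word
            ...
            ...
The first of these two for loops is fine, because the batch size is fixed. The second one is a problem: len(x[0]) may be greater than 60 and nothing truncates it here, so j can reach 60 or more and y[i][j] goes out of bounds. I hope the author can change this by adding a check:
    for i in range(len(x)): #for every sen
        if len(x[0]) > 60:
            num = 60
        else:
            num = len(x[0])
        for j in range(num): #for every word

This caps the inner loop at 60, so y[i][j] can no longer go out of bounds.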
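
A variant of the same idea, sketched under the assumption that `x` is a list of token sequences and `y` a list of tag-id sequences: instead of hard-coding 60, bound each loop by the shorter of the two sequences. The helper name `iter_aligned` is hypothetical and not part of the repository.

```python
def iter_aligned(x, y):
    """Yield (i, j, token, tag_id) only for positions present in both x[i] and y[i]."""
    for i in range(min(len(x), len(y))):
        for j in range(min(len(x[i]), len(y[i]))):
            yield i, j, x[i][j], y[i][j]


# The input is longer than the prediction, as when the text exceeds the padded length.
x = [list("abcde")]
y = [[1, 2, 3]]
positions = list(iter_aligned(x, y))
```

This keeps the loop in bounds even if the maximum sentence length is later changed from 60 to another value.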

Error

An error occurs when training on Bosondata.pkl with TensorFlow.
File "E:/ChineseNER-master/tensorflow/train.py", line 121, in <module>
entityall = calculate(x_batch,y_batch,id2word,id2tag,entityall)
File "E:\ChineseNER-master\tensorflow\resultCal.py", line 9, in calculate
if id2tag[y[i][j]][0]=='B':
IndexError: string index out of range

tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[0,0] = 4026 is not in [0, 4026)

Hello, when the input text contains English during testing, the following error occurs. How can I solve it? Thank you very much.
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[0,0] = 4026 is not in [0, 4026)
[[{{node bilstm_crf/embedding_lookup}} = GatherV2[Taxis=DT_INT32, Tindices=DT_INT32, Tparams=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](bilstm_crf/Assign, _arg_input_data/input_data_0_0, bilstm_crf/embedding_lookup/axis)]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
Caused by op 'bilstm_crf/embedding_lookup', defined at:
File "/ddhome/usr/PythonProjects/NLP/ChineseNER/tf_version/train.py", line 94, in <module>
model = Model(config, embedding_pre, dropout_keep=1)
File "/ddhome/usr/PythonProjects/NLP/ChineseNER/tf_version/bilstm_crf.py", line 32, in __init__
self._build_net()
File "/ddhome/usr/PythonProjects/NLP/ChineseNER/tf_version/bilstm_crf.py", line 40, in _build_net
input_embedded = tf.nn.embedding_lookup(word_embeddings, self.input_data)
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/ops/embedding_ops.py", line 313, in embedding_lookup
transform_fn=None)
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/ops/embedding_ops.py", line 133, in _embedding_lookup_and_transform
result = _clip(array_ops.gather(params[0], ids, name=name),
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py", line 2675, in gather
return gen_array_ops.gather_v2(params, indices, axis, name=name)
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 3332, in gather_v2
"GatherV2", params=params, indices=indices, axis=axis, name=name)
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
op_def=op_def)
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): indices[0,0] = 4026 is not in [0, 4026)
[[node bilstm_crf/embedding_lookup (defined at /ddhome/usr/PythonProjects/NLP/ChineseNER/tf_version/bilstm_crf.py:40) = GatherV2[Taxis=DT_INT32, Tindices=DT_INT32, Tparams=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](bilstm_crf/Assign, _arg_input_data/input_data_0_0, bilstm_crf/embedding_lookup/axis)]]
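
The message says that index 4026 was looked up in an embedding table whose valid range is 0 to 4025. A plausible explanation, which is an assumption here, is that characters missing from `word2id` (such as English letters) receive an id equal to the vocabulary size. The sketch below uses a hypothetical toy vocabulary and a hypothetical `<unk>` entry to show one way of keeping every id inside the table.

```python
def encode(text, word2id, unk_id):
    """Map each character to its id; characters missing from word2id get unk_id."""
    return [word2id.get(ch, unk_id) for ch in text]


# Hypothetical toy vocabulary; id 0 is assumed to be reserved for unknown characters.
word2id = {"<unk>": 0, "北": 1, "京": 2}
vocab_size = len(word2id)

ids = encode("北京ABC", word2id, unk_id=word2id["<unk>"])
in_range = all(0 <= i < vocab_size for i in ids)
```

Whether the repository's vocabulary already has an unknown-character entry would need to be checked; if it does, that id can be passed as `unk_id`.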

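Adding a new label for entity recognition

Hello, I would like to ask: if we want to add stock news data to the training set and add a stock-name label (a new label besides existing ones such as loc, name and org), how should we go about it? Many thanks!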

“This article”

Hello, the “this article” link no longer opens. Could you post the link again, or give the title of the article?

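Hello

I downloaded the dataset recently released by Tencent, which is more than ten GB. How can I add it to your project to make training more accurate?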

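Test code for the PyTorch model

Does anyone have the code for testing the PyTorch version of the model? I have run train.py and saved the model, but as a beginner I don't know how to test it.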

pytorch pkl encoding error

'ascii' codec can't decode byte 0x80 in position 1016: ordinal not in range(128)

What causes this? Is there a fix?
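
One common cause of this message, assumed here rather than confirmed for this repository, is loading a pickle written by Python 2 from Python 3, where byte strings are decoded as ASCII by default. The sketch uses a hand-written protocol-2 pickle as a stand-in for such a file and passes the `encoding` argument that `pickle.loads` and `pickle.load` accept in Python 3.

```python
import pickle

# Hand-written protocol-2 pickle of the Python 2 byte string '\xe9\xa6\x99'
# (the UTF-8 bytes of "香"); it stands in for a .pkl file written by Python 2.
PY2_PICKLE = b"\x80\x02U\x03\xe9\xa6\x99q\x00."


def load_py2_pickle(data, encoding="utf-8"):
    """Unpickle Python 2 data, decoding its byte strings with the given encoding."""
    return pickle.loads(data, encoding=encoding)


# The default encoding is ASCII, which rejects the non-ASCII bytes.
try:
    pickle.loads(PY2_PICKLE)
    default_ok = True
except UnicodeDecodeError:
    default_ok = False

decoded = load_py2_pickle(PY2_PICKLE)
```

For a file on disk the equivalent call would be `pickle.load(f, encoding="utf-8")` with `f` opened in `"rb"` mode. Pickles that contain raw binary data may need `encoding="latin1"` or `encoding="bytes"` instead.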

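NER with the TensorFlow version

Has the author ever seen predictions where the per-character accuracy is very high, yet not a single named entity is extracted?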

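Setting EMBEDDING_DIM and HIDDEN_DIM in the PyTorch BiLSTM_CRF model

I see in the README that the PyTorch version directly uses the BiLSTM_CRF model officially provided by PyTorch. In the official model, however, EMBEDDING_DIM is set to the length of tag_to_index, i.e. 5, and HIDDEN_DIM is 4.

A blog post I read explains the two parameters this way: there are 5 tags in total (B, I, O, START, STOP), so EMBEDDING_DIM is 5; HIDDEN_DIM is 4, the number of hidden features of the BiLSTM, which is doubled because the LSTM is bidirectional (2 per direction).

I see that you set EMBEDDING_DIM to 100 and HIDDEN_DIM to 200.

I then ran both parameter sets on the People's Daily dataset, with EPOCH set to 30 in both cases, and the F1 scores differ a lot:
(official example) EMBEDDING_DIM = len(tag_to_ix), HIDDEN_DIM = 4: F1 is 55%~60%;
(your version) EMBEDDING_DIM = 100, HIDDEN_DIM = 200: F1 is about 80%.

Is there any rule that must be followed when setting EMBEDDING_DIM and HIDDEN_DIM (for example, that EMBEDDING_DIM must equal the total number of tags)?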

Testing

How do I test the generated model on new text and display the recognised entities?

Problem with from compiler.ast import flatten in train2pkl.py

The error message is:
Traceback (most recent call last):
File "train2pkl.py", line 94, in <module>
from compiler.ast import flatten
ImportError: No module named 'compiler.ast'
According to what I found online, this module was removed in Python 3. How can I solve this?
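
The `compiler` package does not exist in Python 3, so one option is to define a small replacement locally. This is a minimal sketch that mirrors the documented behaviour of the old helper (recursively flattening nested lists and tuples); it is not the repository's own fix.

```python
def flatten(items):
    """Recursively flatten nested lists and tuples into one flat list."""
    result = []
    for item in items:
        if isinstance(item, (list, tuple)):
            result.extend(flatten(item))
        else:
            # Strings and other non-list values are kept as single elements.
            result.append(item)
    return result


flat = flatten([1, [2, (3, [4])], "ab"])
```

Replacing the `from compiler.ast import flatten` line with a definition like this should let the rest of the script call `flatten` unchanged, assuming it is only used on nested lists and tuples.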

What does this mean?

In ChineseNER/data/renMinRiBao/data_renmin_word.py, this line:

sentences = re.split('[,。!?、‘’“”:]/[O]'.decode('utf-8'), texts)

How should the /[O] in the regular expression [,。!?、‘’“”:]/[O] be understood?
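
Read literally, the pattern matches one punctuation character from the bracketed set, followed by a slash and the letter O (`[O]` is a character class containing only `O`). In other words it matches a punctuation mark together with its `/O` tag, so the text is split at punctuation that is tagged as outside any entity. The sketch below assumes a `character/TAG` layout separated by spaces; the sample line is invented for illustration.

```python
import re

# Pattern copied from the issue: a punctuation character followed by its "/O" tag.
PATTERN = re.compile('[,。!?、‘’“”:]/[O]')


def split_sentences(tagged_text):
    """Split tagged text at punctuation tagged O, dropping empty pieces."""
    return [part.strip() for part in PATTERN.split(tagged_text) if part.strip()]


sample = "我/O 们/O ,/O 走/O 。/O"
parts = split_sentences(sample)
```

In Python 3 the `.decode('utf-8')` call from the original line is not needed, because string literals are already unicode.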

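Pretrained vectors are not loaded in train.py; everything loaded is 111111111111

In the TensorFlow version, when train.py loads the pretrained vectors, the original code iterates over word2id and ends up with the id rather than the character. The vectors in word2vec are keyed by character, so word2id should be changed to id2word. Without the change the code still runs, but embedding_pre is all 1s instead of the character vectors from vec.txt. I suggest the author fix this pitfall; it is very hard to spot.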
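
The lookup described above can be sketched in plain Python. All names here (`build_embedding`, the toy `id2word` and `word2vec` dictionaries) are hypothetical, and the default of all ones for missing characters is an assumption based on the symptom reported in this issue.

```python
def build_embedding(id2word, word2vec, dim, default=1.0):
    """Build one row per id, looking vectors up by character, not by id."""
    rows = []
    for idx in sorted(id2word):
        ch = id2word[idx]
        # Characters without a pretrained vector fall back to a constant row.
        rows.append(list(word2vec.get(ch, [default] * dim)))
    return rows


id2word = {0: "北", 1: "京", 2: "X"}
word2vec = {"北": [0.1, 0.2], "京": [0.3, 0.4]}

embedding_pre = build_embedding(id2word, word2vec, dim=2)

# Looking up by id instead of by character never hits, so every row is the default.
wrong = [list(word2vec.get(idx, [1.0, 1.0])) for idx in sorted(id2word)]
```

Counting how many rows differ from the default after loading is a quick way to confirm that the pretrained vectors were actually used.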

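Slow response time under concurrency

I used this model in a project and built a web service on the Flask framework. Under concurrency (for example, 100 simultaneous inputs), testing with JMeter shows slow responses (model loading time excluded; only the entity extraction part is measured). Throughput peaks at only 12.2/sec and the response time is about 8050 ms.
Has anyone done optimisation in this area?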

Could we add new words?

E.g. if a word (北大) is not recognized as an organisation, could we add this word to let the model know this word?

Switching to the Boson dataset pkl

File "E:\appsnew\codes\gitspace\ChineseNER\tensorflowVersion\utils.py", line 12, in calculate
if id2tag[y[i][j]][0]=='B':
IndexError: string index out of range
How can this be solved?

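Prediction question

When predicting with PyTorch, after the tag of each token has been computed, do I simply look for tags that can be combined into an entity?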
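
One way to combine per-token tags into entities is a single left-to-right pass. The sketch assumes tags of the form `B_xx`, `M_xx`, `E_xx` and `O`, which is a guess based on the `[0]=='B'` and `[1:]` indexing seen in other issues here; single-character entities and other schemes are not handled. The function name is hypothetical.

```python
def extract_entities(chars, tags):
    """Collect (text, label) pairs from B_/M_/E_ style tags."""
    entities = []
    buf, label = [], None
    for ch, tag in zip(chars, tags):
        prefix, _, suffix = tag.partition("_")
        if prefix == "B":
            # A new entity starts; any unfinished one is discarded.
            buf, label = [ch], suffix
        elif prefix == "M" and buf and suffix == label:
            buf.append(ch)
        elif prefix == "E" and buf and suffix == label:
            buf.append(ch)
            entities.append(("".join(buf), label))
            buf, label = [], None
        else:
            # O tags and inconsistent sequences reset the buffer.
            buf, label = [], None
    return entities


chars = list("我在北京大学")
tags = ["O", "O", "B_nt", "M_nt", "M_nt", "E_nt"]
found = extract_entities(chars, tags)
```

Discarding incomplete spans is a design choice; a more lenient variant could emit whatever has been buffered when an unexpected tag appears.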
