
bert-bilstm-crf-ner-pytorch's Introduction

Hi there 👋

🎉 Welcome to my GitHub profile!

bert-bilstm-crf-ner-pytorch's People

Contributors

hertz-pj


bert-bilstm-crf-ner-pytorch's Issues

Hello, could you tell me how to run predict?

I have trained the model and was planning to use the saved model to predict NER tags.

Here is my code. I set max_seq_length = 128 when training the model.

Loading the config, tokenizer, and model works fine, but I still get the error shown below:

need_birnn : True
rnn_dim: 128
max_seq_length: 128

input_text='創科局常任秘書長蔡淑嫻今早終於港台節目訪問,公開晶苑(2232)處理當中較大規模口罩生產,廠房包括荃灣南豐紗廠、土瓜灣聯業製衣,及晶苑於越南的廠房。'

textlist = list(input_text)
tokens = [piece for word in textlist for piece in tokenizer.tokenize(word)]  # tokenize each character and flatten the pieces

if len(tokens) >= max_seq_length - 1:
    tokens = tokens[0:(max_seq_length - 2)]  # -2 because the sequence needs a leading [CLS] and a trailing [SEP]

ntokens = ["[CLS]"] + tokens + ["[SEP]"]

input_ids = tokenizer.convert_tokens_to_ids(ntokens)
segment_ids = [0] * len(input_ids)
input_mask = [1] * len(input_ids)

while len(input_ids) < max_seq_length:
    input_ids.append(0)
    segment_ids.append(0)
    input_mask.append(0)
  
assert len(input_ids) == max_seq_length
assert len(segment_ids) == max_seq_length
assert len(input_mask) == max_seq_length

    
input_ids = torch.tensor(input_ids, dtype=torch.long)
segment_ids = torch.tensor(segment_ids, dtype=torch.long)
input_mask = torch.tensor(input_mask, dtype=torch.long)


input_ids = input_ids.to(device)
segment_ids = segment_ids.to(device)
input_mask = input_mask.to(device)

print(input_ids.shape, input_mask.shape)

torch.Size([128]) torch.Size([128])

pred_labels = []
with torch.no_grad():
    logits = model.predict(input_ids, segment_ids, input_mask)

Traceback (most recent call last):
  File "/home/ubuntu/PeijiYang/predict.py", line 155, in <module>
    logits = model.predict(input_ids, segment_ids, input_mask)
  File "/home/ubuntu/PeijiYang/models.py", line 53, in predict
    emissions = self.tag_outputs(input_ids, token_type_ids, input_mask)
  File "/home/ubuntu/PeijiYang/models.py", line 40, in tag_outputs
    outputs = self.bert(input_ids, token_type_ids=token_type_ids, attention_mask=input_mask)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/transformers/modeling_bert.py", line 725, in forward
    input_shape, attention_mask.shape
ValueError: Wrong shape for input_ids (shape torch.Size([128])) or attention_mask (shape torch.Size([128]))
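
A likely cause, judging from the traceback rather than from the author's code: the tensors passed to model.predict are one-dimensional (shape [128]), while BertModel expects a batch dimension, i.e. shape [1, 128]. A minimal sketch of the fix, reusing the variables defined above:

# Add a batch dimension of size 1 before calling the model.
input_ids = input_ids.unsqueeze(0)      # [128] -> [1, 128]
segment_ids = segment_ids.unsqueeze(0)
input_mask = input_mask.unsqueeze(0)

with torch.no_grad():
    logits = model.predict(input_ids, segment_ids, input_mask)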

Hello, there is a bug when using my own data

I ran into a bug: during training, the validation accuracy is about 80%, but after saving the model and evaluating the same data with the reloaded model, the accuracy is extremely low. If I then train the saved model for one more epoch and evaluate again, the accuracy goes back up. Very strange.
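
Without seeing the saving/loading code it is hard to say for sure, but one common cause of this pattern is evaluating a reloaded model that is still in training mode (dropout active), or reloading only part of the weights. A minimal sketch, assuming model is the trained BERT_BiLSTM_CRF instance and the file name is illustrative:

import torch

# after training: save the full parameter state
torch.save(model.state_dict(), "ner_model.pt")        # hypothetical file name

# before evaluating the reloaded model
model.load_state_dict(torch.load("ner_model.pt", map_location=device))
model.to(device)
model.eval()   # switch off dropout in BERT and the BiLSTM before computing metrics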

Can't load my pretrained BERT.

Hello.

I want to run your code using another pretrained BERT from https://huggingface.co/indobenchmark/indobert-large-p2. In ner.py, I have written the pretrained model path like this:

model = BERT_BiLSTM_CRF.from_pretrained("/indobenchmark/indobert-large-p2", config=config,
                                        need_birnn=args.need_birnn, rnn_dim=args.rnn_dim)

But I got the error shown in the attached screenshot.

Can you suggest how to solve this issue?

Thanks in advance.
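
One thing worth checking (my guess, not confirmed): the path "/indobenchmark/indobert-large-p2" starts with a slash, so transformers treats it as an absolute local directory rather than a Hugging Face hub identifier. A minimal sketch of the two working variants, assuming config is built the same way as for the default model:

# hub identifier (no leading slash) -- transformers downloads the files
model = BERT_BiLSTM_CRF.from_pretrained("indobenchmark/indobert-large-p2", config=config,
                                        need_birnn=args.need_birnn, rnn_dim=args.rnn_dim)

# or a local directory that actually contains config.json / pytorch_model.bin / vocab.txt
model = BERT_BiLSTM_CRF.from_pretrained("./indobert-large-p2", config=config,
                                        need_birnn=args.need_birnn, rnn_dim=args.rnn_dim)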

Tokenizer issue when using the English CoNLL-2003 dataset

Hello, and thanks for replying under my earlier question "What should I do when WordPiece splitting occurs?".
My description there was not clear; the problem I am running into is this:
The dataset is CoNLL-2003 and the BERT model is bert-base-cased. At runtime the following error appears:

File "D:\python-workspace\BERT-BiLSTM-CRF-NER-pytorch-master\utils.py", line 162, in convert_examples_to_features
assert len(ori_tokens) == len(ntokens), f"{len(ori_tokens)}, {len(ntokens)}, {ori_tokens}, {ntokens}"
AssertionError: 3, 8, ['[CLS]', '-DOCSTART-', '[SEP]'], ['[CLS]', '-', 'do', '##cs', '##tar', '##t', '-', '[SEP]']

As you can see, the tokenizer splits the word into sub-tokens, so the assertion assert len(ori_tokens) == len(ntokens) fails. How can this be solved? Thank you.
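
A minimal sketch of the usual workaround (my suggestion, not the repository's own code): keep the labels aligned with the WordPiece output by giving only the first sub-token of each word its original label and marking continuation pieces with a placeholder label such as "X" that is skipped during evaluation. Lines like -DOCSTART- are CoNLL-2003 document separators and are normally filtered out before conversion to features.

# Illustrative helper: `words` and `labels` are one sentence from the CoNLL file,
# `tokenizer` is the bert-base-cased tokenizer.
def align_wordpieces(words, labels, tokenizer, pad_label="X"):
    tokens, aligned = [], []
    for word, label in zip(words, labels):
        pieces = tokenizer.tokenize(word) or ["[UNK]"]   # guard against empty output
        tokens.extend(pieces)
        aligned.extend([label] + [pad_label] * (len(pieces) - 1))
    return tokens, aligned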

Hello, there is a problem when testing with my own dataset

After training the model, I ran the script again with only do_test set to true, and train and eval set to false.
The test dataset contains data, but none of it is read; the log shows:
02/28/2022 12:44:43 - INFO - main - Num examples = 0
How can this problem be solved?
Thank you!!

Error when running on multiple GPUs

extended_attention_mask = extended_attention_mask.to(dtype=next(self.parameters()).dtype) # fp16 compatibility
It gets stuck on this statement; it looks like a dimension mismatch. How should I fix it?

Hello, a question about using a downloaded pretrained model

I downloaded a bert-base-chinese model consisting of config.json, pytorch_model.bin, and vocab.txt, and put these three files in a bert-base-chinese folder. But when I run the code it reports that the model weights could not be initialized and starts downloading the model again, which is very slow. How should a locally downloaded pretrained model be used?
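
A minimal sketch, assuming the folder ./bert-base-chinese (containing config.json, pytorch_model.bin, and vocab.txt) sits next to the training script: passing the folder path instead of the model name makes transformers read the local files and skip the download.

from transformers import BertConfig, BertTokenizer

local_dir = "./bert-base-chinese"                      # assumed local path
tokenizer = BertTokenizer.from_pretrained(local_dir)
config = BertConfig.from_pretrained(local_dir, num_labels=num_labels)   # num_labels assumed to be defined
model = BERT_BiLSTM_CRF.from_pretrained(local_dir, config=config,
                                        need_birnn=args.need_birnn, rnn_dim=args.rnn_dim)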

ori_labels is missing one line in the predict output

First of all, thank you for your code!

After replacing the input and training data with my own data in the format you provide, the test labels in the final prediction file are shifted (as in the screenshot, '深' should be B-ORG; the column shown is the original label from the test set). Looking at the source code, the original labels in the output file (all_ori_labels) come from the first return value of utils.get_Dataset, namely examples (from get_examples), which is a plain read of the BIO-format file (left column as token, right column as label).

My guess is that this read does not add the sentence-boundary markers to each sentence in all_ori_labels, so ori_labels has two fewer labels at the beginning and end than ori_tokens and the predicted labels; when labels are skipped while writing the output file, the label of the first character in all_ori_labels gets swallowed. However, training does not use get_examples but the TensorDataset() that does include these markers, so this should not affect anything else =)

Perhaps you can tell me whether I mistakenly commented out some part of the code, or whether some other detail I overlooked is causing the error.
(screenshot of the shifted labels)
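
If my reading of the report above is right, a tiny sketch of the kind of fix it implies (assumed, not the author's code) would be to pad each sentence's plain BIO labels with the same boundary markers that ori_tokens and the predictions already contain, so all three sequences have equal length when the output file is written:

# pad the labels read by get_examples so they line up with ntokens
ori_labels = ["[CLS]"] + ori_labels + ["[SEP]"]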

Request for the dataset

I would really like to know which dataset from CLUE you used. Or could you send a copy to [email protected]?
I am a complete beginner; I dug around CLUE for a long time without finding a suitable dataset, so I still cannot get the code to run.
Thank you very much!

AttributeError: 'DataParallel' object has no attribute 'predict'

Training on a single GPU raises an error:
RuntimeError: CUDA out of memory.

So I switched to multi-GPU training, but line 67 of ner.py then raises
AttributeError: 'DataParallel' object has no attribute 'predict'

I am a beginner; could you point me in the right direction? Thank you very much!
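
A minimal sketch of the usual fix (assumed, not taken from ner.py): torch.nn.DataParallel only forwards forward(), so custom methods such as predict() have to be called on the wrapped model through .module:

import torch

if isinstance(model, torch.nn.DataParallel):
    logits = model.module.predict(input_ids, segment_ids, input_mask)
else:
    logits = model.predict(input_ids, segment_ids, input_mask)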

How to train with my own dataset?

Hello, I would like to train with my own dataset, but its labels differ from those in CLUENER. How should the code be modified? Thanks.

Metric computation is wrong

Testing with both the provided official dataset and my own dataset gives precision, recall, and FB1 all equal to 0. I do not know why.

How to run prediction on the test set

After processing with your dataset-processing code, the labels of the test set are all 'O'; after loading the model and running prediction, only the number of predictions is returned, and metrics such as accuracy are not shown.

Advice for tasks with a large number of labels

If my named-entity recognition task has several thousand labels and each label does not occur very often (some may appear only a few times), what should be changed compared to this project? For example, should the learning rate be reduced, and by how many orders of magnitude? Should the BiRNN be used? Thanks!

Wrong prediction results on the test set?

It looks like the test data is also split by sentence length? Are blank lines inserted between long sentences to separate the inputs? Why is the character right before each blank line dropped from the prediction output? As a result, the test file has 52,316 lines while the prediction output has 50,971 lines.
