
bert-bilstm-crf-ner-pytorch's Introduction

Hi there 👋

🎉 Welcome to my GitHub profile!

bert-bilstm-crf-ner-pytorch's People

Contributors

hertz-pj


bert-bilstm-crf-ner-pytorch's Issues

Hello, could you tell me how to run predict?

I have trained the model and was planning to use the saved model to predict NER tags.

Here is my code. I set max_seq_length = 128 when training the model.

Loading the config, tokenizer, and model works fine, but I still get the error shown below:

need_birnn : True
rnn_dim: 128
max_seq_length: 128

input_text='創科局常任秘書長蔡淑嫻今早終於港台節目訪問,公開晶苑(2232)處理當中較大規模口罩生產,廠房包括荃灣南豐紗廠、土瓜灣聯業製衣,及晶苑於越南的廠房。'

textlist = list(input_text)
tokens = [piece for word in textlist for piece in tokenizer.tokenize(word)]  # tokenize each character and flatten the pieces

if len(tokens) >= max_seq_length - 1:
    tokens = tokens[0:(max_seq_length - 2)]  # -2 because the sequence needs a leading [CLS] and a trailing [SEP]

ntokens = ["[CLS]"] + tokens + ["[SEP]"]

input_ids = tokenizer.convert_tokens_to_ids(ntokens)
segment_ids = [0] * len(input_ids)
input_mask = [1] * len(input_ids)

while len(input_ids) < max_seq_length:
    input_ids.append(0)
    segment_ids.append(0)
    input_mask.append(0)
  
assert len(input_ids) == max_seq_length
assert len(segment_ids) == max_seq_length
assert len(input_mask) == max_seq_length

    
input_ids = torch.tensor(input_ids, dtype=torch.long)
segment_ids = torch.tensor(segment_ids, dtype=torch.long)
input_mask = torch.tensor(input_mask, dtype=torch.long)


input_ids = input_ids.to(device)
segment_ids = segment_ids.to(device)
input_mask = input_mask.to(device)

print(input_ids.shape, input_mask.shape)

torch.Size([128]) torch.Size([128])

pred_labels = []
with torch.no_grad():
    logits = model.predict(input_ids, segment_ids, input_mask)

Traceback (most recent call last):
  File "/home/ubuntu/PeijiYang/predict.py", line 155, in <module>
    logits = model.predict(input_ids, segment_ids, input_mask)
  File "/home/ubuntu/PeijiYang/models.py", line 53, in predict
    emissions = self.tag_outputs(input_ids, token_type_ids, input_mask)
  File "/home/ubuntu/PeijiYang/models.py", line 40, in tag_outputs
    outputs = self.bert(input_ids, token_type_ids=token_type_ids, attention_mask=input_mask)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/transformers/modeling_bert.py", line 725, in forward
    input_shape, attention_mask.shape
ValueError: Wrong shape for input_ids (shape torch.Size([128])) or attention_mask (shape torch.Size([128]))
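
A likely cause, judging from the traceback rather than from the author's code: the tensors passed to model.predict are one-dimensional (shape [128]), while BertModel expects a batch dimension, i.e. shape [1, 128]. A minimal sketch of the fix, reusing the variables defined above:

# Add a batch dimension of size 1 before calling the model.
input_ids = input_ids.unsqueeze(0)      # [128] -> [1, 128]
segment_ids = segment_ids.unsqueeze(0)
input_mask = input_mask.unsqueeze(0)

with torch.no_grad():
    logits = model.predict(input_ids, segment_ids, input_mask)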

Hello, there is a bug when using my own data

I ran into a bug: during training, the validation accuracy is about 80%, but after saving the model and evaluating the same data with the reloaded model, the accuracy is extremely low. If I then train the saved model for one more epoch and evaluate again, the accuracy goes back up. Very strange.
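
Without seeing the saving/loading code it is hard to say for sure, but one common cause of this pattern is evaluating a reloaded model that is still in training mode (dropout active), or reloading only part of the weights. A minimal sketch, assuming model is the trained BERT_BiLSTM_CRF instance and the file name is illustrative:

import torch

# after training: save the full parameter state
torch.save(model.state_dict(), "ner_model.pt")        # hypothetical file name

# before evaluating the reloaded model
model.load_state_dict(torch.load("ner_model.pt", map_location=device))
model.to(device)
model.eval()   # switch off dropout in BERT and the BiLSTM before computing metrics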

Can't load my pretrained BERT.

Hello.

I want to run your code using another pretrained BERT from https://huggingface.co/indobenchmark/indobert-large-p2. In ner.py, I have written the pretrained model path like this:

model = BERT_BiLSTM_CRF.from_pretrained("/indobenchmark/indobert-large-p2", config=config,
                                        need_birnn=args.need_birnn, rnn_dim=args.rnn_dim)

But I got the error shown in the attached screenshot.

Can you suggest how to solve this issue?

Thanks in advance.
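
One thing worth checking (my guess, not confirmed): the path "/indobenchmark/indobert-large-p2" starts with a slash, so transformers treats it as an absolute local directory rather than a Hugging Face hub identifier. A minimal sketch of the two working variants, assuming config is built the same way as for the default model:

# hub identifier (no leading slash) -- transformers downloads the files
model = BERT_BiLSTM_CRF.from_pretrained("indobenchmark/indobert-large-p2", config=config,
                                        need_birnn=args.need_birnn, rnn_dim=args.rnn_dim)

# or a local directory that actually contains config.json / pytorch_model.bin / vocab.txt
model = BERT_BiLSTM_CRF.from_pretrained("./indobert-large-p2", config=config,
                                        need_birnn=args.need_birnn, rnn_dim=args.rnn_dim)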

Tokenizer issue when using the English CoNLL-2003 dataset

Hello, and thanks for replying under my earlier question "What should I do when WordPiece splitting occurs?".
My description there was not clear; the problem I am running into is this:
The dataset is CoNLL-2003 and the BERT model is bert-base-cased. At runtime the following error appears:

File "D:\python-workspace\BERT-BiLSTM-CRF-NER-pytorch-master\utils.py", line 162, in convert_examples_to_features
assert len(ori_tokens) == len(ntokens), f"{len(ori_tokens)}, {len(ntokens)}, {ori_tokens}, {ntokens}"
AssertionError: 3, 8, ['[CLS]', '-DOCSTART-', '[SEP]'], ['[CLS]', '-', 'do', '##cs', '##tar', '##t', '-', '[SEP]']

As you can see, the tokenizer splits the word into sub-tokens, so the assertion assert len(ori_tokens) == len(ntokens) fails. How can this be solved? Thank you.
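
A minimal sketch of the usual workaround (my suggestion, not the repository's own code): keep the labels aligned with the WordPiece output by giving only the first sub-token of each word its original label and marking continuation pieces with a placeholder label such as "X" that is skipped during evaluation. Lines like -DOCSTART- are CoNLL-2003 document separators and are normally filtered out before conversion to features.

# Illustrative helper: `words` and `labels` are one sentence from the CoNLL file,
# `tokenizer` is the bert-base-cased tokenizer.
def align_wordpieces(words, labels, tokenizer, pad_label="X"):
    tokens, aligned = [], []
    for word, label in zip(words, labels):
        pieces = tokenizer.tokenize(word) or ["[UNK]"]   # guard against empty output
        tokens.extend(pieces)
        aligned.extend([label] + [pad_label] * (len(pieces) - 1))
    return tokens, aligned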

Hello, there is a problem when testing with my own dataset

After training the model, I ran the script again with only do_test set to true, and train and eval set to false.
The test dataset contains data, but none of it is read; the log shows:
02/28/2022 12:44:43 - INFO - main - Num examples = 0
How can this problem be solved?
Thank you!!

Error when running on multiple GPUs

extended_attention_mask = extended_attention_mask.to(dtype=next(self.parameters()).dtype) # fp16 compatibility
It gets stuck on this statement; it looks like a dimension mismatch. How should I fix it?

Hello, a question about using a downloaded pretrained model

I downloaded a bert-base-chinese model consisting of config.json, pytorch_model.bin, and vocab.txt, and put these three files in a bert-base-chinese folder. But when I run the code it reports that the model weights could not be initialized and starts downloading the model again, which is very slow. How should a locally downloaded pretrained model be used?
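
A minimal sketch, assuming the folder ./bert-base-chinese (containing config.json, pytorch_model.bin, and vocab.txt) sits next to the training script: passing the folder path instead of the model name makes transformers read the local files and skip the download.

from transformers import BertConfig, BertTokenizer

local_dir = "./bert-base-chinese"                      # assumed local path
tokenizer = BertTokenizer.from_pretrained(local_dir)
config = BertConfig.from_pretrained(local_dir, num_labels=num_labels)   # num_labels assumed to be defined
model = BERT_BiLSTM_CRF.from_pretrained(local_dir, config=config,
                                        need_birnn=args.need_birnn, rnn_dim=args.rnn_dim)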

ori_labels is missing one line in the predict output

First of all, thank you for your code!

After replacing the input and training data with my own data in the format you provide, the test labels in the final prediction file are shifted (as in the screenshot, '深' should be B-ORG; the column shown is the original label from the test set). Looking at the source code, the original labels in the output file (all_ori_labels) come from the first return value of utils.get_Dataset, namely examples (from get_examples), which is a plain read of the BIO-format file (left column as token, right column as label).

My guess is that this read does not add the sentence-boundary markers to each sentence in all_ori_labels, so ori_labels has two fewer labels at the beginning and end than ori_tokens and the predicted labels; when labels are skipped while writing the output file, the label of the first character in all_ori_labels gets swallowed. However, training does not use get_examples but the TensorDataset() that does include these markers, so this should not affect anything else =)

Perhaps you can tell me whether I mistakenly commented out some part of the code, or whether some other detail I overlooked is causing the error.
(screenshot of the shifted labels)
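
If my reading of the report above is right, a tiny sketch of the kind of fix it implies (assumed, not the author's code) would be to pad each sentence's plain BIO labels with the same boundary markers that ori_tokens and the predictions already contain, so all three sequences have equal length when the output file is written:

# pad the labels read by get_examples so they line up with ntokens
ori_labels = ["[CLS]"] + ori_labels + ["[SEP]"]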

Request for the dataset

I would really like to know which dataset from CLUE you used. Or could you send a copy to [email protected]?
I am a complete beginner; I dug around CLUE for a long time without finding a suitable dataset, so I still cannot get the code to run.
Thank you very much!

AttributeError: 'DataParallel' object has no attribute 'predict'

Training on a single GPU raises an error:
RuntimeError: CUDA out of memory.

So I switched to multi-GPU training, but line 67 of ner.py then raises
AttributeError: 'DataParallel' object has no attribute 'predict'

I am a beginner; could you point me in the right direction? Thank you very much!
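
A minimal sketch of the usual fix (assumed, not taken from ner.py): torch.nn.DataParallel only forwards forward(), so custom methods such as predict() have to be called on the wrapped model through .module:

import torch

if isinstance(model, torch.nn.DataParallel):
    logits = model.module.predict(input_ids, segment_ids, input_mask)
else:
    logits = model.predict(input_ids, segment_ids, input_mask)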

How to train with my own dataset?

Hello, I would like to train with my own dataset, but its labels differ from those in CLUENER. How should the code be modified? Thanks.

Metric computation is wrong

Testing with both the provided official dataset and my own dataset gives precision, recall, and FB1 all equal to 0. I do not know why.

How to run prediction on the test set

After processing with your dataset-processing code, the labels of the test set are all 'O'; after loading the model and running prediction, only the number of predictions is returned, and metrics such as accuracy are not shown.

Advice for tasks with a large number of labels

If my named-entity recognition task has several thousand labels and each label does not occur very often (some may appear only a few times), what should be changed compared to this project? For example, should the learning rate be reduced, and by how many orders of magnitude? Should the BiRNN be used? Thanks!

Wrong prediction results on the test set?

It looks like the test data is also split by sentence length? Are blank lines inserted between long sentences to separate the inputs? Why is the character right before each blank line dropped from the prediction output? As a result, the test file has 52,316 lines while the prediction output has 50,971 lines.
