
bert-ner-pytorch's Introduction

Chinese NER using BERT

BERT for Chinese NER.

update: some other approaches are also worth a look, including Biaffine, GlobalPointer, etc.: see examples

dataset list

  1. cner: datasets/cner
  2. CLUENER: https://github.com/CLUEbenchmark/CLUENER

model list

  1. BERT+Softmax
  2. BERT+CRF
  3. BERT+Span

requirement

  1. 1.1.0 <= PyTorch < 1.5.0
  2. cuda=9.0
  3. python3.6+

input format

The input follows the BIOS tag scheme (preferred), with one character and its label per line. Sentences are separated by a blank line.

美	B-LOC
国	I-LOC
的	O
华	B-PER
莱	I-PER
士	I-PER

我	O
跟	O
他	O
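A minimal reader sketch for this format (not part of the repository), assuming UTF-8 files with one char/label pair per line and a blank line between sentences:

def read_bios_file(path):
    sentences, chars, labels = [], [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:                      # blank line ends the current sentence
                if chars:
                    sentences.append((chars, labels))
                    chars, labels = [], []
                continue
            char, label = line.split()        # works for tab- or space-separated pairs
            chars.append(char)
            labels.append(label)
    if chars:                                 # handle a file without a trailing blank line
        sentences.append((chars, labels))
    return sentences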

run the code

  1. Modify the configuration information in run_ner_xxx.py or run_ner_xxx.sh.
  2. sh scripts/run_ner_xxx.sh

note: expected file structure of the pretrained model directory

├── prev_trained_model
|  └── bert_base
|  |  └── pytorch_model.bin
|  |  └── config.json
|  |  └── vocab.txt
|  |  └── ......
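As a hedged sketch, loading the converted weights from this directory might look like the following; the import path assumes the standalone transformers package, whereas the repository itself bundles its own copy under models/transformers.

from transformers import BertConfig, BertTokenizer, BertModel  # assumption: standalone transformers installed

model_dir = "prev_trained_model/bert_base"
config = BertConfig.from_pretrained(model_dir)                # reads config.json
tokenizer = BertTokenizer.from_pretrained(model_dir)          # reads vocab.txt
model = BertModel.from_pretrained(model_dir, config=config)   # reads pytorch_model.bin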

CLUENER result

The overall performance of BERT on dev:

Model                          Accuracy (entity)  Recall (entity)  F1 score (entity)
BERT+Softmax                   0.7897             0.8031           0.7963
BERT+CRF                       0.7977             0.8177           0.8076
BERT+Span                      0.8132             0.8092           0.8112
BERT+Span+adv                  0.8267             0.8073           0.8169
BERT-small(6 layers)+Span+kd   0.8241             0.7839           0.8051
BERT+Span+focal_loss           0.8121             0.8008           0.8064
BERT+Span+label_smoothing      0.8235             0.7946           0.8088

ALBERT for CLUENER

The overall performance of ALBERT on dev:

Model   Version        Accuracy (entity)  Recall (entity)  F1 (entity)  Train time/epoch
albert  base_google    0.8014             0.6908           0.7420       0.75x
albert  large_google   0.8024             0.7520           0.7763       2.1x
albert  xlarge_google  0.8286             0.7773           0.8021       6.7x
bert    google         0.8118             0.8031           0.8074       -----
albert  base_bright    0.8068             0.7529           0.7789       0.75x
albert  large_bright   0.8152             0.7480           0.7802       2.2x
albert  xlarge_bright  0.8222             0.7692           0.7948       7.3x

Cner result

The overall performance of BERT on dev (test):

Model                       Accuracy (entity)  Recall (entity)   F1 score (entity)
BERT+Softmax                0.9586 (0.9566)    0.9644 (0.9613)   0.9615 (0.9590)
BERT+CRF                    0.9562 (0.9539)    0.9671 (0.9644)   0.9616 (0.9591)
BERT+Span                   0.9604 (0.9620)    0.9617 (0.9632)   0.9611 (0.9626)
BERT+Span+focal_loss        0.9516 (0.9569)    0.9644 (0.9681)   0.9580 (0.9625)
BERT+Span+label_smoothing   0.9566 (0.9568)    0.9624 (0.9656)   0.9595 (0.9612)

bert-ner-pytorch's People

Contributors

lonepatient


bert-ner-pytorch's Issues

About classification accuracy

Running run_ner_span, training finished after only two epochs and the accuracy is very low. Did I set the run parameters incorrectly?
07/25/2020 03:30:00 - INFO - root - ***** Eval results *****
07/25/2020 03:30:00 - INFO - root - acc: 0.5564 - recall: 0.1478 - f1: 0.2335 - loss: 0.2043
07/25/2020 03:30:00 - INFO - root - ***** Entity results *****
07/25/2020 03:30:00 - INFO - root - ******* address results ********
07/25/2020 03:30:00 - INFO - root - acc: 0.5385 - recall: 0.0563 - f1: 0.1019
07/25/2020 03:30:00 - INFO - root - ******* book results ********
07/25/2020 03:30:00 - INFO - root - acc: 0.6000 - recall: 0.0974 - f1: 0.1676
07/25/2020 03:30:00 - INFO - root - ******* company results ********
07/25/2020 03:30:00 - INFO - root - acc: 0.4541 - recall: 0.2354 - f1: 0.3101
07/25/2020 03:30:00 - INFO - root - ******* game results ********
07/25/2020 03:30:00 - INFO - root - acc: 0.6018 - recall: 0.6814 - f1: 0.6391
07/25/2020 03:30:00 - INFO - root - ******* government results ********
07/25/2020 03:30:00 - INFO - root - acc: 0.5000 - recall: 0.0405 - f1: 0.0749
07/25/2020 03:30:00 - INFO - root - ******* movie results ********
07/25/2020 03:30:00 - INFO - root - acc: 0.6875 - recall: 0.2185 - f1: 0.3317
07/25/2020 03:30:00 - INFO - root - ******* name results ********
07/25/2020 03:30:00 - INFO - root - acc: 0.4324 - recall: 0.0688 - f1: 0.1187
07/25/2020 03:30:00 - INFO - root - ******* organization results ********
07/25/2020 03:30:00 - INFO - root - acc: 0.6769 - recall: 0.1199 - f1: 0.2037
07/25/2020 03:30:00 - INFO - root - ******* position results ********
07/25/2020 03:30:00 - INFO - root - acc: 1.0000 - recall: 0.0139 - f1: 0.0273
07/25/2020 03:30:00 - INFO - root - ******* scene results ********
07/25/2020 03:30:00 - INFO - root - acc: 0.3333 - recall: 0.0144 - f1: 0.0275
Then it just terminated automatically.

predict

Is the best model saved at outputs/cluener_output/bert/pytorch_model.bin or at outputs/cluener_output/bert/checkpoint-XXXX/pytorch_model.bin? I am a bit confused after reading the predict code.

Pretrained model

Which BERT pretrained model did you use here? I used Hugging Face's bert-base-chinese; not only is the training performance poor, but the code also keeps reporting that the vocab indices do not match.

Where can the pretrained files be downloaded?

The chinese_L-12_H-768_A-12 archive downloaded from google-research only contains bert_model.ckpt, vocab.txt, and bert_config.json,
but the code seems to need different files:
OSError: Error no file named ['pytorch_model.bin', 'tf_model.h5', 'model.ckpt.index'] found in directory prev_trained_model/bert-base/bert-base-chinese or from_tf set to False
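The Google release ships TensorFlow-format weights, while the code expects pytorch_model.bin. A hedged conversion sketch, assuming the standalone transformers package is installed (with TensorFlow available so the checkpoint can be read):

from transformers import BertConfig, BertForPreTraining  # assumption: standalone transformers + tensorflow installed

tf_dir = "chinese_L-12_H-768_A-12"
config = BertConfig.from_json_file(tf_dir + "/bert_config.json")
# from_tf=True loads bert_model.ckpt.* and maps the weights into the PyTorch model
model = BertForPreTraining.from_pretrained(tf_dir + "/bert_model.ckpt", from_tf=True, config=config)
# writes pytorch_model.bin and config.json; copy vocab.txt into the same directory afterwards
model.save_pretrained("prev_trained_model/bert-base/bert-base-chinese")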

CLUENER results cannot be reproduced

Hello. Using the hyperparameters from the run scripts under scripts/, I tested CLUENER on a single GPU, and every model came out roughly 1.5% below the numbers you report. Were the results in your README produced with the hyperparameters under scripts/? On how many GPUs were they run?

Some details about MRC-NER

How are sentences that contain no entity, or multiple entities, handled?

When splitting by max_seq, how are entities that originally sat within the same paragraph split?

In the eval algorithm, the number of ground-truth (gold) entities is not accurate at this point, so the results differ from the standard CoNLL NER evaluation script. Different algorithms use different metrics; are the results still comparable?

Is this DiceLoss formula correct, and how should it be understood?

def forward(self, input, target):
    '''
    input: [N, C]
    target: [N, ]
    '''
    prob = torch.softmax(input, dim=1)
    # probability assigned to the gold class, shape [N, 1]
    prob = torch.gather(prob, dim=1, index=target.unsqueeze(1))
    dsc_i = 1 - ((1 - prob) * prob) / ((1 - prob) * prob + 1)
    dice_loss = dsc_i.mean()
    return dice_loss

In the paper the formula is:
DSC(x_i) = (2 * (1 - p_i) * p_i * y_i + γ) / ((1 - p_i) * p_i + y_i + γ)
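For comparison, here is a minimal sketch (not the repository's implementation) of the smoothed formula above, gathering the gold-class probability the same way as the quoted code; gamma is the smoothing constant.

import torch
import torch.nn as nn

class SelfAdjustingDiceLoss(nn.Module):
    # Sketch of the paper's smoothed DSC loss; with y_i = 1 for the gathered gold class,
    # DSC = (2 * (1 - p) * p + gamma) / ((1 - p) * p + 1 + gamma) and the loss is 1 - DSC.
    def __init__(self, gamma=1.0):
        super().__init__()
        self.gamma = gamma

    def forward(self, input, target):
        prob = torch.softmax(input, dim=1)                           # [N, C]
        prob = torch.gather(prob, dim=1, index=target.unsqueeze(1))  # [N, 1] gold-class probability
        dsc = (2 * (1 - prob) * prob + self.gamma) / ((1 - prob) * prob + 1 + self.gamma)
        return (1 - dsc).mean()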

Hello, I would like to ask what the train.json file in the datasets directory is.

Running sh run_ner_crf.py directly produced the following error:
Traceback (most recent call last):
File "run_ner_crf.py", line 496, in
main()
File "run_ner_crf.py", line 436, in main
train_dataset = load_and_cache_examples(args, args.task_name, tokenizer, data_type='train')
File "run_ner_crf.py", line 336, in load_and_cache_examples
examples = processor.get_train_examples(args.data_dir)
File "/home/wei/A_TestProject/BERT-NER-Pytorch-master/processors/ner_seq.py", line 204, in get_train_examples
return self._create_examples(self._read_json(os.path.join(data_dir, "train.json")), "train")
File "/home/wei/A_TestProject/BERT-NER-Pytorch-master/processors/utils_ner.py", line 75, in _read_json
with open(input_file,'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: '/home/wei/A_TestProject/BERT-NER-Pytorch-master/datasets/cluener/train.json'

Error when initializing the model

Hello, when I run the code up to model initialization, an error is raised at line 358 of models/transformers/modeling_utils.py:
[screenshot]
When running the code, the console only prints "Process finished with exit code -1".
When stepping into the failing function with a debugger, the following error messages appear:
[screenshot]
[screenshot]

What could be causing this?
My environment is Python 3.6 and PyTorch 1.2.

Thanks a lot!

mask in crf

Hello,

If attention_mask is used as the CRF mask, then when a word has multiple sub-tokens all of them are kept. In BERT for NER, classification is usually done on the first token of each word.

https://github.com/lonePatient/BERT-NER-Pytorch/blob/master/models/bert_for_ner.py#L64
Also, at decode time:

tags = model.crf.decode(logits, inputs['attention_mask'])

The mask here is also the attention mask, so every token from [CLS] to [SEP], and everything in between, is kept and used for decoding. Is it reasonable to set the mask this way, or should only the first token of each word be kept? Thanks!
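As an illustrative sketch only (the repository tokenizes Chinese at the character level and does not do this; word_ids() is assumed to come from a Hugging Face fast tokenizer), a mask that keeps only the first sub-token of each word could be built like this:

import torch

def first_subtoken_mask(word_ids_batch, seq_len):
    # word_ids_batch: one list per example from a fast tokenizer's word_ids(),
    # where None marks special tokens such as [CLS]/[SEP]/[PAD]
    mask = torch.zeros(len(word_ids_batch), seq_len, dtype=torch.bool)
    for i, word_ids in enumerate(word_ids_batch):
        prev = None
        for j, wid in enumerate(word_ids):
            if wid is not None and wid != prev:
                mask[i, j] = True   # first sub-token of a new word
            prev = wid
    return mask

Whether such a mask can be passed straight to the bundled CRF depends on its implementation (many CRF layers require the first timestep to be unmasked); a common alternative design is to gather the emission logits at these positions before running the CRF.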

Hello, I have a question!

preds = tags[0][1:-1] # [CLS]XXXX[SEP]

At predict time, why do you take [1:-1] of the CRF output? Shouldn't the CRF output length be the max_seq_length set earlier? If so, slicing like this would not actually remove the [SEP] token. Or is the CRF output at predict time not padded to the maximum length? Thanks!

CRF call raises an error under multi-GPU

02/25/2020 13:50:42 - INFO - root - ***** Running evaluation *****
02/25/2020 13:50:42 - INFO - root - Num examples = 1343
02/25/2020 13:50:42 - INFO - root - Batch size = 48
Traceback (most recent call last):
File "run_ner_crf.py", line 517, in
main()
File "run_ner_crf.py", line 459, in main
global_step, tr_loss = train(args, train_dataset, model, tokenizer)
File "run_ner_crf.py", line 148, in train
evaluate(args, model, tokenizer)
File "run_ner_crf.py", line 197, in evaluate
tags,_ = model.crf._obtain_labels(logits, args.id2label, inputs['input_lens'])
File "/root/.pyenv/versions/3.7.2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 591, in getattr
type(self).name, name))
AttributeError: 'DataParallel' object has no attribute 'crf'

After investigating: the crf module is custom-defined, and under multi-GPU the model is wrapped with DataParallel, which does not expose this custom crf attribute; that is what triggers the error.
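A common workaround, sketched here rather than taken from the repository, is to unwrap DataParallel before touching custom attributes:

def unwrap_model(model):
    # nn.DataParallel stores the original model in .module; return it when present
    return model.module if hasattr(model, "module") else model

# e.g. in evaluate():
# tags, _ = unwrap_model(model).crf._obtain_labels(logits, args.id2label, inputs['input_lens'])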

Cannot find the pretrained model

Hello, when fine-tuning I found that the pretrained model cannot be located. Where can the Chinese PyTorch pretrained model be downloaded?

Loss does not converge, all metrics are 0

pretrain_model contains config.json, vocab.txt, and pytorch_model.bin.

The following log appears when loading the model:

[screenshot]

The loss does not converge during training:

[screenshot]

Question about bert-base

Thank you for sharing the code. May I ask:
where can I download the pre-trained model "bert-base" expected at
BERT-NER-Pytorch/prev_trained_model/bert-base/? Can you provide a download link?
Looking forward to your reply.

TypeError: __init__() got an unexpected keyword argument 'max_len'

Using the author's custom CNerTokenizer raises __init__() got an unexpected keyword argument 'max_len'.
The full error message is as follows:

File "BERT-NER-Pytorch-master/run_ner_softmax.py", line 549, in
  main()
File "BERT-NER-Pytorch-master/run_ner_softmax.py", line 480, in main
  cache_dir=args.cache_dir if args.cache_dir else None,)
File "BERT-NER-Pytorch-master\models\transformers\tokenization_utils.py", line 282, in from_pretrained
  return cls._from_pretrained(*inputs, **kwargs)
File "BERT-NER-Pytorch-master\models\transformers\tokenization_utils.py", line 411, in _from_pretrained
  tokenizer = cls(*init_inputs, **init_kwargs)
TypeError: __init__() got an unexpected keyword argument 'max_len'

P.S. Using BertTokenizer does not raise this error. I would also like to ask why the author defined a custom tokenizer; doesn't BertTokenizer already convert words that are not in the vocab to <UNK>?
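One possible fix, given only as a sketch (the base class and the import path are assumptions about the repository layout), is to let CNerTokenizer forward unexpected keyword arguments such as max_len to its parent class:

from models.transformers import BertTokenizer  # adjust the import to the repository layout if needed

class CNerTokenizer(BertTokenizer):
    def __init__(self, vocab_file, do_lower_case=False, **kwargs):
        # forward extra kwargs (e.g. max_len) to BertTokenizer instead of rejecting them
        super().__init__(vocab_file=vocab_file, do_lower_case=do_lower_case, **kwargs)
        self.do_lower_case = do_lower_case

    def tokenize(self, text):
        # character-level tokenization; characters outside the vocab map to the unk token
        tokens = []
        for c in text:
            if self.do_lower_case:
                c = c.lower()
            tokens.append(c if c in self.vocab else self.unk_token)
        return tokens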

StopIteration error?

First of all, thanks for this excellent open-source work; it matches my needs exactly.
However, when actually running it, the following error occurs and I have no idea what is going wrong. Please advise!
Looking forward to your reply!

07/10/2020 16:14:08 - INFO - root - ***** Running training *****
07/10/2020 16:14:08 - INFO - root - Num examples = 10748
07/10/2020 16:14:08 - INFO - root - Num Epochs = 4
07/10/2020 16:14:08 - INFO - root - Instantaneous batch size per GPU = 24
07/10/2020 16:14:08 - INFO - root - Total train batch size (w. parallel, distributed & accumulation) = 48
07/10/2020 16:14:08 - INFO - root - Gradient Accumulation steps = 1
07/10/2020 16:14:08 - INFO - root - Total optimization steps = 896
Traceback (most recent call last):
File "run_ner_crf.py", line 497, in
main()
File "run_ner_crf.py", line 438, in main
global_step, tr_loss = train(args, train_dataset, model, tokenizer)
File "run_ner_crf.py", line 132, in train
outputs = model(**inputs)
File "/home/user/.conda/envs/torch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in call
result = self.forward(*input, **kwargs)
File "/home/user/.conda/envs/torch/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 155, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home/user/.conda/envs/torch/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 165, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/user/.conda/envs/torch/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
output.reraise()
File "/home/user/.conda/envs/torch/lib/python3.6/site-packages/torch/_utils.py", line 395, in reraise
raise self.exc_type(msg)
StopIteration: Caught StopIteration in replica 0 on device 0.
Original Traceback (most recent call last):
File "/home/user/.conda/envs/torch/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
output = module(*input, **kwargs)
File "/home/user/.conda/envs/torch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in call
result = self.forward(*input, **kwargs)
File "/mnt/stephen-lib/stephen的个人文件夹/my_code/NLP组件研发/细粒度实体识别/BERT-NER-Pytorch/models/bert_for_ner.py", line 58, in forward
outputs =self.bert(input_ids = input_ids,attention_mask=attention_mask,token_type_ids=token_type_ids)
File "/home/user/.conda/envs/torch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in call
result = self.forward(*input, **kwargs)
File "/mnt/stephen-lib/stephen的个人文件夹/my_code/NLP组件研发/细粒度实体识别/BERT-NER-Pytorch/models/transformers/modeling_bert.py", line 606, in forward
extended_attention_mask = extended_attention_mask.to(dtype=next(self.parameters()).dtype) # fp16 compatibility
StopIteration
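This StopIteration is commonly reported when running nn.DataParallel on PyTorch 1.5 or newer, where model replicas no longer expose their parameters, so the next(self.parameters()) call that the traceback points at (modeling_bert.py, line 606) fails; staying within the README's 1.1.0 <= PyTorch < 1.5.0 requirement avoids it. If a newer torch must be used, a hedged sketch of a guard around that dtype lookup:

import torch
from torch import nn

def safe_param_dtype(module: nn.Module, fallback=torch.float32):
    # Replicas created by nn.DataParallel on torch >= 1.5 expose no parameters,
    # so next(module.parameters()) raises StopIteration; fall back to a fixed dtype.
    try:
        return next(module.parameters()).dtype
    except StopIteration:
        return fallback

The cast of extended_attention_mask could then use safe_param_dtype(self) instead of next(self.parameters()).dtype.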
