kamalkraj / bert-ner
Pytorch-Named-Entity-Recognition-with-BERT
License: GNU Affero General Public License v3.0
Hi, I just noticed that inference from Python does not run on the GPU; it seems to use only the CPU. How can we move it onto the GPU? Is there an option for this, or has it not been implemented?
Thanks!
Hoping for a reply, thanks!
06/09/2019 23:16:28 - INFO - main - ***** Running training *****
06/09/2019 23:16:28 - INFO - main - Num examples = 14041
06/09/2019 23:16:28 - INFO - main - Batch size = 32
06/09/2019 23:16:28 - INFO - main - Num steps = 2190
Epoch: 0%| | 0/5 [00:00<?, ?it/s]
Traceback (most recent call last): | 0/439 [00:00<?, ?it/s]
  File "run_ner.py", line 534, in <module>
    main()
  File "run_ner.py", line 430, in main
    loss = model(input_ids, segment_ids, input_mask, label_ids)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ub16c9/ub16_prj/pytorch-pretrained-BERT/pytorch_pretrained_bert/modeling.py", line 1022, in forward
    sequence_output, _ = self.bert(input_ids, token_type_ids, attention_mask, output_all_encoded_layers=False)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ub16c9/ub16_prj/pytorch-pretrained-BERT/pytorch_pretrained_bert/modeling.py", line 628, in forward
    embedding_output = self.embeddings(input_ids, token_type_ids)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ub16c9/ub16_prj/pytorch-pretrained-BERT/pytorch_pretrained_bert/modeling.py", line 198, in forward
    embeddings = self.LayerNorm(embeddings)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/apex/normalization/fused_layer_norm.py", line 149, in forward
    input, self.weight, self.bias)
  File "/usr/local/lib/python3.6/dist-packages/apex/normalization/fused_layer_norm.py", line 21, in forward
    input_, self.normalized_shape, weight_, bias_, self.eps)
RuntimeError: a Tensor with 3145728 elements cannot be converted to Scalar (item at /pytorch/aten/src/ATen/native/Scalar.cpp:9)
While reading the code, I had a question about the following part:
for i in range(batch_size):
    jj = -1
    for j in range(max_len):
        if valid_ids[i][j].item() == 1:
            jj += 1
            valid_output[i][jj] = sequence_output[i][j]
Why is the target valid_output[i][jj] and not valid_output[i][j]? Could you explain it?
Thanks.
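One way to read the loop in the question, as a plain-Python sketch rather than the repository's code: jj counts how many valid positions have been seen so far, so every valid vector is written to the next free slot at the front of the row, and the skipped positions end up as padding at the tail. Writing to index j instead would leave gaps where the skipped positions were. The function name compact_valid and the toy values are illustrative assumptions.

```python
def compact_valid(sequence_output, valid_ids, pad=0):
    """Move the entries whose valid flag is 1 to the front of each row."""
    batch_size = len(sequence_output)
    max_len = len(sequence_output[0])
    valid_output = [[pad] * max_len for _ in range(batch_size)]
    for i in range(batch_size):
        jj = -1  # index of the last slot filled in row i
        for j in range(max_len):
            if valid_ids[i][j] == 1:
                jj += 1
                valid_output[i][jj] = sequence_output[i][j]
    return valid_output


# Toy row: "##mb" is a continuation piece and the last position is padding.
row = [["cls", "EU", "la", "##mb", "sep", "pad"]]
flags = [[1, 1, 1, 0, 1, 0]]
compacted = compact_valid(row, flags)
```

With this layout the compacted row lines up one-to-one with word-level labels, which is presumably why the original loop is written this way.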
Fine-tuning on the CoNLL-03 task took me only a few minutes?
So I think I did something wrong somewhere.
In my opinion, you should exclude the 'X' label from the evaluation. Because you add a label that the standard dataset does not have, I can't tell how much of the F1-score increase comes from the extra 'X' label. I think the 'X' label is not equivalent to the 'O' label in the standard dataset and the BERT paper, but in your code the two may be treated the same.
Line 94 in c12b2ec
Have made it work for me locally with:
output = [ {"word": word, "tag": label, "confidence": confidence} for word, label, confidence in zip(words, labels, logits_confidence) ]
Line 116 in 48a868b
To begin with, thank you very much for sharing the code; it saved me a huge amount of time!
The if statement on line 116 should be inside the for loop above it. Otherwise the output is a list of tuples, each holding a list of a sentence's words followed by a list of the corresponding tags, e.g.:
[(['-', 'JAPAN', 'GET', 'LUCKY', 'WIN', ',', 'CHINA', 'IN', 'SURPRISE', 'DEFEAT', '.'],
['O', 'B-LOC', 'O', 'O', 'O', 'O', 'B-PER', 'O', 'O', 'O', 'O']),
(['Nadim', 'Ladki'], ['B-PER', 'I-PER']),
(['AL-AIN', ',', 'United', 'Arab', 'Emirates', '1996-12-06'],
['B-LOC', 'O', 'B-LOC', 'I-LOC', 'I-LOC', 'O'])]
If the desired output is as suggested in the code, i.e.
[ ['EU', 'B-ORG'], ['rejects', 'O'], ['German', 'B-MISC'], ['call', 'O'], ['to', 'O'], ['boycott', 'O'], ['British', 'B-MISC'], ['lamb', 'O'], ['.', 'O'] ]
then the if statement could be modified as:
if len(sentence) > 0: sentence.extend(label); data.append(sentence); sentence = []; label = []
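For comparison, here is a minimal sketch of a reader that returns the pair format mentioned above, one list of [word, tag] pairs per sentence. It is not the repository's function: the name read_conll is hypothetical, and it assumes whitespace-separated columns with the word first and the tag last, blank lines between sentences, and -DOCSTART- lines that can be skipped.

```python
def read_conll(lines):
    """Return one list of [word, tag] pairs per sentence."""
    data, sentence = [], []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("-DOCSTART-"):
            # A blank line or document marker closes the current sentence.
            if sentence:
                data.append(sentence)
                sentence = []
            continue
        columns = line.split()
        sentence.append([columns[0], columns[-1]])
    if sentence:
        # Flush the last sentence when the input lacks a trailing blank line.
        data.append(sentence)
    return data


sample = [
    "-DOCSTART- -X- -X- O",
    "",
    "EU NNP B-NP B-ORG",
    "rejects VBZ B-VP O",
    "",
    "Peter NNP B-NP B-PER",
    "Blackburn NNP I-NP I-PER",
]
sentences = read_conll(sample)
```

Keeping the flush inside the loop and repeating it once after the loop avoids losing the final sentence when the file does not end with a blank line.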
If a small --max_seq_length is used (32 in the example below), [SEP] shows up in the results.
The lower max_seq_length is, the more [SEP] entries appear.
             precision    recall  f1-score   support
        LOC     0.9248    0.9248    0.9248      1529
        PER     0.9556    0.9582    0.9569      1436
        ORG     0.8860    0.8993    0.8926      1539
       MISC     0.7620    0.8242    0.7919       637
      [SEP]     1.0000    1.0000    1.0000       637
avg / total     0.9125    0.9235    0.9178      5778
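The [SEP] row suggests that the marker is being scored as if it were an entity once sequences get truncated. One possible guard, offered as a sketch and not as a confirmed fix for this repository, is to drop every position whose gold label is a special marker before building the report. The set SPECIAL_LABELS and the function strip_special are assumptions made for illustration.

```python
SPECIAL_LABELS = {"[CLS]", "[SEP]", "X"}  # assumed marker labels


def strip_special(y_true, y_pred, special=SPECIAL_LABELS):
    """Remove marker positions so only real word labels are scored."""
    clean_true, clean_pred = [], []
    for true_seq, pred_seq in zip(y_true, y_pred):
        kept = [(t, p) for t, p in zip(true_seq, pred_seq) if t not in special]
        clean_true.append([t for t, _ in kept])
        # A marker predicted on a real word is counted as "O".
        clean_pred.append(["O" if p in special else p for _, p in kept])
    return clean_true, clean_pred


y_true = [["[CLS]", "B-LOC", "O", "[SEP]"]]
y_pred = [["[CLS]", "B-LOC", "[SEP]", "[SEP]"]]
clean_true, clean_pred = strip_special(y_true, y_pred)
```

The cleaned sequences can then be passed to whatever report function is already in use.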
In the original paper, the input consists of three parameters: input_ids, input_mask, and segment_ids. Your code also includes valid_positions, which is difficult for me to understand. Can you explain it? Thanks.
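A hedged reading of valid_positions, assuming it marks which WordPiece tokens start a word: the first piece of each word gets 1 and every "##" continuation gets 0, so that only one position per original word is labelled. The helper below is a hypothetical illustration in plain Python, not the repository's code; the token list is taken from the training log quoted elsewhere on this page.

```python
def valid_positions(wordpieces):
    """Flag 1 for the first piece of each word, 0 for '##' continuations."""
    return [0 if piece.startswith("##") else 1 for piece in wordpieces]


pieces = ["EU", "rejects", "German", "call", "to", "boycott",
          "British", "la", "##mb", "."]
valid = valid_positions(pieces)
first_pieces = [p for p, v in zip(pieces, valid) if v == 1]
```

Here "lamb" is split into "la" and "##mb", and only "la" keeps a flag of 1, so nine flags are set for the nine original words.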
When running BERT NER in a single container the code executes fine, but when it is scaled to run across 2+ containers the speed drops drastically (from 20 seconds to 20 minutes).
This has been isolated as a problem with BERT, as it does not happen without it. Have you come across anything like this?
The problem occurs specifically when calling model.predict():
for one sentence it takes ~20s to output a prediction.
EDIT:
It is slowing down at lines 88-89 in bert.py:
with torch.no_grad():
    logits = self.model(input_ids, segment_ids, input_mask,valid_ids)
In the latest version, how are WordPiece tokens and labels aligned? For example, is "Jim Hen ##son was a puppet ##eer" reduced to [Jim, Hen, was, a, puppet]?
Do you keep only the first WordPiece token of each word?
Have you tried other methods for an experimental comparison?
Hoping for your reply ^^
Thanks
While training a custom NER model with the large model as the pretrained one, I get this error: ZeroDivisionError: Weights sum to zero, can't be normalized
Hello guys,
Anytime I try to run the script, I get this error.
Any suggestions on how to fix it?
(base) C:\Users\user1\Desktop\BERT-NER-experiment>activate neuro
(neuro) C:\Users\user1\Desktop\BERT-NER-experiment>python
Python 3.6.8 |Anaconda, Inc.| (default, Feb 21 2019, 18:30:04) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from bert import Ner
Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex.
>>> model = Ner("out_!x/")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\user1\Desktop\BERT-NER-experiment\bert.py", line 35, in __init__
    self.model , self.tokenizer, self.model_config = self.load_model(model_dir)
  File "C:\Users\user1\Desktop\BERT-NER-experiment\bert.py", line 48, in load_model
    model.load_state_dict(torch.load(output_model_file))
  File "C:\Users\user1\Anaconda3\envs\neuro\lib\site-packages\torch\serialization.py", line 387, in load
    return _load(f, map_location, pickle_module, **pickle_load_args)
  File "C:\Users\user1\Anaconda3\envs\neuro\lib\site-packages\torch\serialization.py", line 574, in _load
    result = unpickler.load()
  File "C:\Users\user1\Anaconda3\envs\neuro\lib\site-packages\torch\serialization.py", line 537, in persistent_load
    deserialized_objects[root_key] = restore_location(obj, location)
  File "C:\Users\user1\Anaconda3\envs\neuro\lib\site-packages\torch\serialization.py", line 119, in default_restore_location
    result = fn(storage, location)
  File "C:\Users\user1\Anaconda3\envs\neuro\lib\site-packages\torch\serialization.py", line 95, in _cuda_deserialize
    device = validate_cuda_device(location)
  File "C:\Users\user1\Anaconda3\envs\neuro\lib\site-packages\torch\serialization.py", line 79, in validate_cuda_device
    raise RuntimeError('Attempting to deserialize object on a CUDA '
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location='cpu' to map your storages to the CPU.
>>> output = model.predict("Steve went to Paris")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'model' is not defined
Hey, I would like to know how fast your predictions were for a single request with multiple entities. Did you perform any load testing, and if so, what were the results?
I would also like to know about approaches for fine-tuning a custom NER model using BERT. If you know of any, please help me.
Thanks.
Because of the syntax error I had to change this line from a list
output = [word:{"tag":label,"confidence":confidence} for word,label,confidence in zip(words,labels,logits_confidence)]
to a dictionary
output = {word:{"tag":label,"confidence":confidence} for word,label,confidence in zip(words,labels,logits_confidence)}
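One thing to keep in mind with the dictionary form: it is keyed by the word, so a word that occurs twice in a sentence keeps only its last tag. The sketch below uses made-up words and confidence values purely to show the difference against a list of dictionaries.

```python
words = ["to", "be", "or", "not", "to", "be"]
labels = ["O"] * len(words)
confidences = [0.91, 0.92, 0.93, 0.94, 0.95, 0.96]  # made-up values

# Keyed by word: later occurrences overwrite earlier ones.
as_dict = {
    word: {"tag": label, "confidence": conf}
    for word, label, conf in zip(words, labels, confidences)
}

# One entry per token: repeated words are preserved in order.
as_list = [
    {"word": word, "tag": label, "confidence": conf}
    for word, label, conf in zip(words, labels, confidences)
]
```

If token order or repeated words matter downstream, the list form is the safer output shape.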
When I pass the --fp16 parameter to train faster, I get the following error:
Traceback (most recent call last): | 0/1 [00:00<?, ?it/s]
File "run_ner.py", line 594, in <module>
main()
File "run_ner.py", line 487, in main
loss = model(input_ids, segment_ids, input_mask, label_ids,valid_ids,l_mask)
File "/root/miniconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "run_ner.py", line 46, in forward
logits = self.classifier(sequence_output)
File "/root/miniconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "/root/miniconda3/lib/python3.6/site-packages/torch/nn/modules/linear.py", line 87, in forward
return F.linear(input, self.weight, self.bias)
File "/root/miniconda3/lib/python3.6/site-packages/torch/nn/functional.py", line 1371, in linear
output = input.matmul(weight.t())
RuntimeError: Expected object of scalar type Float but got scalar type Half for argument #2 'mat2'
Hello,
Excellent script for BERT NER. I am wondering whether this script can be used, with slight changes, to train a token-based classification task, i.e. one similar to NER but where the token (target word) to be classified is given in advance. An example would be training a model for word-sense disambiguation: given a word in a sentence, classify its sense.
E.g. in "He went to the store", the target word is "went" and its sense is "motion".
Any idea?
ub16c9@ub16c9-gpu:/media/ub16c9/fcd84300-9270-4bbd-896a-5e04e79203b7/ub16_prj/BERT-NER+kamalkraj$ python3.5 run_ner.py --data_dir=data/ --bert_model=bert-base-cased --task_name=ner --output_dir=out --max_seq_length=128 --do_train --num_train_epochs 5 --do_eval --warmup_proportion=0.4
Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex.
06/11/2019 19:02:56 - INFO - main - device: cuda n_gpu: 1, distributed training: False, 16-bits training: False
06/11/2019 19:02:57 - INFO - pytorch_pretrained_bert.tokenization - loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-vocab.txt from cache at /home/ub16c9/.pytorch_pretrained_bert/5e8a2b4893d13790ed4150ca1906be5f7a03d6c4ddf62296c383f6db42814db2.e13dbb970cb325137104fb2e5f36fe865f27746c6b526f6352861b1980eb80b1
06/11/2019 19:02:58 - INFO - pytorch_pretrained_bert.modeling - loading archive file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased.tar.gz from cache at /home/ub16c9/.pytorch_pretrained_bert/distributed_-1/a803ce83ca27fecf74c355673c434e51c265fb8a3e0e57ac62a80e38ba98d384.681017f415dfb33ec8d0e04fe51a619f3f01532ecea04edbfd48c5d160550d9c
06/11/2019 19:02:58 - INFO - pytorch_pretrained_bert.modeling - extracting archive file /home/ub16c9/.pytorch_pretrained_bert/distributed_-1/a803ce83ca27fecf74c355673c434e51c265fb8a3e0e57ac62a80e38ba98d384.681017f415dfb33ec8d0e04fe51a619f3f01532ecea04edbfd48c5d160550d9c to temp dir /tmp/tmpyj8ar20e
06/11/2019 19:03:01 - INFO - pytorch_pretrained_bert.modeling - Model config {
"attention_probs_dropout_prob": 0.1,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"initializer_range": 0.02,
"intermediate_size": 3072,
"max_position_embeddings": 512,
"num_attention_heads": 12,
"num_hidden_layers": 12,
"type_vocab_size": 2,
"vocab_size": 28996
}
06/11/2019 19:03:05 - INFO - pytorch_pretrained_bert.modeling - Weights of BertForTokenClassification not initialized from pretrained model: ['classifier.weight', 'classifier.bias']
06/11/2019 19:03:05 - INFO - pytorch_pretrained_bert.modeling - Weights from pretrained model not used in BertForTokenClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
06/11/2019 19:03:07 - INFO - main - *** Example ***
06/11/2019 19:03:07 - INFO - main - guid: train-0
06/11/2019 19:03:07 - INFO - main - tokens: EU rejects German call to boycott British la ##mb .
06/11/2019 19:03:07 - INFO - main - input_ids: 101 7270 22961 1528 1840 1106 21423 1418 2495 12913 119 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
06/11/2019 19:03:07 - INFO - main - input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
06/11/2019 19:03:07 - INFO - main - segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
06/11/2019 19:03:07 - INFO - main - *** Example ***
06/11/2019 19:03:07 - INFO - main - guid: train-1
06/11/2019 19:03:07 - INFO - main - tokens: Peter Blackburn
06/11/2019 19:03:07 - INFO - main - input_ids: 101 1943 14428 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
06/11/2019 19:03:07 - INFO - main - input_mask: 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
06/11/2019 19:03:07 - INFO - main - segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
06/11/2019 19:03:07 - INFO - main - *** Example ***
06/11/2019 19:03:07 - INFO - main - guid: train-2
06/11/2019 19:03:07 - INFO - main - tokens: BR ##US ##SE ##LS 1996 - 08 - 22
06/11/2019 19:03:07 - INFO - main - input_ids: 101 26660 13329 12649 15928 1820 118 4775 118 1659 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
06/11/2019 19:03:07 - INFO - main - input_mask: 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
06/11/2019 19:03:07 - INFO - main - segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
06/11/2019 19:03:07 - INFO - main - *** Example ***
06/11/2019 19:03:07 - INFO - main - guid: train-3
06/11/2019 19:03:07 - INFO - main - tokens: The European Commission said on Thursday it disagreed with German advice to consumers to s ##hun British la ##mb until scientists determine whether mad cow disease can be transmitted to sheep .
06/11/2019 19:03:07 - INFO - main - input_ids: 101 1109 1735 2827 1163 1113 9170 1122 19786 1114 1528 5566 1106 11060 1106 188 17315 1418 2495 12913 1235 6479 4959 2480 6340 13991 3653 1169 1129 12086 1106 8892 119 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
06/11/2019 19:03:07 - INFO - main - input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
06/11/2019 19:03:07 - INFO - main - segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
06/11/2019 19:03:07 - INFO - main - *** Example ***
06/11/2019 19:03:07 - INFO - main - guid: train-4
06/11/2019 19:03:07 - INFO - main - tokens: Germany ' s representative to the European Union ' s veterinary committee Werner Z ##wing ##mann said on Wednesday consumers should buy sheep ##me ##at from countries other than Britain until the scientific advice was clearer .
06/11/2019 19:03:07 - INFO - main - input_ids: 101 1860 112 188 4702 1106 1103 1735 1913 112 188 27431 3914 14651 163 7635 4119 1163 1113 9031 11060 1431 4417 8892 3263 2980 1121 2182 1168 1190 2855 1235 1103 3812 5566 1108 27830 119 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
06/11/2019 19:03:07 - INFO - main - input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
06/11/2019 19:03:07 - INFO - main - segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
06/11/2019 19:03:10 - INFO - main - ***** Running training *****
06/11/2019 19:03:10 - INFO - main - Num examples = 14041
06/11/2019 19:03:10 - INFO - main - Batch size = 32
06/11/2019 19:03:10 - INFO - main - Num steps = 2190
Epoch:  40%|████      | 2/5 [07:01<10:33, 211.32s/it]
Epoch: 100%|██████████| 5/5 [17:28<00:00, 209.73s/it]
| 439/439 [03:27<00:00, 2.24it/s]
06/11/2019 19:20:40 - INFO - main - *** Example ***
06/11/2019 19:20:40 - INFO - main - guid: dev-0
06/11/2019 19:20:40 - INFO - main - tokens: CR ##IC ##KE ##T - L ##EI ##CE ##ST ##ER ##S ##H ##IR ##E T ##A ##KE O ##VE ##R AT TO ##P A ##FT ##ER IN ##NI ##NG ##S VI ##CT ##OR ##Y .
06/11/2019 19:20:40 - INFO - main - input_ids: 101 15531 9741 22441 1942 118 149 27514 10954 9272 9637 1708 3048 18172 2036 157 1592 22441 152 17145 2069 13020 16972 2101 138 26321 9637 15969 27451 11780 1708 7118 16647 9565 3663 119 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
06/11/2019 19:20:40 - INFO - main - input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
06/11/2019 19:20:40 - INFO - main - segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
06/11/2019 19:20:40 - INFO - main - *** Example ***
06/11/2019 19:20:40 - INFO - main - guid: dev-1
06/11/2019 19:20:40 - INFO - main - tokens: L ##ON ##D ##ON 1996 - 08 - 30
06/11/2019 19:20:40 - INFO - main - input_ids: 101 149 11414 2137 11414 1820 118 4775 118 1476 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
06/11/2019 19:20:40 - INFO - main - input_mask: 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
06/11/2019 19:20:40 - INFO - main - segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
06/11/2019 19:20:40 - INFO - main - *** Example ***
06/11/2019 19:20:40 - INFO - main - guid: dev-2
06/11/2019 19:20:40 - INFO - main - tokens: West Indian all - round ##er Phil Simmons took four for 38 on Friday as Leicestershire beat Somerset by an innings and 39 runs in two days to take over at the head of the county championship .
06/11/2019 19:20:40 - INFO - main - input_ids: 101 1537 1890 1155 118 1668 1200 5676 14068 1261 1300 1111 3383 1113 5286 1112 21854 3222 8860 1118 1126 6687 1105 3614 2326 1107 1160 1552 1106 1321 1166 1120 1103 1246 1104 1103 2514 2899 119 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
06/11/2019 19:20:40 - INFO - main - input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
06/11/2019 19:20:40 - INFO - main - segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
06/11/2019 19:20:40 - INFO - main - *** Example ***
06/11/2019 19:20:40 - INFO - main - guid: dev-3
06/11/2019 19:20:40 - INFO - main - tokens: Their stay on top , though , may be short - lived as title rivals Essex , Derbyshire and Surrey all closed in on victory while Kent made up for lost time in their rain - affected match against Nottinghamshire .
06/11/2019 19:20:40 - INFO - main - input_ids: 101 2397 2215 1113 1499 117 1463 117 1336 1129 1603 118 2077 1112 1641 9521 8493 117 15964 1105 9757 1155 1804 1107 1113 2681 1229 5327 1189 1146 1111 1575 1159 1107 1147 4458 118 4634 1801 1222 21942 119 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
06/11/2019 19:20:40 - INFO - main - input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
06/11/2019 19:20:40 - INFO - main - segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
06/11/2019 19:20:40 - INFO - main - *** Example ***
06/11/2019 19:20:40 - INFO - main - guid: dev-4
06/11/2019 19:20:40 - INFO - main - tokens: After bowling Somerset out for 83 on the opening morning at Grace Road , Leicestershire extended their first innings by 94 runs before being bowled out for 29 ##6 with England disc ##ard Andy C ##ad ##dick taking three for 83 .
06/11/2019 19:20:40 - INFO - main - input_ids: 101 1258 11518 8860 1149 1111 6032 1113 1103 2280 2106 1120 4378 1914 117 21854 2925 1147 1148 6687 1118 5706 2326 1196 1217 21663 1149 1111 1853 1545 1114 1652 6187 2881 4827 140 3556 25699 1781 1210 1111 6032 119 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
06/11/2019 19:20:40 - INFO - main - input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
06/11/2019 19:20:40 - INFO - main - segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
06/11/2019 19:20:41 - INFO - main - ***** Running evaluation *****
06/11/2019 19:20:41 - INFO - main - Num examples = 3250
06/11/2019 19:20:41 - INFO - main - Batch size = 8
Evaluating:  11%|█         | 45/407 [00:01<00:14, 24.73it/s]
Traceback (most recent call last):
  File "run_ner.py", line 534, in <module>
    main()
  File "run_ner.py", line 518, in main
    temp_2.append(label_map[logits[i][j]])
KeyError: 0
ub16c9@ub16c9-gpu
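A KeyError: 0 at that line means a predicted id of 0 was looked up in label_map and was not found. One plausible cause, stated here as an assumption and not verified against the script, is a label map numbered from 1 so that 0 is left for padding. The sketch below uses a hypothetical label list and a decode helper that falls back to "O" for unknown ids.

```python
# Hypothetical label list; the real one lives in the training script.
label_list = ["O", "B-MISC", "I-MISC", "B-PER", "I-PER",
              "B-ORG", "I-ORG", "B-LOC", "I-LOC", "[CLS]", "[SEP]"]

# Assumed numbering: ids start at 1, so 0 never appears as a key.
label_map = {i: label for i, label in enumerate(label_list, 1)}


def decode(pred_ids, label_map, fallback="O"):
    """Translate ids to labels, mapping unknown ids to the fallback."""
    return [label_map.get(idx, fallback) for idx in pred_ids]


decoded = decode([10, 8, 0, 1, 11], label_map)
```

The fallback only hides the symptom; if the model predicts the padding id on real tokens, the training setup is worth a second look.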
In the pre-trained example provided, try changing the casing in the sentence.
output = model.predict("Steve went to paris")
{'paris': {'tag': 'O', 'confidence': 0.9998948574066162},
'Steve': {'tag': 'B-PER', 'confidence': 0.9998831748962402}}
output = model.predict("steve went to Paris")
{'Paris': {'tag': 'B-LOC', 'confidence': 0.9998199343681335},
'steve': {'tag': 'O', 'confidence': 0.9998823404312134}}
Maybe the model should be trained uncased?
Can I use my own dataset in a format similar to CoNLL-2003 but with only the word and tag columns? The tags differ from those in CoNLL-2003 but still comply with IOB2.
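With a different tag set, the list of labels the model is trained on has to match the tags in the data. A small sketch for collecting that list from a file in the word-and-tag layout described above; the function name, the sample tags, and the assumption that the tag is the last column are all illustrative.

```python
def collect_labels(lines):
    """Gather the distinct tags found in the last column."""
    labels = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("-DOCSTART-"):
            continue
        labels.add(line.split()[-1])
    # "O" first, then the remaining tags in a stable order.
    return ["O"] + sorted(labels - {"O"})


sample = ["Aspirin B-DRUG", "relieves O", "head B-SYMPTOM", "ache I-SYMPTOM", ""]
label_list = collect_labels(sample)
```

Comparing this list with the labels hard-coded in the training script is a quick way to spot tags the model can never predict.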
Hi! Thanks for your work. I'm trying to run the model on my train/test/valid set using the pre-trained weights from your Google Drive link, but I can't find the right parameters to pass when calling the run_ner script.
Hi @kamalkraj, nice work! How can I continue training a pre-trained CoNLL'03 BERT-NER model on a separate dataset? Which POS tagger and chunker should I use to get proper train.txt/valid.txt/test.txt files? How can I start from the checkpoint and avoid retraining the model from scratch?
Hi, thanks for your great work.
I have a question: why do you predict the labels of [CLS] and [SEP] instead of simply masking them? Does it improve performance on the NER task?
In the following code block:
if m and label_map[label_ids[i][j]] != "X":
    temp_1.append(label_map[label_ids[i][j]])
    temp_2.append(label_map[logits[i][j]])
else:
    temp_1.pop()
    temp_2.pop()
    y_true.append(temp_1)
    y_pred.append(temp_2)
    break
why are temp_1 and temp_2 popped if m and label_map[label_ids[i][j]] == "X"?
Shouldn't the code block look more like:
if m:
    if label_map[label_ids[i][j]] != "X":
        temp_1.append(label_map[label_ids[i][j]])
        temp_2.append(label_map[logits[i][j]])
else:
    temp_1.pop()
    temp_2.pop()
    y_true.append(temp_1)
    y_pred.append(temp_2)
    break
Hi, thank you for sharing the model. I am getting an error while training the model:
it says that warmup_linear is not available in the library.
Could you please help me solve this issue?
Regards,
Niranjan
Line 507:
lr_this_step = args.learning_rate * warmup_linear(global_step/num_train_optimization_steps, args.warmup_proportion)
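If the installed library no longer exports warmup_linear, one workaround is to define a small schedule locally and keep the line above unchanged. The function below is a sketch of a linear warmup followed by a linear decay; it is not claimed to be identical to the function that used to ship with the library. The step count and warmup proportion are taken from the logs on this page, while the learning rate of 5e-5 is an assumed value.

```python
def warmup_linear(progress, warmup=0.002):
    """Scale factor: ramps 0 -> 1 during warmup, then decays linearly to 0."""
    if progress < warmup:
        return progress / warmup
    return max(0.0, (1.0 - progress) / (1.0 - warmup))


learning_rate = 5e-5                 # assumed value
num_train_optimization_steps = 2190  # "Num steps" from the training log
warmup_proportion = 0.4              # --warmup_proportion from the command line

lrs = [
    learning_rate * warmup_linear(step / num_train_optimization_steps,
                                  warmup_proportion)
    for step in (0, 438, 876, 2190)
]
```

The four sampled steps cover the start, the middle of the warmup, the peak, and the end of training.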
I noticed attention_mask_label in your experiment code, but why do you set it to None so that it is not used? Is the performance worse if the loss is computed only over the active parts?
Thanks for the great project.
I checked the training data format.
Each line contains the text, the POS tag, the chunk tag (BIO), and the entity tag (BIO scheme).
For example:
-DOCSTART- -X- -X- O
EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O
to TO B-VP O
boycott VB I-VP O
British JJ B-NP B-MISC
lamb NN I-NP O
. . O O
I have 10,000 sentences of my own data. I can do the entity tagging for my dataset, but how can I produce the POS and chunk (BIO) tags? Is there a Python library for this, especially for the chunk tags?
Thanks
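If the data loader only reads the first and last column of each line, which is an assumption worth confirming in the data-reading code before relying on it, the POS and chunk columns can be filled with placeholders and no tagger is needed. The sketch below writes word/tag pairs in the four-column layout shown above; the function name and the -X- placeholder choice are illustrative.

```python
def to_conll_lines(sentences, pos_placeholder="-X-", chunk_placeholder="-X-"):
    """Render (word, tag) pairs as four-column CoNLL-style lines."""
    lines = ["-DOCSTART- -X- -X- O", ""]
    for sentence in sentences:
        for word, tag in sentence:
            lines.append(f"{word} {pos_placeholder} {chunk_placeholder} {tag}")
        lines.append("")  # a blank line ends the sentence
    return lines


my_data = [[("EU", "B-ORG"), ("rejects", "O")]]
conll_lines = to_conll_lines(my_data)
```

Joining conll_lines with newlines gives text that can be written to train.txt, valid.txt, or test.txt.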
Is it possible to implement parallelized workers for the NER task, like in this repo? This does not have support for PyTorch models.
Any suggestions?
My requirement is this: given a sentence (sequence), I would like to extract only the entities present in it, without classifying them into types as the NER task does. I see that BertForTokenClassification for NER does the classification. Can it be adapted for extraction only?
Can you give me an idea of how to do entity extraction/identification using BERT?
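One simple route, sketched here as an idea and not as a feature of this repository, is to keep the token-classification setup and collapse the typed tags to plain B/I/O before training, so the model only learns entity boundaries. The helper name is hypothetical.

```python
def strip_entity_type(tag):
    """Map 'B-ORG' to 'B', 'I-PER' to 'I', and leave 'O' unchanged."""
    return tag.split("-", 1)[0]


typed = ["B-ORG", "O", "B-MISC", "I-MISC", "O"]
untyped = [strip_entity_type(t) for t in typed]
```

The label list used for training would then shrink to B, I, and O plus any special markers the script needs.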
Hey @kamalkraj, thanks for your work.
I am trying to run this code on my AMD Vega 10 XT. Is there any way you can help me, given that your code targets CUDA?
Thanks in advance.
Nice work! Do you plan to make a license available for the code and pretrained model?
Hey,
I was looking at the code and noticed what looks like an error in the order of the parameters passed to BERT on line 486 of the run_ner.py file.
BertForTokenClassification (https://huggingface.co/pytorch-transformers/model_doc/bert.html#pytorch_transformers.BertForTokenClassification) expects the order input_ids, attention_mask, token_type_ids, position_ids, head_mask, labels, whereas in the code, as shown below, segment_ids (== token_type_ids) is passed before input_mask (== attention_mask). Also, label_ids (== labels) and l_mask (== position_ids) should be switched.
Lines 485-486:
input_ids, input_mask, segment_ids, label_ids, valid_ids,l_mask = batch
loss = model(input_ids, segment_ids, input_mask, label_ids,valid_ids,l_mask)
The same applies at line 555.
Correct me if I've misunderstood.
Hi, I would like to recognize entities that are made up of more than one word: e.g.
Stephen King
The Art of War
United States of America
etc...
Your program splits each word making this impossible. Any workaround for this?
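A possible workaround is to merge the per-word output afterwards: consecutive words tagged B-X followed by I-X belong to the same entity. The sketch below assumes IOB2 tags and word/tag lists of equal length; merge_entities is a hypothetical helper, not part of the repository.

```python
def merge_entities(words, tags):
    """Group IOB2-tagged words into (phrase, entity_type) spans."""
    spans, current_words, current_type = [], [], None
    for word, tag in zip(words, tags):
        prefix, _, entity_type = tag.partition("-")
        continues = prefix == "I" and entity_type == current_type
        if current_words and not continues:
            # The open span ends before this word.
            spans.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
        if prefix == "B" or (prefix == "I" and not current_words):
            # An "I-" tag with no open span is treated as a span start.
            current_words, current_type = [word], entity_type
        elif continues:
            current_words.append(word)
    if current_words:
        spans.append((" ".join(current_words), current_type))
    return spans


words = ["Stephen", "King", "wrote", "in", "the",
         "United", "States", "of", "America"]
tags = ["B-PER", "I-PER", "O", "O", "O",
        "B-LOC", "I-LOC", "I-LOC", "I-LOC"]
entities = merge_entities(words, tags)
```

This only works if the model actually assigns I- tags to the inner words, so titles such as "The Art of War" depend on how MISC entities were annotated in the training data.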
Is this using the older BERT version or BERT-NER Version 2 ? Thanks
After training the model using the default parameters, the result is
             precision    recall  f1-score   support
        ORG     0.7019    0.7685    0.7337       337
        LOC     0.8190    0.8854    0.8509       419
       MISC     0.8188    0.8278    0.8233       273
        PER     0.7619    0.7453    0.7535       322
avg / total     0.7761    0.8113    0.7929      1351
which is different from the README results.
Hi @kamalkraj!
Nice repo.
If a sentence is longer than 128 tokens, how do you predict NER tags for it,
especially for test data?
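One common workaround for long inputs, not something this repository is confirmed to do, is to split the token list into overlapping windows that each fit the length limit and predict on every window separately. The sketch assumes a budget of 126 tokens per window, leaving room for [CLS] and [SEP] within 128, and a stride of half a window; both numbers and the function name are illustrative.

```python
def split_into_windows(tokens, max_tokens=126, stride=63):
    """Split a long token list into overlapping windows."""
    if len(tokens) <= max_tokens:
        return [tokens]
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # the last window reached the end of the sentence
        start += stride
    return windows


tokens = [f"w{i}" for i in range(300)]
windows = split_into_windows(tokens)
```

Predictions for tokens that fall in two windows still have to be reconciled, for example by preferring the window where the token is farther from an edge.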
Hi @kamalkraj, nice work ! It helps me a lot.
I'm wondering whether this dataset is the CoNLL-03 dataset.
After downloading your pretrained model from the master branch, I ran this command:
'''
python run_ner.py --data_dir=data/ --bert_model=bert-base-cased --task_name=ner --output_dir=out --max_seq_length=128 --num_train_epochs 5 --do_eval --warmup_proportion=0.4
'''
I got an F1 score of 90.9 on the test set, which is far from what you reported. Could you help with my issue? Thanks!
Hi @kamalkraj, nice work! I noticed that training in the experiment branch is much slower than in the master branch, which might be caused by the two for loops in the forward pass:
for i in range(batch_size):
    jj = -1
    for j in range(max_len):
        if valid_ids[i][j].item() == 1:
            jj += 1
            valid_output[i][jj] = sequence_output[i][j]
Instead, we could use
valid_mask = valid_ids.eq(1)
for i in range(batch_size):
    valid_mask_b = valid_mask[i]
    mask_len = torch.sum(valid_mask_b)
    valid_output[i, :mask_len] = sequence_output[i][valid_mask_b]
which gives the same performance at a speed similar to the master branch.
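The two variants can be checked against each other on a tiny input. The sketch below uses nested Python lists as stand-ins for the tensors, an assumption made so that it runs without PyTorch. The function names `compact_loop` and `compact_masked` are hypothetical:

```python
def compact_loop(sequence_output, valid_ids, pad=0):
    """Double loop: copy each valid position to the next free slot of its row."""
    batch_size, max_len = len(valid_ids), len(valid_ids[0])
    valid_output = [[pad] * max_len for _ in range(batch_size)]
    for i in range(batch_size):
        jj = -1
        for j in range(max_len):
            if valid_ids[i][j] == 1:
                jj += 1
                valid_output[i][jj] = sequence_output[i][j]
    return valid_output


def compact_masked(sequence_output, valid_ids, pad=0):
    """Masked variant: select all valid positions of a row in one step."""
    batch_size, max_len = len(valid_ids), len(valid_ids[0])
    valid_output = [[pad] * max_len for _ in range(batch_size)]
    for i in range(batch_size):
        selected = [x for x, keep in zip(sequence_output[i], valid_ids[i])
                    if keep == 1]
        valid_output[i][:len(selected)] = selected
    return valid_output


sequence_output = [[10, 11, 12, 13], [20, 21, 22, 23]]
valid_ids = [[1, 0, 1, 1], [1, 1, 0, 0]]
loop_result = compact_loop(sequence_output, valid_ids)
masked_result = compact_masked(sequence_output, valid_ids)
# Both give [[10, 12, 13, 0], [20, 21, 0, 0]]
```

Both versions move the valid positions to the front of each row and leave padding behind them. The masked version replaces the per-element inner loop with a single selection per row.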
Hi Kamal, thank you for sharing the code. I installed pytorch-pretrained-bert version 0.6.2 on my PC and ran your code. I am getting the error below while executing it. Can you please guide me on how to solve this issue with the latest version, 0.6.2?
from pytorch_pretrained_bert.optimization import BertAdam, warmup_linear
ImportError: cannot import name 'warmup_linear' from 'pytorch_pretrained_bert.optimization' (C:\ProgramData\Anaconda3\lib\site-packages\pytorch_pretrained_bert\optimization.py)
lr_this_step = args.learning_rate * warmup_linear(global_step/num_train_optimization_steps, args.warmup_proportion)
Is there another way we can change the code?
Regards,
Niranjan
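If the import fails because `warmup_linear` is no longer exported, one workaround is to drop it from the import line and define an equivalent schedule locally. The following is a minimal sketch, assuming the schedule ramps linearly from 0 to 1 during warmup and then decays linearly. The exact formula in the library version the script was written against is an assumption here, and the numeric values are placeholders:

```python
def warmup_linear(x, warmup=0.002):
    """Learning-rate multiplier; x is the fraction of training completed."""
    if x < warmup:
        return x / warmup  # linear ramp-up during the warmup phase
    return max(1.0 - x, 0.0)  # linear decay afterwards, clamped at zero


# Placeholder values standing in for args.learning_rate and friends.
learning_rate = 5e-5
warmup_proportion = 0.1
num_train_optimization_steps = 2190
global_step = 1095  # halfway through training

lr_this_step = learning_rate * warmup_linear(
    global_step / num_train_optimization_steps, warmup_proportion
)
```

Newer releases of the library appear to expose schedule classes instead of this function, so checking the installed `optimization.py` for the current names is worthwhile before copying this in.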
Hi @kamalkraj, nice work ! It helps me a lot.
I'm wondering why the support for the validation and test datasets in this branch's results is much smaller than in the other branch's results.
I used your pretrained model on this dataset, but only got F1 = 0.9078 on test.txt.
I have a custom dataset with around 8 entity types. I combined it with the conll2003 dataset (as I am also interested in the conll2003 entities) and trained the model. However, the trained model is unable to predict any entities outside conll2003. Could you please tell me whether I am missing anything when training on a custom dataset?
I used the command below to train the model.
nohup python3 run_ner.py --data_dir=data --bert_model=bert-large-cased --task_name=ner --output_dir=out_bert_large --max_seq_length=128 --num_train_epochs 10 --do_train --do_eval --no_cuda --warmup_proportion=0.4 > log.txt &
Hey Kamal,
Great work on the repo and major thanks for hosting the trained model.
Would it be possible to add N-Gram support to this model, for say, 'New York' instead of detecting on 'New' and then 'York'?
Also, any plans to train a larger model for QA or NER? Say RoBERTa or XLNet?
Thanks, Kamal, for the wonderful work.
I am looking for some help on how to keep track of coreference between entities in a document. For example, the person name James Paul appears 10 times in the document (as any one of James, Paul, or James Paul). Can you suggest some ideas on how to group all the mentions together? It can get tricky if the document has two people, say James Paul and James Real: how would one find which James is being referred to?
Sorry, just looking for some help.
Thanks
The model fails to respond when the text contains quotes.
For example,
{
"text" : "Steve went to Paris. He said, "Paris is amazing city." "
}
It doesn't work whenever the text has single or double quotes. I need to keep the quotes in the text while running it through the model, so I can't remove them in preprocessing either.
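The example payload above is not valid JSON, because the inner double quotes end the string early. If the request body is built by hand, that alone could explain the failure. This is an assumption, since the issue does not show the server side. A minimal sketch of letting `json.dumps` do the escaping:

```python
import json

text = 'Steve went to Paris. He said, "Paris is amazing city."'

# json.dumps escapes the inner double quotes as \" so the payload stays valid.
payload = json.dumps({"text": text})

# Round trip: the quotes are preserved exactly after parsing.
restored = json.loads(payload)["text"]
```

The quotes stay in the text that reaches the model. Only the transport encoding changes.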