tomaarsen / SpanMarkerNER

SpanMarker for Named Entity Recognition
Home Page: https://tomaarsen.github.io/SpanMarkerNER/
License: Apache License 2.0
Running the training script below gives this error:
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
from datasets import load_dataset, Dataset

dataset = load_dataset("json", data_files=["output.jsonl"])

from span_marker import SpanMarkerModel

model = SpanMarkerModel.from_pretrained(
    "bert-base-uncased",  # Example encoder
    labels=['O', 'Degree', 'Years_of_Experience', 'Email_Address',
            'College_Name', 'Location', 'Designation', 'Graduation_Year', 'Skills', 'Name',
            'Companies_worked_at'],
    max_prev_context=2,
    max_next_context=2,
)
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="models/RUDYRDX-NER-1",
    learning_rate=1e-5,
    gradient_accumulation_steps=2,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=1,
    evaluation_strategy="steps",
    save_strategy="steps",
    eval_steps=500,
    push_to_hub=False,
    logging_steps=50,
    fp16=True,
    warmup_ratio=0.1,
)

from span_marker import Trainer

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset['train'],
)
trainer.train() # error happens when this runs
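A hedged debugging step that may help here: CUDA device-side asserts during training are often out-of-range label indices, so it is worth checking that every tag string used in the data appears in the label list given to the model. A minimal sketch, reusing `dataset` from above:

# Sanity check (not from the original report): flag any tags in the data
# that are missing from the label list passed to from_pretrained.
label_list = ['O', 'Degree', 'Years_of_Experience', 'Email_Address',
              'College_Name', 'Location', 'Designation', 'Graduation_Year',
              'Skills', 'Name', 'Companies_worked_at']
tags_in_data = {tag for sample in dataset["train"] for tag in sample["ner_tags"]}
missing = tags_in_data - set(label_list)
if missing:
    print("Tags in the data but missing from the label list:", missing)

Note that the sample rows below use tags with spaces (e.g. "Email Address") while the label list uses underscores ("Email_Address"); a check like this would flag that mismatch.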
{"document_id": 0, "sentence_id": 0, "tokens": ["Govardhana", "K", "Senior", "Software", "Engineer", "Bengaluru", "Karnataka", "Karnataka", "-", "Email", "Indeed", ":", "indeed.com/r/Govardhana-K/", "b2de315d95905b68", "Total", "experience", "5", "Years", "6", "Months", "Cloud", "Lending", "Solutions", "INC", "4", "Month", "Salesforce", "Developer", "Oracle", "5", "Years", "2", "Month", "Core", "Java", "Developer", "Languages", "Core", "Java", "Go", "Lang", "Oracle", "PL-SQL", "programming", "Sales", "Force", "Developer", "APEX", "."], "ner_tags": ["Name", "Designation", "Designation", "Designation", "O", "O", "O", "O", "O", "O", "O", "O", "Email Address", "Email Address", "Email Address", "O", "O", "O", "O", "O", "O", "Companies worked at", "Companies worked at", "Companies worked at", "Companies worked at", "O", "O", "O", "O", "O", "Companies worked at", "Companies worked at", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"]}
{"document_id": 0, "sentence_id": 1, "tokens": ["Designations", "&", "Promotions", "Willing", "relocate", ":", "Anywhere", "WORK", "EXPERIENCE", "Senior", "Software", "Engineer", "Cloud", "Lending", "Solutions", "-", "Bangalore", "Karnataka", "-", "January", "2018", "Present", "Present", "Senior", "Consultant", "Oracle", "-", "Bangalore", "Karnataka", "-", "November", "2016", "December", "2017", "Staff", "Consultant", "Oracle", "-", "Bangalore", "Karnataka", "-", "January", "2014", "October", "2016", "Associate", "Consultant", "Oracle", "-", "Bangalore", "Karnataka", "-", "November", "2012", "December", "2013", "EDUCATION", "B.E", "Computer", "Science", "Engineering", "Adithya", "Institute", "Technology", "-", "Tamil", "Nadu", "September", "2008", "June", "2012", "https", ":", "//www.indeed.com/r/Govardhana-K/b2de315d95905b68", "?", "isid=rex-download", "&", "ikw=download-top", "&", "co=IN", "https", ":", "//www.indeed.com/r/Govardhana-K/b2de315d95905b68", "?", "isid=rex-download", "&", "ikw=download-top", "&", "co=IN", "SKILLS", "APEX", "."], "ner_tags": ["Designation", "Designation", "Designation", "Designation", "Location", "Location", "O", "O", "O", "O", "O", "Email Address", "Email Address", "Email Address", "Email Address", "Email Address", "Email Address", "O", "O", "O", "O", "O", "O", "Companies worked at", "Companies worked at", "Companies worked at", "O", "O", "O", "O", "O", "Companies worked at", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "Companies worked at", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "Designation", "Designation", "Designation", "Companies worked at", "Companies worked at", "Companies worked at", "Companies worked at", "O", "O", "O", "O", "O", "O", "Designation", "Designation", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "Designation", "Designation", "Designation"]}
{"document_id": 0, "sentence_id": 2, "tokens": ["(", "Less", "1", "year", ")", "Data", "Structures", "(", "3", "years", ")", "FLEXCUBE", "(", "5", "years", ")", "Oracle", "(", "5", "years", ")", "Algorithms", "(", "3", "years", ")", "LINKS", "https", ":", "//www.linkedin.com/in/govardhana-k-61024944/", "ADDITIONAL", "INFORMATION", "Technical", "Proficiency", ":", "Languages", ":", "Core", "Java", "Go", "Lang", "Data", "Structures", "&", "Algorithms", "Oracle", "PL-SQL", "programming", "Sales", "Force", "APEX", "."], "ner_tags": ["Name", "Name", "Name", "Designation", "Designation", "Designation", "Designation", "Designation", "Designation", "Location", "Location", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "Email Address", "Email Address", "Email Address", "Email Address", "Email Address", "Email Address", "Email Address", "Email Address", "O", "Companies worked at", "Companies worked at", "O", "O", "O", "O", "O", "O", "Companies worked at", "Companies worked at", "O", "O", "O", "O", "O", "O", "O", "O", "O", "Companies worked at", "O", "O"]}
{"document_id": 0, "sentence_id": 3, "tokens": ["Tools", ":", "RADTool", "Jdeveloper", "NetBeans", "Eclipse", "SQL", "developer", "PL/SQL", "Developer", "WinSCP", "Putty", "Web", "Technologies", ":", "JavaScript", "XML", "HTML", "Webservice", "Operating", "Systems", ":", "Linux", "Windows", "Version", "control", "system", "SVN", "&", "Git-Hub", "Databases", ":", "Oracle", "Middleware", ":", "Web", "logic", "OC4J", "Product", "FLEXCUBE", ":", "Oracle", "FLEXCUBE", "Versions", "10.x", "11.x", "12.x", "https", ":", "//www.linkedin.com/in/govardhana-k-61024944/"], "ner_tags": ["Name", "Name", "Designation", "Designation", "Designation", "Location", "O", "O", "O", "O", "O", "O", "O", "Email Address", "Email Address", "Email Address", "Email Address", "Email Address", "O", "O", "O", "O", "O", "O", "Companies worked at", "Companies worked at", "Companies worked at", "O", "O", "O", "O", "O", "O", "Companies worked at", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "Companies worked at", "O", "O", "O", "O"]}
Hi,
Do you have any clues about implementing the relation identification part that follows entity recognition?
I am looking into the source code and thinking about implementing it myself; any suggestions on how to approach it would be helpful.
Thanks
Hi Tom, I was wondering how I could make this work for overlapping spans of different entity types, and how to extend it to relation extraction as well.
Any help or direction would be super helpful.
Hello
This is really amazing work, but I wonder: is there a minimum number of tokens required?
When I tried inference on a short sentence or question, it just returned empty JSON.
class SpanMarkerTokenizer:
    def __init__(self, tokenizer: PreTrainedTokenizer, config: SpanMarkerConfig, **kwargs) -> None:
        self.tokenizer = tokenizer
        self.config = config

        tokenizer.add_tokens(["<start>", "<end>"], special_tokens=True)
        self.start_marker_id, self.end_marker_id = self.tokenizer.convert_tokens_to_ids(["<start>", "<end>"])
Hi,
In the above code, you defined the `<start>` and `<end>` tokens for the tokenizer, but I couldn't find where they are used. Were they potentially intended for relation extraction?
Really great library @tomaarsen, thank you for this great contribution!!
It would be really convenient to have the possibility to give a list of class candidates to the predict method during inference. This would be useful for Retriever-Reader systems, where the Retriever (e.g. a SetFit model) returns text sequences for which the set of classes available to the Reader (e.g. SpanMarker) is already known, and you do not want to extract other classes.
E.g. a system like the one described at https://lilianweng.github.io/posts/2020-10-29-odqa/.
I was thinking about modifications to the predict method like these:
def predict(self, ..., class_candidates: Optional[List[str]] = None):
    ...
    if class_candidates is not None:
        # Convert class names to class ids
        label2id = self.config.label2id
        class_candidate_ids = [label2id[c] for c in class_candidates if c in label2id]
    for batch_start_idx in trange(0, len(dataset), batch_size, leave=True, disable=not show_progress_bar):
        ...
        # Computing probabilities based on the logits
        probs = output.logits.softmax(-1)
        # Mask everything except class-candidate probabilities
        if class_candidates is not None:
            mask = torch.zeros_like(probs)
            mask[:, :, class_candidate_ids] = 1
            probs = probs * mask
        # Get the labels and the corresponding probability scores
        scores, labels = probs.max(-1)
        ...
    return all_entities
I did not find time to have a deep dive, implement & test it, but I think this could be a useful feature.
Was wondering if there is a way to load a model in a kaggle notebook that I trained myself. There's currently a NER competition going on, and I wanted to try using the SpanMarker library to compete. Training went fine, but now to submit, I need to have the kaggle notebook have internet disabled. When trying to load my checkpoint, I get this error:
model_checkpoint = "/kaggle/input/pii-train-1-cp3000/Kaggle Checkpoints/checkpoint 3000"
model = SpanMarkerModel.from_pretrained(
    model_checkpoint,
    local_files_only=True,
    labels=[
        '1-EMAIL', '1-ID_NUM', '1-NAME_STUDENT', '1-PHONE_NUM', '1-STREET_ADDRESS',
        '1-URL_PERSONAL', '1-USERNAME', '2-ID_NUM', '2-NAME_STUDENT', '2-PHONE_NUM',
        '2-STREET_ADDRESS', '2-URL_PERSONAL', 'O'
    ],
)
OSError: We couldn't connect to 'https://huggingface.co/' to load this file, couldn't find it in the cached files and it looks like bert-base-uncased is not the path to a directory containing a file named config.json.
Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/transformers/installation#offline-mode'.
Kaggle notebook here: https://www.kaggle.com/jdonnelly0804/pii-infer
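A possible workaround sketch, not a verified fix: the error suggests something in the checkpoint still resolves "bert-base-uncased" against the Hub. Assuming the SpanMarker config nests the encoder config under an "encoder" key with a "_name_or_path" entry, rewriting that reference to a locally attached copy of the encoder may avoid the network call:

import json
import shutil

from span_marker import SpanMarkerModel

# Copy the checkpoint out of Kaggle's read-only input directory.
src = "/kaggle/input/pii-train-1-cp3000/Kaggle Checkpoints/checkpoint 3000"
dst = "/kaggle/working/checkpoint-3000"
shutil.copytree(src, dst)

# Point the encoder reference at a locally attached bert-base-uncased copy.
# The "encoder"/"_name_or_path" keys are an assumption about the config layout,
# and "/kaggle/input/bert-base-uncased" is a hypothetical attached dataset.
with open(f"{dst}/config.json") as f:
    cfg = json.load(f)
if isinstance(cfg.get("encoder"), dict) and cfg["encoder"].get("_name_or_path") == "bert-base-uncased":
    cfg["encoder"]["_name_or_path"] = "/kaggle/input/bert-base-uncased"
with open(f"{dst}/config.json", "w") as f:
    json.dump(cfg, f)

model = SpanMarkerModel.from_pretrained(dst, local_files_only=True)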
The AutoLabelNormalizer infers the scheme based on the presence of all the tag prefixes - i.e. BILOU is assumed if there's at least one of each of ['B','I','L','O','U']. There doesn't appear to be any way of passing a specific LabelNormalizer to the trainer.
My issue with this is that my dataset contains only BILO - i.e., there are no singletons. But the nature of my problem means I need to use a scheme that has B and L tags.
Because my dataset has no U tags, SpanMarker errors, since the set BILO doesn't match any of the cases that AutoLabelNormalizer checks. Perhaps allow a LabelNormalizer to be passed as an argument, or relax the conditions, e.g.:

if (tags == set("BILOU")) or (tags == set("BILO")):
    return LabelNormalizerBILOU(config)
Hello Tom.
I have found a problem when loading the tokenizer directly from the SpanMarkerTokenizer class.
I have tested it with several repo ids and the response is the same in all of them.
Platform: MacOS Sonoma 14.0, M1 Pro
Python 3.11.5
transformers==4.35.0
span_marker==1.5.0
tokenizers==0.14.1
# Get the tokenizer
tokenizer = SpanMarkerTokenizer.from_pretrained("tomaarsen/span-marker-mbert-base-multinerd")
Downloading (…)okenizer_config.json: 100%|███| 343/343 [00:00<00:00, 834kB/s]
Downloading (…)solve/main/vocab.txt: 100%|█| 996k/996k [00:00<00:00, 3.36MB/s
Downloading (…)/main/tokenizer.json: 100%|█| 2.92M/2.92M [00:00<00:00, 5.46MB
Downloading (…)in/added_tokens.json: 100%|█| 43.0/43.0 [00:00<00:00, 114kB/s]
Downloading (…)cial_tokens_map.json: 100%|███| 125/125 [00:00<00:00, 384kB/s]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/polodealvarado/Desktop/github_projects/SpanMarkerNER/span_marker/tokenizer.py", line 285, in from_pretrained
return cls(tokenizer, config=config, **kwargs)
File "/Users/polodealvarado/Desktop/github_projects/SpanMarkerNER/span_marker/tokenizer.py", line 156, in __init__
self.tokenizer.model_max_length, self.config.model_max_length or self.config.model_max_length_default
AttributeError: 'NoneType' object has no attribute 'model_max_length'
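The traceback shows `config` ending up as None inside `__init__`, so passing the config explicitly is a plausible workaround. A sketch; whether `from_pretrained` accepts a `config` keyword in your version is an assumption, suggested by its internal call `cls(tokenizer, config=config, **kwargs)`:

from span_marker.configuration import SpanMarkerConfig
from span_marker.tokenizer import SpanMarkerTokenizer

repo_id = "tomaarsen/span-marker-mbert-base-multinerd"
# Resolve the model's config ourselves, then hand it to the tokenizer.
config = SpanMarkerConfig.from_pretrained(repo_id)
tokenizer = SpanMarkerTokenizer.from_pretrained(repo_id, config=config)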
Using your spaCy integration, the sentencizer will sometimes produce an empty sentence (using "en_core_web_sm"). This leads to the SpanMarkerTokenizer throwing an exception. Not sure how active this project still is, but this seems like an easy fix. Is there a workaround for this already? Would you like the code updated to include one? (I might be able to do this fix.)
When I place a single word in the first index of the list, it leads to the above error. However, if I put it in any index other than the first, no error occurs.
from span_marker import SpanMarkerModel
model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-bert-base-fewnerd-fine-super")
entities = model.predict(['Avolon', 'Walmart - Milwaukee, WI'])  # error
entities = model.predict(['Walmart - Milwaukee, WI', 'Avolon'])  # no error
Can you please help me here @tomaarsen
I was hoping this would work for optimizing the underlying model.encoder, since it is independent(?) of the rest, but I'm getting a shape error:
RuntimeError: shape '[1, 512]' is invalid for input of size 262144
i.e. the reshape expects a [1, 512] attention_mask, which fails because the attention_mask here is shaped [512, 512] (262144 elements).
Here is the test code:
from span_marker import SpanMarkerModel
from optimum.bettertransformer import BetterTransformer
# Download from the 🤗 Hub
model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-bert-base-fewnerd-fine-super").eval()
# Run inference
entities = model.predict("Amelia Earhart flew her single engine Lockheed Vega 5B across the Atlantic to Paris.")
print(entities) # works
better_encoder = BetterTransformer.transform(model.encoder)
model.encoder = better_encoder
# Run inference
entities = model.predict("Amelia Earhart flew her single engine Lockheed Vega 5B across the Atlantic to Paris.")
print(entities)
│ /Users/ceyda.1/miniconda/lib/python3.9/site-packages/span_marker/modeling.py:137 in forward │
│ │
│ 134 │ │ │ SpanMarkerOutput: The output dataclass. │
│ 135 │ │ """ │
│ 136 │ │ token_type_ids = torch.zeros_like(input_ids) │
│ ❱ 137 │ │ outputs = self.encoder( │
│ 138 │ │ │ input_ids, │
│ 139 │ │ │ attention_mask=attention_mask, │
│ 140 │ │ │ token_type_ids=token_type_ids, │
│ │
│ /Users/ceyda.1/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py:1501 in │
│ _call_impl │
│ │
│ 1498 │ │ if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks │
│ 1499 │ │ │ │ or _global_backward_pre_hooks or _global_backward_hooks │
│ 1500 │ │ │ │ or _global_forward_hooks or _global_forward_pre_hooks): │
│ ❱ 1501 │ │ │ return forward_call(*args, **kwargs) │
│ 1502 │ │ # Do not call functions when jit is used │
│ 1503 │ │ full_backward_hooks, non_full_backward_hooks = [], [] │
│ 1504 │ │ backward_pre_hooks = [] │
│ │
│ /Users/ceyda.1/miniconda/lib/python3.9/site-packages/transformers/models/bert/modeling_bert.py:1 │
│ 020 in forward │
│ │
│ 1017 │ │ │ inputs_embeds=inputs_embeds, │
│ 1018 │ │ │ past_key_values_length=past_key_values_length, │
│ 1019 │ │ ) │
│ ❱ 1020 │ │ encoder_outputs = self.encoder( │
│ 1021 │ │ │ embedding_output, │
│ 1022 │ │ │ attention_mask=extended_attention_mask, │
│ 1023 │ │ │ head_mask=head_mask, │
│ │
│ /Users/ceyda.1/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py:1501 in │
│ _call_impl │
│ │
│ 1498 │ │ if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks │
│ 1499 │ │ │ │ or _global_backward_pre_hooks or _global_backward_hooks │
│ 1500 │ │ │ │ or _global_forward_hooks or _global_forward_pre_hooks): │
│ ❱ 1501 │ │ │ return forward_call(*args, **kwargs) │
│ 1502 │ │ # Do not call functions when jit is used │
│ 1503 │ │ full_backward_hooks, non_full_backward_hooks = [], [] │
│ 1504 │ │ backward_pre_hooks = [] │
│ │
│ /Users/ceyda.1/miniconda/lib/python3.9/site-packages/transformers/models/bert/modeling_bert.py:6 │
│ 10 in forward │
│ │
│ 607 │ │ │ │ │ encoder_attention_mask, │
│ 608 │ │ │ │ ) │
│ 609 │ │ │ else: │
│ ❱ 610 │ │ │ │ layer_outputs = layer_module( │
│ 611 │ │ │ │ │ hidden_states, │
│ 612 │ │ │ │ │ attention_mask, │
│ 613 │ │ │ │ │ layer_head_mask, │
│ │
│ /Users/ceyda.1/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py:1501 in │
│ _call_impl │
│ │
│ 1498 │ │ if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks │
│ 1499 │ │ │ │ or _global_backward_pre_hooks or _global_backward_hooks │
│ 1500 │ │ │ │ or _global_forward_hooks or _global_forward_pre_hooks): │
│ ❱ 1501 │ │ │ return forward_call(*args, **kwargs) │
│ 1502 │ │ # Do not call functions when jit is used │
│ 1503 │ │ full_backward_hooks, non_full_backward_hooks = [], [] │
│ 1504 │ │ backward_pre_hooks = [] │
│ │
│ /Users/ceyda.1/miniconda/lib/python3.9/site-packages/optimum/bettertransformer/models/encoder_mo │
│ dels.py:246 in forward │
│ │
│ 243 │ │ │ # attention mask comes in with values 0 and -inf. we convert to torch.nn.Tra │
│ 244 │ │ │ # 0->false->keep this token -inf->true->mask this token │
│ 245 │ │ │ attention_mask = attention_mask.bool() │
│ ❱ 246 │ │ │ attention_mask = torch.reshape(attention_mask, (attention_mask.shape[0], att │
│ 247 │ │ │ hidden_states = torch._nested_tensor_from_mask(hidden_states, ~attention_mas │
│ 248 │ │ │ attention_mask = None │
│ 249 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: shape '[1, 512]' is invalid for input of size 262144
transformers version: 4.29.2
Not sure what works better during inference (individual sentences, or longer segments in larger batches), but maybe something like this could work:
def pipe(self, stream, batch_size=128, include_sent=None):
    """Predict the class for a spaCy Doc stream.

    Args:
        stream (Doc): a spaCy doc

    Returns:
        Doc: spaCy doc with SpanMarker entities
    """
    if isinstance(stream, str):
        stream = [stream]

    if not isinstance(stream, types.GeneratorType):
        stream = self.nlp.pipe(stream, batch_size=batch_size)

    for docs in util.minibatch(stream, size=batch_size):
        batch_results = self.model.predict(docs)

        for doc, prediction in zip(docs, batch_results):
            yield self.post_process_batch(doc, prediction)
When one of the elements in the training set is empty, it ends up throwing a confusing error:
Label normalizing the train dataset: 100%|██████████████████████████████████████████████████████████████████████| 8324/8324 [00:00<00:00, 34016.14 examples/s]
Tokenizing the train dataset: 96%|██████████████████████████████████████████████████████████████████████████▉ | 8000/8324 [00:04<00:00, 1665.71 examples/s]c:\code\span-marker-ner\span_marker\tokenizer.py:204: RuntimeWarning: All-NaN slice encountered
num_words = int(np.nanmax(np.array(batch_encoding.word_ids(sample_idx), dtype=float))) + 1
Tokenizing the train dataset: 96%|██████████████████████████████████████████████████████████████████████████▉ | 8000/8324 [00:04<00:00, 1612.60 examples/s]
This SpanMarker model will ignore 3.181189% of all annotated entities in the train dataset. This is caused by the SpanMarkerModel maximum entity length of 5 words and the maximum model input length of 256 tokens.
These are the frequencies of the missed entities due to maximum entity length out of 18798 total entities:
- 203 missed entities with 6 words (1.079902%)
- 81 missed entities with 7 words (0.430897%)
- 58 missed entities with 8 words (0.308543%)
- 29 missed entities with 9 words (0.154272%)
- 5 missed entities with 10 words (0.026599%)
- 9 missed entities with 11 words (0.047877%)
- 8 missed entities with 12 words (0.042558%)
- 1 missed entities with 13 words (0.005320%)
- 1 missed entities with 14 words (0.005320%)
- 1 missed entities with 15 words (0.005320%)
- 2 missed entities with 16 words (0.010639%)
- 1 missed entities with 17 words (0.005320%)
Additionally, a total of 199 (1.058623%) entities were missed due to the maximum input length.
Traceback (most recent call last):
File "c:\code\span-marker-ner\demo_conll2002.py", line 83, in <module>
main()
File "c:\code\span-marker-ner\demo_conll2002.py", line 72, in main
trainer.train()
File "C:\Users\tom\.conda\envs\span-marker-ner\lib\site-packages\transformers\trainer.py", line 1553, in train
return inner_training_loop(
File "C:\Users\tom\.conda\envs\span-marker-ner\lib\site-packages\transformers\trainer.py", line 1567, in _inner_training_loop
train_dataloader = self.get_train_dataloader()
File "c:\code\span-marker-ner\span_marker\trainer.py", line 423, in get_train_dataloader
self.train_dataset = self.preprocess_dataset(self.train_dataset, self.label_normalizer, self.tokenizer)
File "c:\code\span-marker-ner\span_marker\trainer.py", line 241, in preprocess_dataset
dataset = dataset.map(
File "C:\Users\tom\.conda\envs\span-marker-ner\lib\site-packages\datasets\arrow_dataset.py", line 592, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "C:\Users\tom\.conda\envs\span-marker-ner\lib\site-packages\datasets\arrow_dataset.py", line 557, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "C:\Users\tom\.conda\envs\span-marker-ner\lib\site-packages\datasets\arrow_dataset.py", line 3097, in map
for rank, done, content in Dataset._map_single(**dataset_kwargs):
File "C:\Users\tom\.conda\envs\span-marker-ner\lib\site-packages\datasets\arrow_dataset.py", line 3474, in _map_single
batch = apply_function_on_filtered_inputs(
File "C:\Users\tom\.conda\envs\span-marker-ner\lib\site-packages\datasets\arrow_dataset.py", line 3353, in apply_function_on_filtered_inputs
processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
File "c:\code\span-marker-ner\span_marker\tokenizer.py", line 204, in __call__
num_words = int(np.nanmax(np.array(batch_encoding.word_ids(sample_idx), dtype=float))) + 1
ValueError: cannot convert float NaN to integer
Perhaps a cleaner error can be designed here.
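For example, a guard along these lines (a sketch with hypothetical naming, not the library's actual code) could fail fast with an actionable message before tokenization:

def check_no_empty_samples(dataset) -> None:
    """Raise a clear error if any sample has an empty `tokens` list."""
    empty_indices = [idx for idx, tokens in enumerate(dataset["tokens"]) if not tokens]
    if empty_indices:
        raise ValueError(
            f"Found {len(empty_indices)} samples with an empty 'tokens' list "
            f"(first few indices: {empty_indices[:5]}); remove these before training."
        )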
Hi there @tomaarsen!
It seems that the Hugging Face Space linked as 🤗 Space is not working: it throws an HTTP 404 error, and it appears to no longer be in your Hugging Face Hub account.
I just wanted to point that out in case you want to have a running example besides the free Inference API; otherwise, I guess you're good removing it!
Hey Tom, thank you for this amazing framework!
May I ask why you do not require any pre-tokenization (e.g., subword-tokenization, setting special tokens to -100) based on the chosen model using AutoTokenizer.from_pretrained()?
Might pre-tokenization improve or rather degrade model performance?
Best regards,
Daniel
I have created a pipeline like so:
self.model = spacy.load("en_core_web_md", disable=[
"tagger",
"lemmatizer",
"attribute_ruler",
"ner",])
self.model.add_pipe(
"span_marker",
config={"model": span_marker_model_path, "batch_size": batch_size},
)
I call pipe() on a stream of documents:

for name, proc in self.model.pipeline:
    stream2 = proc.pipe(stream2)
The SpanMarker model in this pipeline performs inference on each doc in the stream as if it were a single sentence.
def pipe(self, stream, batch_size=128):
    """Fill `doc.ents` and `span.label_` using the chosen SpanMarker model."""
    if isinstance(stream, str):
        stream = [stream]
    if not isinstance(stream, types.GeneratorType):
        stream = self.nlp.pipe(stream, batch_size=batch_size)
    for docs in minibatch(stream, size=batch_size):
        inputs = [[token.text if not token.is_space else "" for token in doc] for doc in docs]
        # Use document-level context in the inference if the model was also trained that way
        if self.model.config.trained_with_document_context:
            inputs = self.convert_inputs_to_dataset(inputs)
        entities_list = self.model.predict(inputs, batch_size=self.batch_size)
        for doc, entities in zip(docs, entities_list):
            ents = []
            for entity in entities:
                start = entity["word_start_index"]
                end = entity["word_end_index"]
                span = doc[start:end]
                span.label_ = entity["label"]
                ents.append(span)
            self.set_ents(doc, ents)
            yield doc
So it reaches max sequence length pretty quickly and only annotates the first part of each document.
This is different to the behaviour I expected, where __call__() breaks the doc down into sentences and infers each sentence individually.
I'm just trying out the pretrained models accompanying this repo via HF Spaces and I'm seeing some weird results.
Am I using the pretrained models wrong? Is it expecting different kinds of inputs?
I can elaborate and test more, but I figured I'd post this first. 😊
Thanks!
Hey @tomaarsen ,
Thanks for sharing this amazing repo. While working on https://colab.research.google.com/drive/1XoAjmgyx_Sgj4WuY7w7fLR_ItBm7-Dzu?usp=sharing,
I was not able to embed tags using the following dataset: https://huggingface.co/datasets/nlpaueb/finer-139.
I'd love to hear how to improve on it.
Looking forward to hearing from you.
Thanks,
Andy
Hi. I would like to mix the SpanMarker model with the spaCy integration (https://spacy.io/universe/project/span_marker) and the entity ruler (to add some custom patterns), similar to the code below.
import spacy
import de_dep_news_trf

nlp = de_dep_news_trf.load()
patterns = [
    {"label": "PERIOD", "pattern": [{"LOWER": "monat"}]},
    {"label": "PER", "pattern": "Raluca"},
    {"label": "COLOR", "pattern": [{"LOWER": "blau"}]},
    {"label": "JOBTITLE", "pattern": [{"LOWER": {"REGEX": ".*(referent).*"}}]},
]
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns(patterns)
span_marker_ruler = nlp.add_pipe(
    'span_marker',
    config={"model": "tomaarsen/span-marker-mbert-base-multinerd"},
    name='span_marker_ruler',
)
doc = nlp("Raluca ist referent, er mag die Farbe Blau und geht jeden Monat in die Berge.")
print([(ent.text, ent.label_) for ent in doc.ents])
Unfortunately, this mix doesn't work properly.
I would, however, expect an output like this:

[('Raluca', 'PER'), ('referent', 'JOBTITLE'), ('Blau', 'COLOR'), ('Monat', 'PERIOD')]
How can I integrate the entity ruler (with some customized entities and patterns) with the SpanMarker model?
Do you have any ideas on how to solve this kind of issue?
I think it might not be best practice to completely overwrite the previously obtained entities; maybe something like the code below would work better:
from spacy.util import filter_spans
doc.set_ents(filter_spans(list(doc.ents) + new_ents))
Hi there. Thanks for the great library!
I have one issue regarding the usage of BERT-based models. I trained different models by finetuning them on my custom dataset (RoBERTa, LUKE, DeBERTa, XLM-RoBERTa, etc.).
I tried to do the same with a BERT-based model using the same code, but I get an error (also when using your code from the Getting Started part of the documentation).
I am using a dataset with this format:
{"tokens": ["(7)", "On", "specific", "query", "by", "the", "Bench", "about", "an", "entry", "of", "Rs.", "1,31,37,500", "on", "deposit", "side", "of", "Hongkong", "Bank", "account", "of", "which", "a", "photo", "copy", "is", "appearing", "at", "p.", "40", "of", "assessee's", "paper", "book,", "learned", "authorised", "representative", "submitted", "that", "it", "was", "related", "to", "loan", "from", "broker,", "Rahul", "&", "Co.", "on", "the", "basis", "of", "his", "submission", "a", "necessary", "mark", "is", "put", "by", "us", "on", "that", "photo", "copy."], "ner_tags": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 21, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 21, 21, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
And I load it with this script:
import json
from datasets import load_dataset, Dataset, DatasetDict

def load_legal_ner():
    ret = {}
    for split_name in ['TRAIN', 'DEV']:
        data = []
        with open(f"./data/NER_{split_name}/NER_{split_name}_ALL_OT.jsonl", 'r') as reader:
            for line in reader:
                data.append(json.loads(line))
        ret[split_name.lower()] = Dataset.from_list(data)
    return DatasetDict(ret)
For every other model, it works perfectly. But if I try to use a BERT-based model (e.g. bert-base-uncased, bert-base-cased, legal-bert, etc.), it crashes with different errors, always linked to the forward method (sometimes related to the normalization layer, sometimes to matmul).
This is the traceback:
Cell In[8], line 28
20 trainer = Trainer(
21 model=model,
22 args=args,
23 train_dataset=dataset["train"],
24 eval_dataset=dataset["dev"],
25 )
27 # Training is really simple using our Trainer!
---> 28 trainer.train()
File /opt/conda/lib/python3.10/site-packages/transformers/trainer.py:1537, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
1535 hf_hub_utils.enable_progress_bars()
1536 else:
-> 1537 return inner_training_loop(
1538 args=args,
1539 resume_from_checkpoint=resume_from_checkpoint,
1540 trial=trial,
1541 ignore_keys_for_eval=ignore_keys_for_eval,
1542 )
File /opt/conda/lib/python3.10/site-packages/transformers/trainer.py:1854, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
1851 self.control = self.callback_handler.on_step_begin(args, self.state, self.control)
1853 with self.accelerator.accumulate(model):
-> 1854 tr_loss_step = self.training_step(model, inputs)
1856 if (
1857 args.logging_nan_inf_filter
1858 and not is_torch_tpu_available()
1859 and (torch.isnan(tr_loss_step) or torch.isinf(tr_loss_step))
1860 ):
1861 # if loss is nan or inf simply add the average of previous logged losses
1862 tr_loss += tr_loss / (1 + self.state.global_step - self._globalstep_last_logged)
File /opt/conda/lib/python3.10/site-packages/transformers/trainer.py:2723, in Trainer.training_step(self, model, inputs)
2720 return loss_mb.reduce_mean().detach().to(self.args.device)
2722 with self.compute_loss_context_manager():
-> 2723 loss = self.compute_loss(model, inputs)
2725 if self.args.n_gpu > 1:
2726 loss = loss.mean() # mean() to average on multi-gpu parallel training
File /opt/conda/lib/python3.10/site-packages/transformers/trainer.py:2746, in Trainer.compute_loss(self, model, inputs, return_outputs)
2744 else:
2745 labels = None
-> 2746 outputs = model(**inputs)
2747 # Save past state if it exists
2748 # TODO: this needs to be fixed and made cleaner later.
2749 if self.args.past_index >= 0:
File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
1496 # If we don't have any hooks, we want to skip the rest of the logic in
1497 # this function, and just call forward.
1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1499 or _global_backward_pre_hooks or _global_backward_hooks
1500 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501 return forward_call(*args, **kwargs)
1502 # Do not call functions when jit is used
1503 full_backward_hooks, non_full_backward_hooks = [], []
File /opt/conda/lib/python3.10/site-packages/span_marker/modeling.py:153, in SpanMarkerModel.forward(self, input_ids, attention_mask, position_ids, start_marker_indices, num_marker_pairs, labels, num_words, document_ids, sentence_ids, **kwargs)
136 """Forward call of the SpanMarkerModel.
137
138 Args:
(...)
150 SpanMarkerOutput: The output dataclass.
151 """
152 token_type_ids = torch.zeros_like(input_ids)
--> 153 outputs = self.encoder(
154 input_ids,
155 attention_mask=attention_mask,
156 token_type_ids=token_type_ids,
157 position_ids=position_ids,
158 )
159 last_hidden_state = outputs[0]
160 last_hidden_state = self.dropout(last_hidden_state)
File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
1496 # If we don't have any hooks, we want to skip the rest of the logic in
1497 # this function, and just call forward.
1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1499 or _global_backward_pre_hooks or _global_backward_hooks
1500 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501 return forward_call(*args, **kwargs)
1502 # Do not call functions when jit is used
1503 full_backward_hooks, non_full_backward_hooks = [], []
File /opt/conda/lib/python3.10/site-packages/transformers/models/bert/modeling_bert.py:1013, in BertModel.forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, encoder_hidden_states, encoder_attention_mask, past_key_values, use_cache, output_attentions, output_hidden_states, return_dict)
1004 head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)
1006 embedding_output = self.embeddings(
1007 input_ids=input_ids,
1008 position_ids=position_ids,
(...)
1011 past_key_values_length=past_key_values_length,
1012 )
-> 1013 encoder_outputs = self.encoder(
1014 embedding_output,
1015 attention_mask=extended_attention_mask,
1016 head_mask=head_mask,
1017 encoder_hidden_states=encoder_hidden_states,
1018 encoder_attention_mask=encoder_extended_attention_mask,
1019 past_key_values=past_key_values,
1020 use_cache=use_cache,
1021 output_attentions=output_attentions,
1022 output_hidden_states=output_hidden_states,
1023 return_dict=return_dict,
1024 )
1025 sequence_output = encoder_outputs[0]
1026 pooled_output = self.pooler(sequence_output) if self.pooler is not None else None
File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
1496 # If we don't have any hooks, we want to skip the rest of the logic in
1497 # this function, and just call forward.
1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1499 or _global_backward_pre_hooks or _global_backward_hooks
1500 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501 return forward_call(*args, **kwargs)
1502 # Do not call functions when jit is used
1503 full_backward_hooks, non_full_backward_hooks = [], []
File /opt/conda/lib/python3.10/site-packages/transformers/models/bert/modeling_bert.py:607, in BertEncoder.forward(self, hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, past_key_values, use_cache, output_attentions, output_hidden_states, return_dict)
596 layer_outputs = self._gradient_checkpointing_func(
597 layer_module.__call__,
598 hidden_states,
(...)
604 output_attentions,
605 )
606 else:
--> 607 layer_outputs = layer_module(
608 hidden_states,
609 attention_mask,
610 layer_head_mask,
611 encoder_hidden_states,
612 encoder_attention_mask,
613 past_key_value,
614 output_attentions,
615 )
617 hidden_states = layer_outputs[0]
618 if use_cache:
File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
1496 # If we don't have any hooks, we want to skip the rest of the logic in
1497 # this function, and just call forward.
1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1499 or _global_backward_pre_hooks or _global_backward_hooks
1500 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501 return forward_call(*args, **kwargs)
1502 # Do not call functions when jit is used
1503 full_backward_hooks, non_full_backward_hooks = [], []
File /opt/conda/lib/python3.10/site-packages/transformers/models/bert/modeling_bert.py:497, in BertLayer.forward(self, hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, past_key_value, output_attentions)
485 def forward(
486 self,
487 hidden_states: torch.Tensor,
(...)
494 ) -> Tuple[torch.Tensor]:
495 # decoder uni-directional self-attention cached key/values tuple is at positions 1,2
496 self_attn_past_key_value = past_key_value[:2] if past_key_value is not None else None
--> 497 self_attention_outputs = self.attention(
498 hidden_states,
499 attention_mask,
500 head_mask,
501 output_attentions=output_attentions,
502 past_key_value=self_attn_past_key_value,
503 )
504 attention_output = self_attention_outputs[0]
506 # if decoder, the last output is tuple of self-attn cache
File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
1496 # If we don't have any hooks, we want to skip the rest of the logic in
1497 # this function, and just call forward.
1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1499 or _global_backward_pre_hooks or _global_backward_hooks
1500 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501 return forward_call(*args, **kwargs)
1502 # Do not call functions when jit is used
1503 full_backward_hooks, non_full_backward_hooks = [], []
File /opt/conda/lib/python3.10/site-packages/transformers/models/bert/modeling_bert.py:436, in BertAttention.forward(self, hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, past_key_value, output_attentions)
417 def forward(
418 self,
419 hidden_states: torch.Tensor,
(...)
425 output_attentions: Optional[bool] = False,
426 ) -> Tuple[torch.Tensor]:
427 self_outputs = self.self(
428 hidden_states,
429 attention_mask,
(...)
434 output_attentions,
435 )
--> 436 attention_output = self.output(self_outputs[0], hidden_states)
437 outputs = (attention_output,) + self_outputs[1:] # add attentions if we output them
438 return outputs
File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
1496 # If we don't have any hooks, we want to skip the rest of the logic in
1497 # this function, and just call forward.
1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1499 or _global_backward_pre_hooks or _global_backward_hooks
1500 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501 return forward_call(*args, **kwargs)
1502 # Do not call functions when jit is used
1503 full_backward_hooks, non_full_backward_hooks = [], []
File /opt/conda/lib/python3.10/site-packages/transformers/models/bert/modeling_bert.py:386, in BertSelfOutput.forward(self, hidden_states, input_tensor)
385 def forward(self, hidden_states: torch.Tensor, input_tensor: torch.Tensor) -> torch.Tensor:
--> 386 hidden_states = self.dense(hidden_states)
387 hidden_states = self.dropout(hidden_states)
388 hidden_states = self.LayerNorm(hidden_states + input_tensor)
File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
1496 # If we don't have any hooks, we want to skip the rest of the logic in
1497 # this function, and just call forward.
1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1499 or _global_backward_pre_hooks or _global_backward_hooks
1500 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501 return forward_call(*args, **kwargs)
1502 # Do not call functions when jit is used
1503 full_backward_hooks, non_full_backward_hooks = [], []
File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/linear.py:114, in Linear.forward(self, input)
113 def forward(self, input: Tensor) -> Tensor:
--> 114 return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasLtMatmul with transpose_mat1 1 transpose_mat2 0 m 768 n 3072 k 768 mat1_ld 768 mat2_ld 768 result_ld 768 abcType 0 computeType 68 scaleType 0
And here is another traceback (same code):
RuntimeError Traceback (most recent call last)
Cell In[16], line 148
139 trainer = Trainer(
140 model=model,
141 args=args,
(...)
144 compute_metrics=compute_f1
145 )
147 # Training is really simple using our Trainer!
--> 148 trainer.train()
150 # ... and so is evaluating!
151 metrics = trainer.evaluate()
File /opt/conda/lib/python3.10/site-packages/transformers/trainer.py:1537, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
1535 hf_hub_utils.enable_progress_bars()
1536 else:
-> 1537 return inner_training_loop(
1538 args=args,
1539 resume_from_checkpoint=resume_from_checkpoint,
1540 trial=trial,
1541 ignore_keys_for_eval=ignore_keys_for_eval,
1542 )
File /opt/conda/lib/python3.10/site-packages/transformers/trainer.py:1854, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
1851 self.control = self.callback_handler.on_step_begin(args, self.state, self.control)
1853 with self.accelerator.accumulate(model):
-> 1854 tr_loss_step = self.training_step(model, inputs)
1856 if (
1857 args.logging_nan_inf_filter
1858 and not is_torch_tpu_available()
1859 and (torch.isnan(tr_loss_step) or torch.isinf(tr_loss_step))
1860 ):
1861 # if loss is nan or inf simply add the average of previous logged losses
1862 tr_loss += tr_loss / (1 + self.state.global_step - self._globalstep_last_logged)
File /opt/conda/lib/python3.10/site-packages/transformers/trainer.py:2723, in Trainer.training_step(self, model, inputs)
2720 return loss_mb.reduce_mean().detach().to(self.args.device)
2722 with self.compute_loss_context_manager():
-> 2723 loss = self.compute_loss(model, inputs)
2725 if self.args.n_gpu > 1:
2726 loss = loss.mean() # mean() to average on multi-gpu parallel training
File /opt/conda/lib/python3.10/site-packages/transformers/trainer.py:2746, in Trainer.compute_loss(self, model, inputs, return_outputs)
2744 else:
2745 labels = None
-> 2746 outputs = model(**inputs)
2747 # Save past state if it exists
2748 # TODO: this needs to be fixed and made cleaner later.
2749 if self.args.past_index >= 0:
File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
1496 # If we don't have any hooks, we want to skip the rest of the logic in
1497 # this function, and just call forward.
1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1499 or _global_backward_pre_hooks or _global_backward_hooks
1500 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501 return forward_call(*args, **kwargs)
1502 # Do not call functions when jit is used
1503 full_backward_hooks, non_full_backward_hooks = [], []
File /opt/conda/lib/python3.10/site-packages/span_marker/modeling.py:153, in SpanMarkerModel.forward(self, input_ids, attention_mask, position_ids, start_marker_indices, num_marker_pairs, labels, num_words, document_ids, sentence_ids, **kwargs)
136 """Forward call of the SpanMarkerModel.
137
138 Args:
(...)
150 SpanMarkerOutput: The output dataclass.
151 """
152 token_type_ids = torch.zeros_like(input_ids)
--> 153 outputs = self.encoder(
154 input_ids,
155 attention_mask=attention_mask,
156 token_type_ids=token_type_ids,
157 position_ids=position_ids,
158 )
159 last_hidden_state = outputs[0]
160 last_hidden_state = self.dropout(last_hidden_state)
File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
1496 # If we don't have any hooks, we want to skip the rest of the logic in
1497 # this function, and just call forward.
1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1499 or _global_backward_pre_hooks or _global_backward_hooks
1500 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501 return forward_call(*args, **kwargs)
1502 # Do not call functions when jit is used
1503 full_backward_hooks, non_full_backward_hooks = [], []
File /opt/conda/lib/python3.10/site-packages/transformers/models/bert/modeling_bert.py:1013, in BertModel.forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, encoder_hidden_states, encoder_attention_mask, past_key_values, use_cache, output_attentions, output_hidden_states, return_dict)
1004 head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)
1006 embedding_output = self.embeddings(
1007 input_ids=input_ids,
1008 position_ids=position_ids,
(...)
1011 past_key_values_length=past_key_values_length,
1012 )
-> 1013 encoder_outputs = self.encoder(
1014 embedding_output,
1015 attention_mask=extended_attention_mask,
1016 head_mask=head_mask,
1017 encoder_hidden_states=encoder_hidden_states,
1018 encoder_attention_mask=encoder_extended_attention_mask,
1019 past_key_values=past_key_values,
1020 use_cache=use_cache,
1021 output_attentions=output_attentions,
1022 output_hidden_states=output_hidden_states,
1023 return_dict=return_dict,
1024 )
1025 sequence_output = encoder_outputs[0]
1026 pooled_output = self.pooler(sequence_output) if self.pooler is not None else None
File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
1496 # If we don't have any hooks, we want to skip the rest of the logic in
1497 # this function, and just call forward.
1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1499 or _global_backward_pre_hooks or _global_backward_hooks
1500 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501 return forward_call(*args, **kwargs)
1502 # Do not call functions when jit is used
1503 full_backward_hooks, non_full_backward_hooks = [], []
File /opt/conda/lib/python3.10/site-packages/transformers/models/bert/modeling_bert.py:607, in BertEncoder.forward(self, hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, past_key_values, use_cache, output_attentions, output_hidden_states, return_dict)
596 layer_outputs = self._gradient_checkpointing_func(
597 layer_module.__call__,
598 hidden_states,
(...)
604 output_attentions,
605 )
606 else:
--> 607 layer_outputs = layer_module(
608 hidden_states,
609 attention_mask,
610 layer_head_mask,
611 encoder_hidden_states,
612 encoder_attention_mask,
613 past_key_value,
614 output_attentions,
615 )
617 hidden_states = layer_outputs[0]
618 if use_cache:
File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
1496 # If we don't have any hooks, we want to skip the rest of the logic in
1497 # this function, and just call forward.
1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1499 or _global_backward_pre_hooks or _global_backward_hooks
1500 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501 return forward_call(*args, **kwargs)
1502 # Do not call functions when jit is used
1503 full_backward_hooks, non_full_backward_hooks = [], []
File /opt/conda/lib/python3.10/site-packages/transformers/models/bert/modeling_bert.py:539, in BertLayer.forward(self, hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, past_key_value, output_attentions)
536 cross_attn_present_key_value = cross_attention_outputs[-1]
537 present_key_value = present_key_value + cross_attn_present_key_value
--> 539 layer_output = apply_chunking_to_forward(
540 self.feed_forward_chunk, self.chunk_size_feed_forward, self.seq_len_dim, attention_output
541 )
542 outputs = (layer_output,) + outputs
544 # if decoder, return the attn key/values as the last output
File /opt/conda/lib/python3.10/site-packages/transformers/pytorch_utils.py:242, in apply_chunking_to_forward(forward_fn, chunk_size, chunk_dim, *input_tensors)
239 # concatenate output at same dimension
240 return torch.cat(output_chunks, dim=chunk_dim)
--> 242 return forward_fn(*input_tensors)
File /opt/conda/lib/python3.10/site-packages/transformers/models/bert/modeling_bert.py:552, in BertLayer.feed_forward_chunk(self, attention_output)
550 def feed_forward_chunk(self, attention_output):
551 intermediate_output = self.intermediate(attention_output)
--> 552 layer_output = self.output(intermediate_output, attention_output)
553 return layer_output
File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
1496 # If we don't have any hooks, we want to skip the rest of the logic in
1497 # this function, and just call forward.
1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1499 or _global_backward_pre_hooks or _global_backward_hooks
1500 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501 return forward_call(*args, **kwargs)
1502 # Do not call functions when jit is used
1503 full_backward_hooks, non_full_backward_hooks = [], []
File /opt/conda/lib/python3.10/site-packages/transformers/models/bert/modeling_bert.py:466, in BertOutput.forward(self, hidden_states, input_tensor)
464 hidden_states = self.dense(hidden_states)
465 hidden_states = self.dropout(hidden_states)
--> 466 hidden_states = self.LayerNorm(hidden_states + input_tensor)
467 return hidden_states
File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
1496 # If we don't have any hooks, we want to skip the rest of the logic in
1497 # this function, and just call forward.
1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1499 or _global_backward_pre_hooks or _global_backward_hooks
1500 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501 return forward_call(*args, **kwargs)
1502 # Do not call functions when jit is used
1503 full_backward_hooks, non_full_backward_hooks = [], []
File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/normalization.py:190, in LayerNorm.forward(self, input)
189 def forward(self, input: Tensor) -> Tensor:
--> 190 return F.layer_norm(
191 input, self.normalized_shape, self.weight, self.bias, self.eps)
File /opt/conda/lib/python3.10/site-packages/torch/nn/functional.py:2515, in layer_norm(input, normalized_shape, weight, bias, eps)
2511 if has_torch_function_variadic(input, weight, bias):
2512 return handle_torch_function(
2513 layer_norm, (input, weight, bias), input, normalized_shape, weight=weight, bias=bias, eps=eps
2514 )
-> 2515 return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions
Sometimes also this one:
/usr/local/src/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [313,0,0], thread: [0,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
I know that's probably not much to work on. Let me know if you have any advice for me.
transformers==4.36.0
span-marker==1.5.0
torch==2.0.0
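Not from the original thread, but one hedged way to surface the real error behind a device-side assert is to run a single batch on CPU, where out-of-range indices raise readable exceptions (the `indexSelectLargeIndex` assert typically means an input id or label index exceeds an embedding or classifier dimension). A sketch, assuming `model` and `trainer` from the code above:

# Run one training batch on CPU to turn the device-side assert into a
# readable IndexError/RuntimeError with a correct stack trace.
model = model.cpu()
batch = next(iter(trainer.get_train_dataloader()))
batch = {k: (v.cpu() if hasattr(v, "cpu") else v) for k, v in batch.items()}
outputs = model(**batch)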
In trainer.py, there are three .map functions where num_proc is not specified.
It should be possible to set this, because it speeds up tokenizing, spreading, and normalizing by a significant amount.
Example from trainer.py, row 227:
with tokenizer.entity_tracker(split=dataset_name):
    dataset = dataset.map(
        tokenizer,
        batched=True,
        remove_columns=set(dataset.column_names) - set(self.OPTIONAL_COLUMNS),
        desc=f"Tokenizing the {dataset_name} dataset",
        fn_kwargs={"return_num_words": is_evaluate},
        num_proc=4,  # Added this - should be specifiable
    )
This sped up tokenization by about 4 times.
Hi Tom,
Sorry for missing your email. I noticed that you get slightly worse results on the OntoNotes dataset.
There are several versions of OntoNotes. In PL-Marker, I preprocess CoNLL-2012 via https://drive.google.com/file/d/1cFVb5thXNCXZkz99E8w4Xg9JB9E_viAa/view?usp=sharing
Best,
Deming
When several identical inputs are passed, predict should return that many lists. Instead, it gives one.
inputs = ["Unknown", "Unknown", "Unknown"]
model = SpanMarkerModel.from_pretrained("model_name")
model.predict(inputs)
Output:
[]
Expected Output:
[[], [], []]
ValueError Traceback (most recent call last)
in <cell line: 6>()
4 # Load the spaCy model
5 nlp = spacy.load("en_core_web_sm")
----> 6 nlp.add_pipe("span_marker", config={"model": "tomaarsen/span-marker-roberta-large-ontonotes5"})
7
8 # Feed some text through the model to get a spacy Doc
1 frames
/usr/local/lib/python3.10/dist-packages/spacy/language.py in create_pipe(self, factory_name, name, config, raw_config, validate)
658 lang_code=self.lang,
659 )
--> 660 raise ValueError(err)
661 pipe_meta = self.get_factory_meta(factory_name)
662 # This is unideal, but the alternative would mean you always need to
ValueError: [E002] Can't find factory for 'span_marker' for language English (en). This usually happens when spaCy calls nlp.create_pipe with a custom component name that's not registered on the current language class. If you're using a Transformer, make sure to install 'spacy-transformers'. If you're using a custom component, make sure you've added the decorator @Language.component (for function components) or @Language.factory (for class components).
Available factories: attribute_ruler, tok2vec, merge_noun_chunks, merge_entities, merge_subtokens, token_splitter, doc_cleaner, parser, beam_parser, lemmatizer, trainable_lemmatizer, entity_linker, ner, beam_ner, entity_ruler, tagger, morphologizer, senter, sentencizer, textcat, spancat, spancat_singlelabel, future_entity_ruler, span_ruler, textcat_multilabel, en.lemmatizer
Package versions:
spacy == 3.5.3 (did not work with 3.5.2 either)
span_marker == 1.2.1
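A likely fix, assuming the "span_marker" factory is registered when the package is imported (as the spaCy integration docs suggest): import span_marker before calling add_pipe.

import spacy
import span_marker  # noqa: F401 -- importing registers the "span_marker" factory

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("span_marker", config={"model": "tomaarsen/span-marker-roberta-large-ontonotes5"})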
Is there a way to use https://pypi.org/project/nervaluate/ instead of the current metric library?
I would like to obtain F1-exact, F1-strict, etc.
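There is no built-in switch for this shown here, but converting predictions by hand is straightforward. A sketch using nervaluate's documented list-of-spans format; the span keys and the two-value return follow the nervaluate README, so treat them as assumptions for your installed version:

from nervaluate import Evaluator

# Per-sentence lists of spans; "start"/"end" are token indices in
# nervaluate's prodigy-style format. Replace with your gold and predicted spans.
true = [[{"label": "PER", "start": 2, "end": 4}]]
pred = [[{"label": "PER", "start": 2, "end": 4}]]

evaluator = Evaluator(true, pred, tags=["PER", "ORG", "LOC"])
results, results_per_tag = evaluator.evaluate()
print(results["strict"], results["exact"])  # strict / exact metrics, incl. F1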
Hi,
Appreciate the amazing work. Is there a list of the entities detected by the pretrained model?
Thanks.
I ran the README snippet in a Kaggle notebook (Python 3.7) after running %pip install datasets span_marker transformers. I get a KeyError from trainer.train() / Trainer.preprocess_dataset (see below).
I was able to fix this by upgrading to datasets>=2.6.0. It seems that with the default Kaggle setup, datasets==2.1.0 was installed when I ran %pip install datasets span_marker transformers.
I can't see an obvious reason for the change in datasets behaviour in the 2.6.0 release notes, so maybe I'm on the wrong track. I thought you might like to know, in case there is a reason and you would like to add a constraint on datasets as a dependency.
Thanks for the package!
Traceback:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
/tmp/ipykernel_103/2689077115.py in <module>
30 )
31
---> 32 trainer.train()
33 trainer.save_model("my_span_marker_model/checkpoint-final")
34
/opt/conda/lib/python3.7/site-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
1635 resume_from_checkpoint=resume_from_checkpoint,
1636 trial=trial,
-> 1637 ignore_keys_for_eval=ignore_keys_for_eval,
1638 )
1639
/opt/conda/lib/python3.7/site-packages/transformers/trainer.py in _inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
1643 self._train_batch_size = batch_size
1644 # Data loader and number of training steps
-> 1645 train_dataloader = self.get_train_dataloader()
1646
1647 # Setting up training control variables:
/opt/conda/lib/python3.7/site-packages/span_marker/trainer.py in get_train_dataloader(self)
176 def get_train_dataloader(self) -> DataLoader:
177 """Return the preprocessed training DataLoader."""
--> 178 self.train_dataset = self.preprocess_dataset(self.train_dataset, self.label_normalizer, self.tokenizer)
179 return super().get_train_dataloader()
180
/opt/conda/lib/python3.7/site-packages/span_marker/trainer.py in preprocess_dataset(self, dataset, label_normalizer, tokenizer, dataset_name, is_evaluate)
170 batched=True,
171 remove_columns=dataset.column_names,
--> 172 desc=f"Tokenizing the {dataset_name} dataset",
173 )
174 return dataset
/opt/conda/lib/python3.7/site-packages/datasets/arrow_dataset.py in map(self, function, with_indices, with_rank, input_columns, batched, batch_size, drop_last_batch, remove_columns, keep_in_memory, load_from_cache_file, cache_file_name, writer_batch_size, features, disable_nullable, fn_kwargs, num_proc, suffix_template, new_fingerprint, desc)
1971 new_fingerprint=new_fingerprint,
1972 disable_tqdm=disable_tqdm,
-> 1973 desc=desc,
1974 )
1975 else:
/opt/conda/lib/python3.7/site-packages/datasets/arrow_dataset.py in wrapper(*args, **kwargs)
518 self: "Dataset" = kwargs.pop("self")
519 # apply actual function
--> 520 out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
521 datasets: List["Dataset"] = list(out.values()) if isinstance(out, dict) else [out]
522 for dataset in datasets:
/opt/conda/lib/python3.7/site-packages/datasets/arrow_dataset.py in wrapper(*args, **kwargs)
485 }
486 # apply actual function
--> 487 out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
488 datasets: List["Dataset"] = list(out.values()) if isinstance(out, dict) else [out]
489 # re-apply format to the output
/opt/conda/lib/python3.7/site-packages/datasets/fingerprint.py in wrapper(*args, **kwargs)
456 # Call actual function
457
--> 458 out = func(self, *args, **kwargs)
459
460 # Update fingerprint of in-place transforms + update in-place history of transforms
/opt/conda/lib/python3.7/site-packages/datasets/arrow_dataset.py in _map_single(self, function, with_indices, with_rank, input_columns, batched, batch_size, drop_last_batch, remove_columns, keep_in_memory, load_from_cache_file, cache_file_name, writer_batch_size, features, disable_nullable, fn_kwargs, new_fingerprint, rank, offset, disable_tqdm, desc, cache_only)
2341 indices,
2342 check_same_num_examples=len(input_dataset.list_indexes()) > 0,
-> 2343 offset=offset,
2344 )
2345 except NumExamplesMismatchError:
/opt/conda/lib/python3.7/site-packages/datasets/arrow_dataset.py in apply_function_on_filtered_inputs(inputs, indices, check_same_num_examples, offset)
2218 if with_rank:
2219 additional_args += (rank,)
-> 2220 processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
2221 if update_data is None:
2222 # Check if the function returns updated examples
/opt/conda/lib/python3.7/site-packages/datasets/arrow_dataset.py in decorated(item, *args, **kwargs)
1913 )
1914 # Use the LazyDict internally, while mapping the function
-> 1915 result = f(decorated_item, *args, **kwargs)
1916 # Return a standard dict
1917 return result.data if isinstance(result, LazyDict) else result
/opt/conda/lib/python3.7/site-packages/span_marker/trainer.py in <lambda>(batch)
167 # Tokenize and add start/end markers
168 dataset = dataset.map(
--> 169 lambda batch: tokenizer(batch["tokens"], labels=batch["ner_tags"], return_num_words=is_evaluate),
170 batched=True,
171 remove_columns=dataset.column_names,
/opt/conda/lib/python3.7/site-packages/datasets/arrow_dataset.py in __getitem__(self, key)
123 class Batch(LazyDict):
124 def __getitem__(self, key):
--> 125 values = super().__getitem__(key)
126 if self.features and key in self.features:
127 values = [
/opt/conda/lib/python3.7/collections/__init__.py in __getitem__(self, key)
1025 if hasattr(self.__class__, "__missing__"):
1026 return self.__class__.__missing__(self, key)
-> 1027 raise KeyError(key)
1028 def __setitem__(self, key, item): self.data[key] = item
1029 def __delitem__(self, key): del self.data[key]
KeyError: 'tokens'
See title. This adding of contextual information occurs every time the development set is used for a mid-training evaluation, which is a bit of a waste of (training) time.
Training CoNLL03 with doc-level context should reproduce it.
Hi,
You did a very good job. I tried SpanMarker NER with the pretrained model "bert-base-multilingual-cased" and had good results. But now I want to use my code with MobileBERT.
Do you know if that's possible?
And how would I do it with https://tfhub.dev/tensorflow/mobilebert_multi_cased_L-24_H-128_B-512_A-4_F-4_OPT/1 (for multilingual)?
Thank you
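Not an authoritative answer, but SpanMarker builds on Hugging Face transformers encoders, so a TF Hub SavedModel cannot be passed in directly. A transformers-format MobileBERT checkpoint should slot in in principle; a sketch ("google/mobilebert-uncased" is English-only, and I am not aware of a multilingual MobileBERT in transformers format):

from span_marker import SpanMarkerModel

model = SpanMarkerModel.from_pretrained(
    "google/mobilebert-uncased",  # any transformers-format encoder works here
    labels=["O", "PER", "ORG", "LOC", "MISC"],  # example label set
)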
Hi @tomaarsen! Is an ONNX exporter planned? Have you tried using SpanMarker with ONNX models for inference?
Would be really curious if you experimented with that already! :-)
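An untested sketch of a starting point: export only the underlying transformer encoder to ONNX. SpanMarker's marker logic (position_ids, its 2D attention masks) would still need to be reproduced around the exported graph, so this is not a drop-in replacement; `model.tokenizer.tokenizer` (the wrapped transformers tokenizer) is an assumption about the attribute layout.

import torch
from span_marker import SpanMarkerModel

model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-mbert-base-multinerd").eval()
encoder = model.encoder
encoder.config.return_dict = False  # tuple outputs trace more cleanly

dummy = model.tokenizer.tokenizer("An example sentence", return_tensors="pt")
torch.onnx.export(
    encoder,
    (dummy["input_ids"], dummy["attention_mask"]),
    "span_marker_encoder.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["last_hidden_state"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
    },
)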
Hello!
This is a heads up that (XLM-)RoBERTa-based SpanMarker models require text to be preprocessed to separate punctuation from words:
# ✅
model.predict("He plays J. Robert Oppenheimer , an American theoretical physicist .")
# ❌
model.predict("He plays J. Robert Oppenheimer, an American theoretical physicist.")
# You can also supply a list of words directly: ✅
model.predict(["He", "plays", "J.", "Robert", "Oppenheimer", ",", "an", "American", "theoretical", "physicist", "."])
This is a consequence of the RoBERTa tokenizer distinguishing a "," attached to the previous word from a standalone "," preceded by a space as different tokens, and the SpanMarker model is only familiar with the standalone variant.
Another alternative is to use the spaCy integration, which preprocesses the text into words for you!
The (m)BERT-based SpanMarker models do not require this preprocessing.
I have used gte-tiny embeddings for my custom NER model and need to speed up inference.
Below are stats for different batch sizes.
| Batch Size | Average Inference Time (ms) - GPU | Average Inference Time (ms) - CPU |
|---|---|---|
| 16 | 0.14945 | 1.23388 |
| 32 | 0.28 | 3.24456 |
| 64 | 0.51582 | 6.57234 |
| 128 | 1.10669 | 13.73319 |
| 256 | 2.24729 | 28.236 |
Is there any specific method to enhance it? @tomaarsen
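Not a benchmarked answer, but the usual first levers are half precision on GPU and larger predict batches. A sketch; whether .half() interacts cleanly with SpanMarker's predict is an assumption to verify against your outputs:

import torch
from span_marker import SpanMarkerModel

model = SpanMarkerModel.from_pretrained("your-custom-ner-model").eval()  # hypothetical model id
if torch.cuda.is_available():
    model = model.cuda().half()  # fp16 inference; check that predictions are unchanged

sentences = ["An example sentence .", "Another example ."]  # your inputs
entities = model.predict(sentences, batch_size=64)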