
spanmarkerner's People

Contributors

davidberenstein1957, tomaarsen


spanmarkerner's Issues

SpanMarker with document-level context gives an error (RuntimeError: CUDA error: device-side assert triggered)

Gives this error:

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Fine-tuning code:

from datasets import load_dataset, Dataset
dataset = load_dataset("json", data_files=["output.jsonl"])
from span_marker import SpanMarkerModel
model = SpanMarkerModel.from_pretrained(
    "bert-base-uncased",  # Example encoder
    labels=['O', 'Degree', 'Years_of_Experience', 'Email_Address',
            'College_Name', 'Location', 'Designation', 'Graduation_Year', 'Skills', 'Name',
            'Companies_worked_at'],
    max_prev_context=2,
    max_next_context=2,
)
from transformers import TrainingArguments
args = TrainingArguments(
    output_dir="models/RUDYRDX-NER-1",
    learning_rate=1e-5,
    gradient_accumulation_steps=2,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=1,
    evaluation_strategy="steps",
    save_strategy="steps",
    eval_steps=500,
    push_to_hub=False,
    logging_steps=50,
    fp16=True,
    warmup_ratio=0.1,
)
from span_marker import Trainer
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset['train'],
)
trainer.train() # error happens when this runs

Dataset Sample:

{"document_id": 0, "sentence_id": 0, "tokens": ["Govardhana", "K", "Senior", "Software", "Engineer", "Bengaluru", "Karnataka", "Karnataka", "-", "Email", "Indeed", ":", "indeed.com/r/Govardhana-K/", "b2de315d95905b68", "Total", "experience", "5", "Years", "6", "Months", "Cloud", "Lending", "Solutions", "INC", "4", "Month", "Salesforce", "Developer", "Oracle", "5", "Years", "2", "Month", "Core", "Java", "Developer", "Languages", "Core", "Java", "Go", "Lang", "Oracle", "PL-SQL", "programming", "Sales", "Force", "Developer", "APEX", "."], "ner_tags": ["Name", "Designation", "Designation", "Designation", "O", "O", "O", "O", "O", "O", "O", "O", "Email Address", "Email Address", "Email Address", "O", "O", "O", "O", "O", "O", "Companies worked at", "Companies worked at", "Companies worked at", "Companies worked at", "O", "O", "O", "O", "O", "Companies worked at", "Companies worked at", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"]}
{"document_id": 0, "sentence_id": 1, "tokens": ["Designations", "&", "Promotions", "Willing", "relocate", ":", "Anywhere", "WORK", "EXPERIENCE", "Senior", "Software", "Engineer", "Cloud", "Lending", "Solutions", "-", "Bangalore", "Karnataka", "-", "January", "2018", "Present", "Present", "Senior", "Consultant", "Oracle", "-", "Bangalore", "Karnataka", "-", "November", "2016", "December", "2017", "Staff", "Consultant", "Oracle", "-", "Bangalore", "Karnataka", "-", "January", "2014", "October", "2016", "Associate", "Consultant", "Oracle", "-", "Bangalore", "Karnataka", "-", "November", "2012", "December", "2013", "EDUCATION", "B.E", "Computer", "Science", "Engineering", "Adithya", "Institute", "Technology", "-", "Tamil", "Nadu", "September", "2008", "June", "2012", "https", ":", "//www.indeed.com/r/Govardhana-K/b2de315d95905b68", "?", "isid=rex-download", "&", "ikw=download-top", "&", "co=IN", "https", ":", "//www.indeed.com/r/Govardhana-K/b2de315d95905b68", "?", "isid=rex-download", "&", "ikw=download-top", "&", "co=IN", "SKILLS", "APEX", "."], "ner_tags": ["Designation", "Designation", "Designation", "Designation", "Location", "Location", "O", "O", "O", "O", "O", "Email Address", "Email Address", "Email Address", "Email Address", "Email Address", "Email Address", "O", "O", "O", "O", "O", "O", "Companies worked at", "Companies worked at", "Companies worked at", "O", "O", "O", "O", "O", "Companies worked at", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "Companies worked at", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "Designation", "Designation", "Designation", "Companies worked at", "Companies worked at", "Companies worked at", "Companies worked at", "O", "O", "O", "O", "O", "O", "Designation", "Designation", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "Designation", "Designation", "Designation"]}
{"document_id": 0, "sentence_id": 2, "tokens": ["(", "Less", "1", "year", ")", "Data", "Structures", "(", "3", "years", ")", "FLEXCUBE", "(", "5", "years", ")", "Oracle", "(", "5", "years", ")", "Algorithms", "(", "3", "years", ")", "LINKS", "https", ":", "//www.linkedin.com/in/govardhana-k-61024944/", "ADDITIONAL", "INFORMATION", "Technical", "Proficiency", ":", "Languages", ":", "Core", "Java", "Go", "Lang", "Data", "Structures", "&", "Algorithms", "Oracle", "PL-SQL", "programming", "Sales", "Force", "APEX", "."], "ner_tags": ["Name", "Name", "Name", "Designation", "Designation", "Designation", "Designation", "Designation", "Designation", "Location", "Location", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "Email Address", "Email Address", "Email Address", "Email Address", "Email Address", "Email Address", "Email Address", "Email Address", "O", "Companies worked at", "Companies worked at", "O", "O", "O", "O", "O", "O", "Companies worked at", "Companies worked at", "O", "O", "O", "O", "O", "O", "O", "O", "O", "Companies worked at", "O", "O"]}
{"document_id": 0, "sentence_id": 3, "tokens": ["Tools", ":", "RADTool", "Jdeveloper", "NetBeans", "Eclipse", "SQL", "developer", "PL/SQL", "Developer", "WinSCP", "Putty", "Web", "Technologies", ":", "JavaScript", "XML", "HTML", "Webservice", "Operating", "Systems", ":", "Linux", "Windows", "Version", "control", "system", "SVN", "&", "Git-Hub", "Databases", ":", "Oracle", "Middleware", ":", "Web", "logic", "OC4J", "Product", "FLEXCUBE", ":", "Oracle", "FLEXCUBE", "Versions", "10.x", "11.x", "12.x", "https", ":", "//www.linkedin.com/in/govardhana-k-61024944/"], "ner_tags": ["Name", "Name", "Designation", "Designation", "Designation", "Location", "O", "O", "O", "O", "O", "O", "O", "Email Address", "Email Address", "Email Address", "Email Address", "Email Address", "O", "O", "O", "O", "O", "O", "Companies worked at", "Companies worked at", "Companies worked at", "O", "O", "O", "O", "O", "O", "Companies worked at", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "Companies worked at", "O", "O", "O", "O"]}

implementing the relation identification part?

Hi,

Do you have any pointers on implementing the relation identification part that follows the entity recognition?

I am looking into the source code and thinking about implementing it myself. It would be useful if you have any suggestions on how to do that.

Thanks

How to make this work for overlapping entities?

Hi Tom, I was wondering how I could make this work for overlapping spans of different entity types, and how to extend it to relation extraction as well.
Any help or direction would be super helpful.

Amazing work, but is there a minimum number of tokens?

Hello,
It is really amazing work, but I wonder: is there a minimum number of tokens required?
When I tried inference on a short sentence or question, it just returned an empty JSON.

is <start> <end> ever used?

class SpanMarkerTokenizer:
    def __init__(self, tokenizer: PreTrainedTokenizer, config: SpanMarkerConfig, **kwargs) -> None:
        self.tokenizer = tokenizer
        self.config = config

        tokenizer.add_tokens(["<start>", "<end>"], special_tokens=True)
        self.start_marker_id, self.end_marker_id = self.tokenizer.convert_tokens_to_ids(["<start>", "<end>"])

Hi,

In the above code, you defined `<start>` and `<end>` for the tokenizer, but I couldn't find where they are used. Are they perhaps intended for relation extraction?

Choose class-candidates during inference

Really great library @tomaarsen, thank you for this great contribution!!

It would be really convenient to have the possibility to pass a list of class candidates to the predict method during inference. This would be useful for Retriever-Reader systems, where the Retriever (e.g. a SetFit model) returns text sequences for which the set of classes available to the Reader (e.g. SpanMarker) is already known, and you do not want to extract other classes.

E.g. a system like this from here https://lilianweng.github.io/posts/2020-10-29-odqa/:

I was thinking about modifications to the predict method like these:

def predict(self, ... , class_candidates: Optional[List[str]] = None):
    
    ...

    if class_candidates is not None:
        # convert class names to class ids
        label2id = self.config.label2id
        class_candidate_ids = [label2id[c] for c in class_candidates if c in label2id]

    for batch_start_idx in trange(0, len(dataset), batch_size, leave=True, disable=not show_progress_bar):
        
        ...
        # Computing probabilities based on the logits
        probs = output.logits.softmax(-1)

        # Mask everything except class-candidate probabilities
        if class_candidates is not None:
            mask = torch.zeros_like(probs)
            mask[:, :, class_candidate_ids] = 1
            probs = probs * mask

        # Get the labels and the corresponding probability scores
        scores, labels = probs.max(-1)

        ...

    return all_entities

I did not find time to have a deep dive, implement & test it, but I think this could be a useful feature.
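
If such a parameter were added, usage could look like this (hypothetical signature and label names, purely for illustration):

# `class_candidates` is the proposed, not-yet-existing argument from the sketch above
entities = model.predict(
    "Tom works at Hugging Face in Amsterdam.",
    class_candidates=["person", "organisation"],
)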

Possible to load your own trained models with internet disabled?

I was wondering if there is a way to load a model I trained myself in a Kaggle notebook. There's currently an NER competition going on, and I wanted to try using the SpanMarker library to compete. Training went fine, but to submit, the Kaggle notebook must have internet disabled. When trying to load my checkpoint, I get this error:

model_checkpoint = "/kaggle/input/pii-train-1-cp3000/Kaggle Checkpoints/checkpoint 3000"
model = SpanMarkerModel.from_pretrained(model_checkpoint,local_files_only = True,
labels = [
'1-EMAIL', '1-ID_NUM', '1-NAME_STUDENT', '1-PHONE_NUM', '1-STREET_ADDRESS',
'1-URL_PERSONAL', '1-USERNAME', '2-ID_NUM', '2-NAME_STUDENT', '2-PHONE_NUM',
'2-STREET_ADDRESS', '2-URL_PERSONAL', 'O'
])

OSError: We couldn't connect to 'https://huggingface.co/' to load this file, couldn't find it in the cached files and it looks like bert-base-uncased is not the path to a directory containing a file named config.json.
Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/transformers/installation#offline-mode'.

Kaggle notebook here: https://www.kaggle.com/jdonnelly0804/pii-infer
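
The traceback shows that something is still trying to resolve `bert-base-uncased` from the Hub rather than reading everything from the checkpoint directory. A sketch of what could be tried (assuming the checkpoint directory contains `config.json` and the tokenizer files, and that a trained SpanMarker checkpoint already stores its label mapping, so `labels` can be omitted):

import os

# Force the Hub client and transformers into offline mode before loading anything
os.environ["HF_HUB_OFFLINE"] = "1"
os.environ["TRANSFORMERS_OFFLINE"] = "1"

from span_marker import SpanMarkerModel

model_checkpoint = "/kaggle/input/pii-train-1-cp3000/Kaggle Checkpoints/checkpoint 3000"
model = SpanMarkerModel.from_pretrained(model_checkpoint, local_files_only=True)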

Cannot train BILOU scheme with no singletons

The AutoLabelNormalizer infers the scheme based on the presence of all the tag prefixes - i.e. BILOU is assumed if there's at least one of each of ['B','I','L','O','U']. There doesn't appear to be any way of passing a specific LabelNormalizer to the trainer.

My issue with this is that my dataset contains only BILO - i.e., there are no singletons. But the nature of my problem means I need to use a scheme that has B and L tags.

Because my dataset has no U tags, SpanMarker errors out, since the set BILO doesn't match any of the cases that AutoLabelNormalizer checks. Perhaps a LabelNormalizer could be passed as an argument, or the conditions could be relaxed, e.g.:

    if (tags == set("BILOU")) or (tags == set("BILO")):
        return LabelNormalizerBILOU(config)
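
In the meantime, a possible user-side workaround is to collapse the L- tags into I- tags before training, which turns the data into plain IOB/BIO without losing any span boundaries (each entity still ends at the last I- tag before an O or a new B-). A sketch, assuming string tags such as "B-PER"/"I-PER"/"L-PER"/"O" in a `ner_tags` column:

def bilo_to_bio(example):
    # Rewrite "L-XXX" as "I-XXX"; boundaries are still marked by "B-" and the following "O"/"B-"
    example["ner_tags"] = [
        "I-" + tag[2:] if tag.startswith("L-") else tag
        for tag in example["ner_tags"]
    ]
    return example

dataset = dataset.map(bilo_to_bio)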

Error loading SpanMarkerTokenizer

Hello Tom.

I have found a problem when loading the tokenizer directly from the SpanMarkerTokenizer class.
I have tested it with several repo ids and the response is the same in all of them.

System Info

Platform: MacOS Sonoma 14.0, M1 Pro
Python 3.11.5

transformers=4.35.0
span_marker=1.5.0
tokenizers=0.14.1

Error Response

tokenizer = SpanMarkerTokenizer.from_pretrained("tomaarsen/span-marker-mbert-base-multinerd")
Downloading (…)okenizer_config.json: 100%|███| 343/343 [00:00<00:00, 834kB/s]
Downloading (…)solve/main/vocab.txt: 100%|█| 996k/996k [00:00<00:00, 3.36MB/s
Downloading (…)/main/tokenizer.json: 100%|█| 2.92M/2.92M [00:00<00:00, 5.46MB
Downloading (…)in/added_tokens.json: 100%|█| 43.0/43.0 [00:00<00:00, 114kB/s]
Downloading (…)cial_tokens_map.json: 100%|███| 125/125 [00:00<00:00, 384kB/s]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/polodealvarado/Desktop/github_projects/SpanMarkerNER/span_marker/tokenizer.py", line 285, in from_pretrained
    return cls(tokenizer, config=config, **kwargs)
  File "/Users/polodealvarado/Desktop/github_projects/SpanMarkerNER/span_marker/tokenizer.py", line 156, in __init__
    self.tokenizer.model_max_length, self.config.model_max_length or self.config.model_max_length_default
AttributeError: 'NoneType' object has no attribute 'model_max_length'

Expected behaviour

The tokenizer loads successfully.

Spacy Integration - "detect an empty sentence"

Using your spaCy integration, the sentencizer (with "en_core_web_sm") will sometimes produce an empty sentence. This leads to the SpanMarkerTokenizer throwing an exception. I'm not sure how active this project is any more, but this seems like an easy fix. Is there a workaround for this already? Would you like the code updated to include one? (I might be able to do this fix.)

ValueError: Failed to concatenate on axis=1 because tables don't have the same number of rows

When I place a single word in the first index of the list, it leads to the above error. However, if I put it in any index other than the first, no error occurs.

from span_marker import SpanMarkerModel
model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-bert-base-fewnerd-fine-super")
entities = model.predict(['Avolon', 'Walmart - Milwaukee, WI'])  # error
entities = model.predict(['Walmart - Milwaukee, WI', 'Avolon'])  # no error
Can you please help me here, @tomaarsen?
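
As a stopgap until the underlying batching issue is fixed, predicting each sentence individually sidesteps the failing concatenation, at the cost of losing batching (a sketch):

sentences = ['Avolon', 'Walmart - Milwaukee, WI']
entities = [model.predict(sentence) for sentence in sentences]  # one call per sentence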

Can this be used with BetterTransformer?

I was hoping this would work for optimizing the underlying model.encoder, since it is (presumably) independent of the rest,
but I'm getting a shape error like:
RuntimeError: shape '[1, 512]' is invalid for input of size 262144. It is trying to reshape the attention_mask to [1, 512], which is odd because the attention_mask input is shaped [512, 512].

Here is the test code:

from span_marker import SpanMarkerModel
from optimum.bettertransformer import BetterTransformer
# Download from the 🤗 Hub
model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-bert-base-fewnerd-fine-super").eval()

# Run inference
entities = model.predict("Amelia Earhart flew her single engine Lockheed Vega 5B across the Atlantic to Paris.")
print(entities) # works
better_encoder = BetterTransformer.transform(model.encoder)
model.encoder=better_encoder 

# Run inference
entities = model.predict("Amelia Earhart flew her single engine Lockheed Vega 5B across the Atlantic to Paris.")
print(entities)
│ /Users/ceyda.1/miniconda/lib/python3.9/site-packages/span_marker/modeling.py:137 in forward      │
│                                                                                                  │
│   134 │   │   │   SpanMarkerOutput: The output dataclass.                                        │
│   135 │   │   """                                                                                │
│   136 │   │   token_type_ids = torch.zeros_like(input_ids)                                       │
│ ❱ 137 │   │   outputs = self.encoder(                                                            │
│   138 │   │   │   input_ids,                                                                     │
│   139 │   │   │   attention_mask=attention_mask,                                                 │
│   140 │   │   │   token_type_ids=token_type_ids,                                                 │
│                                                                                                  │
│ /Users/ceyda.1/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py:1501 in          │
│ _call_impl                                                                                       │
│                                                                                                  │
│   1498 │   │   if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks   │
│   1499 │   │   │   │   or _global_backward_pre_hooks or _global_backward_hooks                   │
│   1500 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1501 │   │   │   return forward_call(*args, **kwargs)                                          │
│   1502 │   │   # Do not call functions when jit is used                                          │
│   1503 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1504 │   │   backward_pre_hooks = []                                                           │
│                                                                                                  │
│ /Users/ceyda.1/miniconda/lib/python3.9/site-packages/transformers/models/bert/modeling_bert.py:1 │
│ 020 in forward                                                                                   │
│                                                                                                  │
│   1017 │   │   │   inputs_embeds=inputs_embeds,                                                  │
│   1018 │   │   │   past_key_values_length=past_key_values_length,                                │
│   1019 │   │   )                                                                                 │
│ ❱ 1020 │   │   encoder_outputs = self.encoder(                                                   │
│   1021 │   │   │   embedding_output,                                                             │
│   1022 │   │   │   attention_mask=extended_attention_mask,                                       │
│   1023 │   │   │   head_mask=head_mask,                                                          │
│                                                                                                  │
│ /Users/ceyda.1/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py:1501 in          │
│ _call_impl                                                                                       │
│                                                                                                  │
│   1498 │   │   if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks   │
│   1499 │   │   │   │   or _global_backward_pre_hooks or _global_backward_hooks                   │
│   1500 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1501 │   │   │   return forward_call(*args, **kwargs)                                          │
│   1502 │   │   # Do not call functions when jit is used                                          │
│   1503 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1504 │   │   backward_pre_hooks = []                                                           │
│                                                                                                  │
│ /Users/ceyda.1/miniconda/lib/python3.9/site-packages/transformers/models/bert/modeling_bert.py:6 │
│ 10 in forward                                                                                    │
│                                                                                                  │
│    607 │   │   │   │   │   encoder_attention_mask,                                               │
│    608 │   │   │   │   )                                                                         │
│    609 │   │   │   else:                                                                         │
│ ❱  610 │   │   │   │   layer_outputs = layer_module(                                             │
│    611 │   │   │   │   │   hidden_states,                                                        │
│    612 │   │   │   │   │   attention_mask,                                                       │
│    613 │   │   │   │   │   layer_head_mask,                                                      │
│                                                                                                  │
│ /Users/ceyda.1/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py:1501 in          │
│ _call_impl                                                                                       │
│                                                                                                  │
│   1498 │   │   if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks   │
│   1499 │   │   │   │   or _global_backward_pre_hooks or _global_backward_hooks                   │
│   1500 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1501 │   │   │   return forward_call(*args, **kwargs)                                          │
│   1502 │   │   # Do not call functions when jit is used                                          │
│   1503 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1504 │   │   backward_pre_hooks = []                                                           │
│                                                                                                  │
│ /Users/ceyda.1/miniconda/lib/python3.9/site-packages/optimum/bettertransformer/models/encoder_mo │
│ dels.py:246 in forward                                                                           │
│                                                                                                  │
│    243 │   │   │   # attention mask comes in with values 0 and -inf. we convert to torch.nn.Tra  │
│    244 │   │   │   # 0->false->keep this token -inf->true->mask this token                       │
│    245 │   │   │   attention_mask = attention_mask.bool()                                        │
│ ❱  246 │   │   │   attention_mask = torch.reshape(attention_mask, (attention_mask.shape[0], att  │
│    247 │   │   │   hidden_states = torch._nested_tensor_from_mask(hidden_states, ~attention_mas  │
│    248 │   │   │   attention_mask = None                                                         │
│    249                                                                                           │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: shape '[1, 512]' is invalid for input of size 262144
  • transformers version: 4.29.2
  • Python version: 3.9.16
  • PyTorch version (GPU?): 2.0.1 (False)
  • optimum: 1.8.6

spaCy integration has no `.pipe()` method, hence falls back to individual `.__call__()`

Not sure what works better during inference (individual sentences, or longer segments in larger batches), but maybe something like this could work:

    def pipe(self, stream, batch_size=128, include_sent=None):
        """
        predict the class for a spacy Doc stream

        Args:
            stream (Doc): a spacy doc

        Returns:
            Doc: spacy doc with spanmarker entities
        """
        if isinstance(stream, str):
            stream = [stream]

        if not isinstance(stream, types.GeneratorType):
            stream = self.nlp.pipe(stream, batch_size=batch_size)

        for docs in util.minibatch(stream, size=batch_size):
            batch_results = self.model.predict(docs)

            for doc, prediction in zip(docs, batch_results):
                yield self.post_process_batch(doc, prediction)
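
For reference, this is roughly how such a component would be exercised from the spaCy side; the model name is the one used elsewhere in this repo, and the batching described in the comment is what a `pipe()` implementation would enable, not what happens today:

import spacy

nlp = spacy.load("en_core_web_sm", exclude=["ner"])
nlp.add_pipe("span_marker", config={"model": "tomaarsen/span-marker-mbert-base-multinerd"})

texts = ["Amelia Earhart flew her Lockheed Vega 5B across the Atlantic to Paris."] * 8
# With a `pipe()` implementation, these documents would be processed in batches
# instead of one `__call__()` per document.
docs = list(nlp.pipe(texts, batch_size=4))
print([(ent.text, ent.label_) for ent in docs[0].ents])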

Confusing error thrown when tokens is empty

When one of the elements in the training set is empty, then it ends up throwing a confusing error:

Label normalizing the train dataset: 100%|██████████████████████████████████████████████████████████████████████| 8324/8324 [00:00<00:00, 34016.14 examples/s]
Tokenizing the train dataset:  96%|██████████████████████████████████████████████████████████████████████████▉   | 8000/8324 [00:04<00:00, 1665.71 examples/s]c:\code\span-marker-ner\span_marker\tokenizer.py:204: RuntimeWarning: All-NaN slice encountered
  num_words = int(np.nanmax(np.array(batch_encoding.word_ids(sample_idx), dtype=float))) + 1
Tokenizing the train dataset:  96%|██████████████████████████████████████████████████████████████████████████▉   | 8000/8324 [00:04<00:00, 1612.60 examples/s] 
This SpanMarker model will ignore 3.181189% of all annotated entities in the train dataset. This is caused by the SpanMarkerModel maximum entity length of 5 words and the maximum model input length of 256 tokens.
These are the frequencies of the missed entities due to maximum entity length out of 18798 total entities:
- 203 missed entities with 6 words (1.079902%)
- 81 missed entities with 7 words (0.430897%)
- 58 missed entities with 8 words (0.308543%)
- 29 missed entities with 9 words (0.154272%)
- 5 missed entities with 10 words (0.026599%)
- 9 missed entities with 11 words (0.047877%)
- 8 missed entities with 12 words (0.042558%)
- 1 missed entities with 13 words (0.005320%)
- 1 missed entities with 14 words (0.005320%)
- 1 missed entities with 15 words (0.005320%)
- 2 missed entities with 16 words (0.010639%)
- 1 missed entities with 17 words (0.005320%)
Additionally, a total of 199 (1.058623%) entities were missed due to the maximum input length.
Traceback (most recent call last):
  File "c:\code\span-marker-ner\demo_conll2002.py", line 83, in <module>
    main()
  File "c:\code\span-marker-ner\demo_conll2002.py", line 72, in main
    trainer.train()
  File "C:\Users\tom\.conda\envs\span-marker-ner\lib\site-packages\transformers\trainer.py", line 1553, in train
    return inner_training_loop(
  File "C:\Users\tom\.conda\envs\span-marker-ner\lib\site-packages\transformers\trainer.py", line 1567, in _inner_training_loop
    train_dataloader = self.get_train_dataloader()
  File "c:\code\span-marker-ner\span_marker\trainer.py", line 423, in get_train_dataloader
    self.train_dataset = self.preprocess_dataset(self.train_dataset, self.label_normalizer, self.tokenizer)
  File "c:\code\span-marker-ner\span_marker\trainer.py", line 241, in preprocess_dataset
    dataset = dataset.map(
  File "C:\Users\tom\.conda\envs\span-marker-ner\lib\site-packages\datasets\arrow_dataset.py", line 592, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "C:\Users\tom\.conda\envs\span-marker-ner\lib\site-packages\datasets\arrow_dataset.py", line 557, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "C:\Users\tom\.conda\envs\span-marker-ner\lib\site-packages\datasets\arrow_dataset.py", line 3097, in map
    for rank, done, content in Dataset._map_single(**dataset_kwargs):
  File "C:\Users\tom\.conda\envs\span-marker-ner\lib\site-packages\datasets\arrow_dataset.py", line 3474, in _map_single
    batch = apply_function_on_filtered_inputs(
  File "C:\Users\tom\.conda\envs\span-marker-ner\lib\site-packages\datasets\arrow_dataset.py", line 3353, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "c:\code\span-marker-ner\span_marker\tokenizer.py", line 204, in __call__
    num_words = int(np.nanmax(np.array(batch_encoding.word_ids(sample_idx), dtype=float))) + 1
ValueError: cannot convert float NaN to integer

Perhaps a cleaner error can be designed here.

  • Tom Aarsen
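
Until a cleaner error is raised, a user-side guard is to drop empty examples before handing the dataset to the Trainer. A sketch, assuming a 🤗 `datasets.Dataset` with a `tokens` column:

# Sentences with no tokens are what trigger the "cannot convert float NaN to integer" failure
train_dataset = train_dataset.filter(lambda example: len(example["tokens"]) > 0)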

Hugging Face Space URL not working for FewNERD fine-tuned model

Hi @tomaarsen!

Description

It seems that the Hugging Face Space linked as 🤗 Space is not working: it throws an HTTP 404 error, and it appears to no longer be in your Hugging Face Hub account.

So I just wanted to point that out in case you want to keep a running example besides the free Inference API; otherwise, I guess you're good removing it!

Pretrained tokenizer

Hey Tom, thank you for this amazing framework!

May I ask why you do not require any pre-tokenization (e.g., subword-tokenization, setting special tokens to -100) based on the chosen model using AutoTokenizer.from_pretrained()?
Might pre-tokenization improve or rather degrade model performance?

Best regards,
Daniel

spaCy_integration `.pipe()` does not behave as expected

I have created a pipeline like so:

self.model = spacy.load("en_core_web_md", disable=[
            "tagger",
            "lemmatizer",
            "attribute_ruler",
            "ner",])
self.model.add_pipe(
    "span_marker",
    config={"model": span_marker_model_path, "batch_size": batch_size},
)

I call pipe() on a stream of documents:

for name, proc in self.model.pipeline:
    stream2 = proc.pipe(stream2)

The SpanMarker model in this pipeline performs inference on each doc in the stream as if it were a single sentence.

    def pipe(self, stream, batch_size=128):
        """Fill `doc.ents` and `span.label_` using the chosen SpanMarker model."""
        if isinstance(stream, str):
            stream = [stream]

        if not isinstance(stream, types.GeneratorType):
            stream = self.nlp.pipe(stream, batch_size=batch_size)

        for docs in minibatch(stream, size=batch_size):
            inputs = [[token.text if not token.is_space else "" for token in doc] for doc in docs]

            # use document-level context in the inference if the model was also trained that way
            if self.model.config.trained_with_document_context:
                inputs = self.convert_inputs_to_dataset(inputs)

            entities_list = self.model.predict(inputs, batch_size=self.batch_size)
            for doc, entities in zip(docs, entities_list):
                ents = []
                for entity in entities:
                    start = entity["word_start_index"]
                    end = entity["word_end_index"]
                    span = doc[start:end]
                    span.label_ = entity["label"]
                    ents.append(span)

                self.set_ents(doc, ents)

                yield doc

So it reaches max sequence length pretty quickly and only annotates the first part of each document.

This is different from the behaviour I expected, where `__call__()` breaks the doc down into sentences and infers each sentence individually.

Unexpectedly (bad) predictions?

I'm just trying out the pretrained models accompanying this repo via HF Spaces and I'm seeing some weird results.

  • For some models there's a huge difference in quality when I include/exclude a period. E.g. the difference between the two sentences "This is James" vs. "This is James." (with a trailing period) sometimes causes models to fail to recognise "James" as a person.
  • The multilingual model was not able to recognise major cities in a straightforward Dutch sentence ("Ik woon in Leuven." - also tried it with "Amsterdam", and "Parijs").

Am I using the pretrained models wrong? Are they expecting different kinds of inputs?

I can elaborate and test more, but I figured I'd post this first. 😊

Thanks!

Integrate Entity Ruler with Span Marker model

Hi. I would like to combine the SpanMarker model via the spaCy integration (https://spacy.io/universe/project/span_marker) with the entity ruler (to add some custom patterns), similar to the code below.
import spacy
import de_dep_news_trf

nlp = de_dep_news_trf.load()

patterns = [
    {"label": "PERIOD", "pattern": [{"LOWER": "monat"}]},
    {"label": "PER", "pattern": "Raluca"},
    {"label": "COLOR", "pattern": [{"LOWER": "blau"}]},
    {"label": "JOBTITLE", "pattern": [{"LOWER": {"REGEX": ".*(referent).*"}}]},
]

ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns(patterns)

span_marker_ruler = nlp.add_pipe("span_marker", config={"model": "tomaarsen/span-marker-mbert-base-multinerd"}, name="span_marker_ruler")

doc = nlp("Raluca ist referent, er mag die Farbe Blau und geht jeden Monat in die Berge.")

print([(ent.text, ent.label_) for ent in doc.ents])

Unfortunately, this mix doesn't work properly.

I would, however, expect an output like this:

[('Raluca', 'PER'), ('referent', 'JOBTITLE'), ('Blau', 'COLOR'), ('Monat', 'PERIOD')]

How can I integrate the entity ruler (adding some custom entities and patterns) with the SpanMarker model?

Do you have any ideas on how to solve this kind of issue?
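
One possibility (see also the next issue) is that the `span_marker` component simply overwrites the entities the ruler already set. A workaround worth experimenting with, not a confirmed fix, is to add the ruler after `span_marker` and let it overwrite overlapping entities:

import de_dep_news_trf

nlp = de_dep_news_trf.load()

patterns = [{"label": "PER", "pattern": "Raluca"}]  # same kind of patterns as above

# Add the SpanMarker component first ...
nlp.add_pipe("span_marker", config={"model": "tomaarsen/span-marker-mbert-base-multinerd"})

# ... then the ruler, configured to overwrite entities it has patterns for
ruler = nlp.add_pipe("entity_ruler", config={"overwrite_ents": True})
ruler.add_patterns(patterns)

doc = nlp("Raluca ist referent, er mag die Farbe Blau und geht jeden Monat in die Berge.")
print([(ent.text, ent.label_) for ent in doc.ents])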

spaCy integration ignores old entities

I think it might not be best practice to completely overwrite the previously obtained entities; maybe something like the code below would work better.

from spacy.util import filter_spans

doc.set_ents(filter_spans(list(doc.ents) + new_ents))

BERT-based models crash

Hi there. Thanks for the great library!

I have one issue regarding the usage of BERT-based models. I trained different models by fine-tuning them on my custom dataset (RoBERTa, LUKE, DeBERTa, XLM-RoBERTa, etc.).

I tried to do the same with BERT-based models using the same code, but I get an error (this also happens with your code from the getting started part of the documentation).

I am using a dataset with this format:
{"tokens": ["(7)", "On", "specific", "query", "by", "the", "Bench", "about", "an", "entry", "of", "Rs.", "1,31,37,500", "on", "deposit", "side", "of", "Hongkong", "Bank", "account", "of", "which", "a", "photo", "copy", "is", "appearing", "at", "p.", "40", "of", "assessee's", "paper", "book,", "learned", "authorised", "representative", "submitted", "that", "it", "was", "related", "to", "loan", "from", "broker,", "Rahul", "&", "Co.", "on", "the", "basis", "of", "his", "submission", "a", "necessary", "mark", "is", "put", "by", "us", "on", "that", "photo", "copy."], "ner_tags": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 21, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 21, 21, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}

And I load it with this script:

import json

from datasets import load_dataset, Dataset, DatasetDict
def load_legal_ner():
    ret = {}
    for split_name in ['TRAIN', 'DEV']:
        data = []
        with open(f"./data/NER_{split_name}/NER_{split_name}_ALL_OT.jsonl", 'r') as reader:
            for line in reader:
                data.append(json.loads(line))
        ret[split_name.lower()] = Dataset.from_list(data)
    return DatasetDict(ret)

Every other model works perfectly. But if I try to use a BERT-based model (e.g. bert-base-uncased, bert-base-cased, legal-bert, etc.), it crashes with different errors, always linked to the forward method (sometimes related to the normalization layer, sometimes to a matmul).

This is the traceback:

Cell In[8], line 28
     20 trainer = Trainer(
     21     model=model,
     22     args=args,
     23     train_dataset=dataset["train"],
     24     eval_dataset=dataset["dev"],
     25 )
     27 # Training is really simple using our Trainer!
---> 28 trainer.train()

File /opt/conda/lib/python3.10/site-packages/transformers/trainer.py:1537, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1535         hf_hub_utils.enable_progress_bars()
   1536 else:
-> 1537     return inner_training_loop(
   1538         args=args,
   1539         resume_from_checkpoint=resume_from_checkpoint,
   1540         trial=trial,
   1541         ignore_keys_for_eval=ignore_keys_for_eval,
   1542     )

File /opt/conda/lib/python3.10/site-packages/transformers/trainer.py:1854, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
   1851     self.control = self.callback_handler.on_step_begin(args, self.state, self.control)
   1853 with self.accelerator.accumulate(model):
-> 1854     tr_loss_step = self.training_step(model, inputs)
   1856 if (
   1857     args.logging_nan_inf_filter
   1858     and not is_torch_tpu_available()
   1859     and (torch.isnan(tr_loss_step) or torch.isinf(tr_loss_step))
   1860 ):
   1861     # if loss is nan or inf simply add the average of previous logged losses
   1862     tr_loss += tr_loss / (1 + self.state.global_step - self._globalstep_last_logged)

File /opt/conda/lib/python3.10/site-packages/transformers/trainer.py:2723, in Trainer.training_step(self, model, inputs)
   2720     return loss_mb.reduce_mean().detach().to(self.args.device)
   2722 with self.compute_loss_context_manager():
-> 2723     loss = self.compute_loss(model, inputs)
   2725 if self.args.n_gpu > 1:
   2726     loss = loss.mean()  # mean() to average on multi-gpu parallel training

File /opt/conda/lib/python3.10/site-packages/transformers/trainer.py:2746, in Trainer.compute_loss(self, model, inputs, return_outputs)
   2744 else:
   2745     labels = None
-> 2746 outputs = model(**inputs)
   2747 # Save past state if it exists
   2748 # TODO: this needs to be fixed and made cleaner later.
   2749 if self.args.past_index >= 0:

File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File /opt/conda/lib/python3.10/site-packages/span_marker/modeling.py:153, in SpanMarkerModel.forward(self, input_ids, attention_mask, position_ids, start_marker_indices, num_marker_pairs, labels, num_words, document_ids, sentence_ids, **kwargs)
    136 """Forward call of the SpanMarkerModel.
    137 
    138 Args:
   (...)
    150     SpanMarkerOutput: The output dataclass.
    151 """
    152 token_type_ids = torch.zeros_like(input_ids)
--> 153 outputs = self.encoder(
    154     input_ids,
    155     attention_mask=attention_mask,
    156     token_type_ids=token_type_ids,
    157     position_ids=position_ids,
    158 )
    159 last_hidden_state = outputs[0]
    160 last_hidden_state = self.dropout(last_hidden_state)

File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File /opt/conda/lib/python3.10/site-packages/transformers/models/bert/modeling_bert.py:1013, in BertModel.forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, encoder_hidden_states, encoder_attention_mask, past_key_values, use_cache, output_attentions, output_hidden_states, return_dict)
   1004 head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)
   1006 embedding_output = self.embeddings(
   1007     input_ids=input_ids,
   1008     position_ids=position_ids,
   (...)
   1011     past_key_values_length=past_key_values_length,
   1012 )
-> 1013 encoder_outputs = self.encoder(
   1014     embedding_output,
   1015     attention_mask=extended_attention_mask,
   1016     head_mask=head_mask,
   1017     encoder_hidden_states=encoder_hidden_states,
   1018     encoder_attention_mask=encoder_extended_attention_mask,
   1019     past_key_values=past_key_values,
   1020     use_cache=use_cache,
   1021     output_attentions=output_attentions,
   1022     output_hidden_states=output_hidden_states,
   1023     return_dict=return_dict,
   1024 )
   1025 sequence_output = encoder_outputs[0]
   1026 pooled_output = self.pooler(sequence_output) if self.pooler is not None else None

File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File /opt/conda/lib/python3.10/site-packages/transformers/models/bert/modeling_bert.py:607, in BertEncoder.forward(self, hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, past_key_values, use_cache, output_attentions, output_hidden_states, return_dict)
    596     layer_outputs = self._gradient_checkpointing_func(
    597         layer_module.__call__,
    598         hidden_states,
   (...)
    604         output_attentions,
    605     )
    606 else:
--> 607     layer_outputs = layer_module(
    608         hidden_states,
    609         attention_mask,
    610         layer_head_mask,
    611         encoder_hidden_states,
    612         encoder_attention_mask,
    613         past_key_value,
    614         output_attentions,
    615     )
    617 hidden_states = layer_outputs[0]
    618 if use_cache:

File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File /opt/conda/lib/python3.10/site-packages/transformers/models/bert/modeling_bert.py:497, in BertLayer.forward(self, hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, past_key_value, output_attentions)
    485 def forward(
    486     self,
    487     hidden_states: torch.Tensor,
   (...)
    494 ) -> Tuple[torch.Tensor]:
    495     # decoder uni-directional self-attention cached key/values tuple is at positions 1,2
    496     self_attn_past_key_value = past_key_value[:2] if past_key_value is not None else None
--> 497     self_attention_outputs = self.attention(
    498         hidden_states,
    499         attention_mask,
    500         head_mask,
    501         output_attentions=output_attentions,
    502         past_key_value=self_attn_past_key_value,
    503     )
    504     attention_output = self_attention_outputs[0]
    506     # if decoder, the last output is tuple of self-attn cache

File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File /opt/conda/lib/python3.10/site-packages/transformers/models/bert/modeling_bert.py:436, in BertAttention.forward(self, hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, past_key_value, output_attentions)
    417 def forward(
    418     self,
    419     hidden_states: torch.Tensor,
   (...)
    425     output_attentions: Optional[bool] = False,
    426 ) -> Tuple[torch.Tensor]:
    427     self_outputs = self.self(
    428         hidden_states,
    429         attention_mask,
   (...)
    434         output_attentions,
    435     )
--> 436     attention_output = self.output(self_outputs[0], hidden_states)
    437     outputs = (attention_output,) + self_outputs[1:]  # add attentions if we output them
    438     return outputs

File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File /opt/conda/lib/python3.10/site-packages/transformers/models/bert/modeling_bert.py:386, in BertSelfOutput.forward(self, hidden_states, input_tensor)
    385 def forward(self, hidden_states: torch.Tensor, input_tensor: torch.Tensor) -> torch.Tensor:
--> 386     hidden_states = self.dense(hidden_states)
    387     hidden_states = self.dropout(hidden_states)
    388     hidden_states = self.LayerNorm(hidden_states + input_tensor)

File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/linear.py:114, in Linear.forward(self, input)
    113 def forward(self, input: Tensor) -> Tensor:
--> 114     return F.linear(input, self.weight, self.bias)

RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasLtMatmul with transpose_mat1 1 transpose_mat2 0 m 768 n 3072 k 768 mat1_ld 768 mat2_ld 768 result_ld 768 abcType 0 computeType 68 scaleType 0

Here is another traceback (same code):

RuntimeError                              Traceback (most recent call last)
Cell In[16], line 148
    139 trainer = Trainer(
    140     model=model,
    141     args=args,
   (...)
    144     compute_metrics=compute_f1
    145 )
    147 # Training is really simple using our Trainer!
--> 148 trainer.train()
    150 # ... and so is evaluating!
    151 metrics = trainer.evaluate()

File /opt/conda/lib/python3.10/site-packages/transformers/trainer.py:1537, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1535         hf_hub_utils.enable_progress_bars()
   1536 else:
-> 1537     return inner_training_loop(
   1538         args=args,
   1539         resume_from_checkpoint=resume_from_checkpoint,
   1540         trial=trial,
   1541         ignore_keys_for_eval=ignore_keys_for_eval,
   1542     )

File /opt/conda/lib/python3.10/site-packages/transformers/trainer.py:1854, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
   1851     self.control = self.callback_handler.on_step_begin(args, self.state, self.control)
   1853 with self.accelerator.accumulate(model):
-> 1854     tr_loss_step = self.training_step(model, inputs)
   1856 if (
   1857     args.logging_nan_inf_filter
   1858     and not is_torch_tpu_available()
   1859     and (torch.isnan(tr_loss_step) or torch.isinf(tr_loss_step))
   1860 ):
   1861     # if loss is nan or inf simply add the average of previous logged losses
   1862     tr_loss += tr_loss / (1 + self.state.global_step - self._globalstep_last_logged)

File /opt/conda/lib/python3.10/site-packages/transformers/trainer.py:2723, in Trainer.training_step(self, model, inputs)
   2720     return loss_mb.reduce_mean().detach().to(self.args.device)
   2722 with self.compute_loss_context_manager():
-> 2723     loss = self.compute_loss(model, inputs)
   2725 if self.args.n_gpu > 1:
   2726     loss = loss.mean()  # mean() to average on multi-gpu parallel training

File /opt/conda/lib/python3.10/site-packages/transformers/trainer.py:2746, in Trainer.compute_loss(self, model, inputs, return_outputs)
   2744 else:
   2745     labels = None
-> 2746 outputs = model(**inputs)
   2747 # Save past state if it exists
   2748 # TODO: this needs to be fixed and made cleaner later.
   2749 if self.args.past_index >= 0:

File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File /opt/conda/lib/python3.10/site-packages/span_marker/modeling.py:153, in SpanMarkerModel.forward(self, input_ids, attention_mask, position_ids, start_marker_indices, num_marker_pairs, labels, num_words, document_ids, sentence_ids, **kwargs)
    136 """Forward call of the SpanMarkerModel.
    137 
    138 Args:
   (...)
    150     SpanMarkerOutput: The output dataclass.
    151 """
    152 token_type_ids = torch.zeros_like(input_ids)
--> 153 outputs = self.encoder(
    154     input_ids,
    155     attention_mask=attention_mask,
    156     token_type_ids=token_type_ids,
    157     position_ids=position_ids,
    158 )
    159 last_hidden_state = outputs[0]
    160 last_hidden_state = self.dropout(last_hidden_state)

File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File /opt/conda/lib/python3.10/site-packages/transformers/models/bert/modeling_bert.py:1013, in BertModel.forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, encoder_hidden_states, encoder_attention_mask, past_key_values, use_cache, output_attentions, output_hidden_states, return_dict)
   1004 head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)
   1006 embedding_output = self.embeddings(
   1007     input_ids=input_ids,
   1008     position_ids=position_ids,
   (...)
   1011     past_key_values_length=past_key_values_length,
   1012 )
-> 1013 encoder_outputs = self.encoder(
   1014     embedding_output,
   1015     attention_mask=extended_attention_mask,
   1016     head_mask=head_mask,
   1017     encoder_hidden_states=encoder_hidden_states,
   1018     encoder_attention_mask=encoder_extended_attention_mask,
   1019     past_key_values=past_key_values,
   1020     use_cache=use_cache,
   1021     output_attentions=output_attentions,
   1022     output_hidden_states=output_hidden_states,
   1023     return_dict=return_dict,
   1024 )
   1025 sequence_output = encoder_outputs[0]
   1026 pooled_output = self.pooler(sequence_output) if self.pooler is not None else None

File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File /opt/conda/lib/python3.10/site-packages/transformers/models/bert/modeling_bert.py:607, in BertEncoder.forward(self, hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, past_key_values, use_cache, output_attentions, output_hidden_states, return_dict)
    596     layer_outputs = self._gradient_checkpointing_func(
    597         layer_module.__call__,
    598         hidden_states,
   (...)
    604         output_attentions,
    605     )
    606 else:
--> 607     layer_outputs = layer_module(
    608         hidden_states,
    609         attention_mask,
    610         layer_head_mask,
    611         encoder_hidden_states,
    612         encoder_attention_mask,
    613         past_key_value,
    614         output_attentions,
    615     )
    617 hidden_states = layer_outputs[0]
    618 if use_cache:

File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File /opt/conda/lib/python3.10/site-packages/transformers/models/bert/modeling_bert.py:539, in BertLayer.forward(self, hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, past_key_value, output_attentions)
    536     cross_attn_present_key_value = cross_attention_outputs[-1]
    537     present_key_value = present_key_value + cross_attn_present_key_value
--> 539 layer_output = apply_chunking_to_forward(
    540     self.feed_forward_chunk, self.chunk_size_feed_forward, self.seq_len_dim, attention_output
    541 )
    542 outputs = (layer_output,) + outputs
    544 # if decoder, return the attn key/values as the last output

File /opt/conda/lib/python3.10/site-packages/transformers/pytorch_utils.py:242, in apply_chunking_to_forward(forward_fn, chunk_size, chunk_dim, *input_tensors)
    239     # concatenate output at same dimension
    240     return torch.cat(output_chunks, dim=chunk_dim)
--> 242 return forward_fn(*input_tensors)

File /opt/conda/lib/python3.10/site-packages/transformers/models/bert/modeling_bert.py:552, in BertLayer.feed_forward_chunk(self, attention_output)
    550 def feed_forward_chunk(self, attention_output):
    551     intermediate_output = self.intermediate(attention_output)
--> 552     layer_output = self.output(intermediate_output, attention_output)
    553     return layer_output

File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File /opt/conda/lib/python3.10/site-packages/transformers/models/bert/modeling_bert.py:466, in BertOutput.forward(self, hidden_states, input_tensor)
    464 hidden_states = self.dense(hidden_states)
    465 hidden_states = self.dropout(hidden_states)
--> 466 hidden_states = self.LayerNorm(hidden_states + input_tensor)
    467 return hidden_states

File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/normalization.py:190, in LayerNorm.forward(self, input)
    189 def forward(self, input: Tensor) -> Tensor:
--> 190     return F.layer_norm(
    191         input, self.normalized_shape, self.weight, self.bias, self.eps)

File /opt/conda/lib/python3.10/site-packages/torch/nn/functional.py:2515, in layer_norm(input, normalized_shape, weight, bias, eps)
   2511 if has_torch_function_variadic(input, weight, bias):
   2512     return handle_torch_function(
   2513         layer_norm, (input, weight, bias), input, normalized_shape, weight=weight, bias=bias, eps=eps
   2514     )
-> 2515 return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions

Sometimes I also get this one:

/usr/local/src/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [313,0,0], thread: [0,0,0] Assertion `srcIndex < srcSelectDimSize` failed.

I know that's probably not much to work on. Let me know if you have any advice for me.


transformers==4.36.0
span-marker==1.5.0
torch==2.0.0
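
One thing I still want to try, following the hint in the error message, is forcing synchronous kernel launches so the stack trace points at the actual failing op; the indexSelectLargeIndex assertion usually means some index (for example a label id or input id) is out of range for the tensor it indexes into. A minimal sketch:

import os

# Must be set before torch initialises CUDA, so kernels launch synchronously
# and the reported stack trace is accurate (as the error message suggests).
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # noqa: E402  (imported after setting the environment variable)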

num_proc not specified in .map functions

In trainer.py, there are three .map calls where num_proc is not specified.
It should be possible to set this, since multiprocessing speeds up the tokenization, sample spreading, and label normalization steps by a significant amount.

Example from trainer.py, line 227:

        with tokenizer.entity_tracker(split=dataset_name):
            dataset = dataset.map(
                tokenizer,
                batched=True,
                remove_columns=set(dataset.column_names) - set(self.OPTIONAL_COLUMNS),
                desc=f"Tokenizing the {dataset_name} dataset",
                fn_kwargs={"return_num_words": is_evaluate},
                num_proc=4, # Added this - should be specifiable
            )

This sped up tokenization by about 4 times.
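
For reference, this is just the num_proc argument of datasets.Dataset.map, which shards the work across processes; a standalone sketch with toy data and a stand-in tokenize function (not the SpanMarker tokenizer itself):

from datasets import Dataset

data = Dataset.from_dict({"tokens": [["Hello", "world"]] * 10_000})

def fake_tokenize(batch):
    # Stand-in for the real tokenizer call; only here to illustrate num_proc.
    return {"num_tokens": [len(tokens) for tokens in batch["tokens"]]}

# num_proc=4 splits the dataset into 4 shards processed in parallel.
processed = data.map(fake_tokenize, batched=True, num_proc=4)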

predict should return the same number of lists as inputs

When multiple identical inputs are passed, predict should return that many lists; instead, it returns just one.

from span_marker import SpanMarkerModel

inputs = ["Unknown", "Unknown", "Unknown"]
model = SpanMarkerModel.from_pretrained("model_name")  # placeholder model name
model.predict(inputs)

Output:
[]

Expected Output:
[[], [], []]
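
As a stop-gap (not a fix for the underlying behaviour), calling predict once per input keeps one output list per input:

from span_marker import SpanMarkerModel

model = SpanMarkerModel.from_pretrained("model_name")  # placeholder name as above
inputs = ["Unknown", "Unknown", "Unknown"]
predictions = [model.predict(text) for text in inputs]
# predictions == [[], [], []]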

Issue with adding span_marker pipe within spaCy

ValueError Traceback (most recent call last)
in <cell line: 6>()
4 # Load the spaCy model
5 nlp = spacy.load("en_core_web_sm")
----> 6 nlp.add_pipe("span_marker", config={"model": "tomaarsen/span-marker-roberta-large-ontonotes5"})
7
8 # Feed some text through the model to get a spacy Doc

1 frames
/usr/local/lib/python3.10/dist-packages/spacy/language.py in create_pipe(self, factory_name, name, config, raw_config, validate)
658 lang_code=self.lang,
659 )
--> 660 raise ValueError(err)
661 pipe_meta = self.get_factory_meta(factory_name)
662 # This is unideal, but the alternative would mean you always need to

ValueError: [E002] Can't find factory for 'span_marker' for language English (en). This usually happens when spaCy calls nlp.create_pipe with a custom component name that's not registered on the current language class. If you're using a Transformer, make sure to install 'spacy-transformers'. If you're using a custom component, make sure you've added the decorator @Language.component (for function components) or @Language.factory (for class components).

Available factories: attribute_ruler, tok2vec, merge_noun_chunks, merge_entities, merge_subtokens, token_splitter, doc_cleaner, parser, beam_parser, lemmatizer, trainable_lemmatizer, entity_linker, ner, beam_ner, entity_ruler, tagger, morphologizer, senter, sentencizer, textcat, spancat, spancat_singlelabel, future_entity_ruler, span_ruler, textcat_multilabel, en.lemmatizer

Package versions:
spacy == 3.5.3 (3.5.2 did not work either)
span_marker == 1.2.1
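
One thing worth checking (it fixed the same error for me, though I can't be sure it is the cause here): span_marker has to be imported in the session, because the "span_marker" factory is registered when the package is imported. A minimal sketch:

import spacy
import span_marker  # noqa: F401  registers the "span_marker" factory with spaCy

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("span_marker", config={"model": "tomaarsen/span-marker-roberta-large-ontonotes5"})

doc = nlp("Amelia Earhart flew her single engine Lockheed Vega 5B across the Atlantic to Paris .")
print([(ent.text, ent.label_) for ent in doc.ents])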

Entity type

Hi,

Appreciate the amazing work. Is there a list of the entity types that can be detected with the pretrained model?

Thanks.
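
In case it helps, the label set ships with the model configuration, so it can be inspected directly; a minimal sketch, assuming a standard SpanMarker checkpoint (the OntoNotes model here is just an example):

from span_marker import SpanMarkerModel

model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-roberta-large-ontonotes5")
print(model.config.id2label)  # mapping from label id to entity type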

`Trainer.preprocess_dataset` - `KeyError` with `datasets<2.6.0`

I ran the README snippet in a Kaggle notebook (Python 3.7) after running %pip install datasets span_marker transformers. I get a KeyError from trainer.train() / Trainer.preprocess_dataset (see below).

I was able to fix this by upgrading to datasets>=2.6.0. It seems that with the default Kaggle setup datasets==2.1.0 was installed when I ran %pip install datasets span_marker transformers.

I can't see an obvious reason for the change in datasets behaviour in the 2.6.0 release notes, so maybe I'm on the wrong track. I thought you might like to know, in case there is a reason and you would like to add a constraint on datasets as a dependency.

Thanks for the package!

Traceback:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/tmp/ipykernel_103/2689077115.py in <module>
     30 )
     31 
---> 32 trainer.train()
     33 trainer.save_model("my_span_marker_model/checkpoint-final")
     34 

/opt/conda/lib/python3.7/site-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1635             resume_from_checkpoint=resume_from_checkpoint,
   1636             trial=trial,
-> 1637             ignore_keys_for_eval=ignore_keys_for_eval,
   1638         )
   1639 

/opt/conda/lib/python3.7/site-packages/transformers/trainer.py in _inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
   1643         self._train_batch_size = batch_size
   1644         # Data loader and number of training steps
-> 1645         train_dataloader = self.get_train_dataloader()
   1646 
   1647         # Setting up training control variables:

/opt/conda/lib/python3.7/site-packages/span_marker/trainer.py in get_train_dataloader(self)
    176     def get_train_dataloader(self) -> DataLoader:
    177         """Return the preprocessed training DataLoader."""
--> 178         self.train_dataset = self.preprocess_dataset(self.train_dataset, self.label_normalizer, self.tokenizer)
    179         return super().get_train_dataloader()
    180 

/opt/conda/lib/python3.7/site-packages/span_marker/trainer.py in preprocess_dataset(self, dataset, label_normalizer, tokenizer, dataset_name, is_evaluate)
    170             batched=True,
    171             remove_columns=dataset.column_names,
--> 172             desc=f"Tokenizing the {dataset_name} dataset",
    173         )
    174         return dataset

/opt/conda/lib/python3.7/site-packages/datasets/arrow_dataset.py in map(self, function, with_indices, with_rank, input_columns, batched, batch_size, drop_last_batch, remove_columns, keep_in_memory, load_from_cache_file, cache_file_name, writer_batch_size, features, disable_nullable, fn_kwargs, num_proc, suffix_template, new_fingerprint, desc)
   1971                 new_fingerprint=new_fingerprint,
   1972                 disable_tqdm=disable_tqdm,
-> 1973                 desc=desc,
   1974             )
   1975         else:

/opt/conda/lib/python3.7/site-packages/datasets/arrow_dataset.py in wrapper(*args, **kwargs)
    518             self: "Dataset" = kwargs.pop("self")
    519         # apply actual function
--> 520         out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
    521         datasets: List["Dataset"] = list(out.values()) if isinstance(out, dict) else [out]
    522         for dataset in datasets:

/opt/conda/lib/python3.7/site-packages/datasets/arrow_dataset.py in wrapper(*args, **kwargs)
    485         }
    486         # apply actual function
--> 487         out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
    488         datasets: List["Dataset"] = list(out.values()) if isinstance(out, dict) else [out]
    489         # re-apply format to the output

/opt/conda/lib/python3.7/site-packages/datasets/fingerprint.py in wrapper(*args, **kwargs)
    456             # Call actual function
    457 
--> 458             out = func(self, *args, **kwargs)
    459 
    460             # Update fingerprint of in-place transforms + update in-place history of transforms

/opt/conda/lib/python3.7/site-packages/datasets/arrow_dataset.py in _map_single(self, function, with_indices, with_rank, input_columns, batched, batch_size, drop_last_batch, remove_columns, keep_in_memory, load_from_cache_file, cache_file_name, writer_batch_size, features, disable_nullable, fn_kwargs, new_fingerprint, rank, offset, disable_tqdm, desc, cache_only)
   2341                                 indices,
   2342                                 check_same_num_examples=len(input_dataset.list_indexes()) > 0,
-> 2343                                 offset=offset,
   2344                             )
   2345                         except NumExamplesMismatchError:

/opt/conda/lib/python3.7/site-packages/datasets/arrow_dataset.py in apply_function_on_filtered_inputs(inputs, indices, check_same_num_examples, offset)
   2218             if with_rank:
   2219                 additional_args += (rank,)
-> 2220             processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
   2221             if update_data is None:
   2222                 # Check if the function returns updated examples

/opt/conda/lib/python3.7/site-packages/datasets/arrow_dataset.py in decorated(item, *args, **kwargs)
   1913                 )
   1914                 # Use the LazyDict internally, while mapping the function
-> 1915                 result = f(decorated_item, *args, **kwargs)
   1916                 # Return a standard dict
   1917                 return result.data if isinstance(result, LazyDict) else result

/opt/conda/lib/python3.7/site-packages/span_marker/trainer.py in <lambda>(batch)
    167         # Tokenize and add start/end markers
    168         dataset = dataset.map(
--> 169             lambda batch: tokenizer(batch["tokens"], labels=batch["ner_tags"], return_num_words=is_evaluate),
    170             batched=True,
    171             remove_columns=dataset.column_names,

/opt/conda/lib/python3.7/site-packages/datasets/arrow_dataset.py in __getitem__(self, key)
    123 class Batch(LazyDict):
    124     def __getitem__(self, key):
--> 125         values = super().__getitem__(key)
    126         if self.features and key in self.features:
    127             values = [

/opt/conda/lib/python3.7/collections/__init__.py in __getitem__(self, key)
   1025         if hasattr(self.__class__, "__missing__"):
   1026             return self.__class__.__missing__(self, key)
-> 1027         raise KeyError(key)
   1028     def __setitem__(self, key, item): self.data[key] = item
   1029     def __delitem__(self, key): del self.data[key]

KeyError: 'tokens'
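
In case anyone wants a guard against hitting the traceback above, a quick check I now run before training (assuming packaging is available, which it is as a transformers dependency):

import datasets
from packaging.version import Version

# The README snippet only worked for me once datasets was at least 2.6.0.
assert Version(datasets.__version__) >= Version("2.6.0"), datasets.__version__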

SpanMarker with ONNX models

Hi @tomaarsen! Is there an ONNX exporter planned? Have you tried using SpanMarker with ONNX models for inference?
Would be really curious if you experimented with that already! :-)

Note: (XLM-)RoBERTa-based SpanMarker models require text preprocessing

Hello!

This is a heads up that (XLM-)RoBERTa-based SpanMarker models require text to be preprocessed to separate punctuation from words:

# ✅
model.predict("He plays J. Robert Oppenheimer , an American theoretical physicist .")
# ❌
model.predict("He plays J. Robert Oppenheimer, an American theoretical physicist.")

# You can also supply a list of words directly: ✅
model.predict(["He", "plays", "J.", "Robert", "Oppenheimer", ",", "an", "American", "theoretical", "physicist", "."])

This is a consequence of the RoBERTa tokenizer distinguishing between a comma attached to the preceding word and a comma preceded by a space as different tokens; the SpanMarker model is only familiar with the space-separated variant.

Another alternative is to use the spaCy integration, which preprocesses the text into words for you!
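
If you only need the word splitting rather than the full pipeline component, a blank spaCy pipeline (tokenizer only) is enough; a minimal sketch:

import spacy
from span_marker import SpanMarkerModel

nlp = spacy.blank("en")  # tokenizer only, no model download needed
model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-roberta-large-ontonotes5")

text = "He plays J. Robert Oppenheimer, an American theoretical physicist."
words = [token.text for token in nlp(text)]
# roughly: ['He', 'plays', 'J.', 'Robert', 'Oppenheimer', ',', 'an', 'American', 'theoretical', 'physicist', '.']
entities = model.predict(words)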

The (m)BERT-based SpanMarker models do not require this preprocessing.

  • Tom Aarsen

inference time cpu vs gpu

I have used gte-tiny as the encoder for my custom NER model and need to speed up inference.
Below are the stats for different batch sizes.

Batch Size    Avg. Inference Time (ms) - GPU    Avg. Inference Time (ms) - CPU
16            0.14945                           1.23388
32            0.28                              3.24456
64            0.51582                           6.57234
128           1.10669                           13.73319
256           2.24729                           28.236

Is there any specific method to speed this up? @tomaarsen
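
For reference, a sketch of the kind of setup I am benchmarking ("my-custom-ner-model" stands in for the real checkpoint, and batch_size is assumed to be an argument of predict, as in recent span_marker versions):

import torch
from span_marker import SpanMarkerModel

model = SpanMarkerModel.from_pretrained("my-custom-ner-model")  # placeholder for the gte-tiny based model
model = model.to("cuda" if torch.cuda.is_available() else "cpu")
model.eval()

sentences = ["Amelia Earhart flew her single engine Lockheed Vega 5B across the Atlantic to Paris ."] * 256

with torch.inference_mode():  # avoids autograd bookkeeping during inference
    entities = model.predict(sentences, batch_size=64)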
