mantisai / nervaluate
Full named-entity (i.e., not tag/token) evaluation metrics based on SemEval’13
License: MIT License
With the move from Travis to GitHub Actions, code coverage is broken.
Hi all!
I tried applying the Evaluator to my nested list entities, but it only returns "results" and "results_by_tag".
evaluator = Evaluator(true, predicted, tags=['FAUNA', 'FLORA'], loader="list")
results, results_by_tag, result_indices, result_indices_by_tag = evaluator.evaluate()
Error:
ValueError: not enough values to unpack (expected 4, got 2)
Any idea how I can fix this? Or am I missing something? :)
Thank you!
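A possible workaround, assuming the installed release only returns the two aggregate dictionaries mentioned in the error, is to unpack just those (the per-instance indices would then not be available):
evaluator = Evaluator(true, predicted, tags=['FAUNA', 'FLORA'], loader="list")

# Unpack only the two values this release returns; the per-instance
# indices are not exposed here.
results, results_by_tag = evaluator.evaluate()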
It would be great for the tool to be able to accept spaCy data formats.
Needs to include:
@davidsbatista I see you changed the original code that this is based on to an MIT license. I propose doing the same here. Is that OK with everyone who has contributed?
@davidsbatista @LizGil @aflueckiger @pimmeerdink
cc @nsorros
This is following on from our discussion in #48.
First off, I think the default CLI should be a very simple wrapper around the Evaluator class, so it doesn't need to be much more complicated than (note I didn't test any of this):
./nervaluate/cli.py:
import typer

from nervaluate import Evaluator

app = typer.Typer()


@app.command()
def evaluate(
    true_path: str = typer.Argument(..., help="Path to true entity labels"),
    pred_path: str = typer.Argument(..., help="Path to predicted entity labels"),
    tags: str = typer.Argument(
        None, help="Comma separated list of tags to include in the evaluation"
    ),
    loader: str = typer.Option(
        None,
        help="Optional loader when not using prodigy style spans. One of [list, conll]",
    ),
    by_tag: bool = typer.Option(
        False,
        help="If set, will return tag level results instead of aggregated results.",
    ),
    pretty: bool = typer.Option(
        False,
        help="If set, will print the results in a pretty format instead of returning the raw json",
    ),
):
    tags_list = tags.split(",") if tags else None
    evaluator = Evaluator(true_path, pred_path, tags=tags_list, loader=loader)
    results, results_by_tag = evaluator.evaluate()

    if by_tag:
        output = results_by_tag
    else:
        output = results

    if pretty:
        # Some code from wasabi to print a pretty table https://pypi.org/project/wasabi/
        pass
    else:
        return output


if __name__ == "__main__":
    app()
For handling predictions directly from a spaCy/Prodigy model, I think we should implement a typer command that does what @Eleni170 implemented in f2841e2. So it would be something like:
import json

import spacy

# check_labels and create_prodigy_spans as implemented by @Eleni170 in f2841e2


@app.command()
def predict(
    model_path: str = typer.Argument(..., help="Path to spaCy model"),
    data_path: str = typer.Argument(
        ..., help="Path to data in prodigy format (including the raw text)"
    ),
    by_tag: bool = typer.Option(
        False,
        help="If set, will return tag level results instead of aggregated results.",
    ),
    pretty: bool = typer.Option(
        False,
        help="If set, will print the results in a pretty format instead of returning the raw json",
    ),
):
    spacy_model = spacy.load(model_path)

    true = []
    pred = []
    tags = {}

    with open(data_path) as f:
        for line in f:
            pattern = json.loads(line)
            text = pattern["text"]
            meta = pattern["meta"]

            labels = check_labels(meta)
            for label in labels:
                tags[label] = ''

            true.append(meta)

            doc = spacy_model(text)
            pred.append(create_prodigy_spans(doc))

    # Maybe we also want to pass tags to the CLI as above, but default to all tags as below if nothing is passed.
    evaluator = Evaluator(true, pred, tags=list(tags.keys()))
    global_results, aggregation_results = evaluator.evaluate()

    # Similar logic as above to print results to the console, either as raw json or pretty printed.
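Untested as well, but the wiring could be exercised with Typer's test runner; the file paths and tags below are just placeholders:
from typer.testing import CliRunner

runner = CliRunner()
# With two commands registered, the command name comes first.
result = runner.invoke(app, ["evaluate", "true.jsonl", "pred.jsonl", "PER,LOC", "--loader", "list"])
print(result.output)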
Hey guys @davidsbatista @ivyleavedtoadflax ,
I'm using this package to evaluate a NER tagger I built. It only has to recognize one type of entity, so tags is a list with an empty string. Basically, when I run the following:
t = [['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B', 'I']]
p = [['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B', 'I', 'O', 'O', 'O', 'O', 'B', 'I']]
evaluator = Evaluator(t,p, tags= [''], loader='list')
evaluator.evaluate()
I get
({'ent_type': {'correct': 0, 'incorrect': 0, 'partial': 0, 'missed': 0, 'spurious': 1, 'possible': 0, 'actual': 1, 'precision': 0.0, 'recall': 0, 'f1': 0}, 'partial': {'correct': 0, 'incorrect': 0, 'partial': 0, 'missed': 0, 'spurious': 1, 'possible': 0, 'actual': 1, 'precision': 0.0, 'recall': 0, 'f1': 0}, 'strict': {'correct': 0, 'incorrect': 0, 'partial': 0, 'missed': 0, 'spurious': 1, 'possible': 0, 'actual': 1, 'precision': 0.0, 'recall': 0, 'f1': 0}, 'exact': {'correct': 0, 'incorrect': 0, 'partial': 0, 'missed': 0, 'spurious': 1, 'possible': 0, 'actual': 1, 'precision': 0.0, 'recall': 0, 'f1': 0}}, {'': {'ent_type': {'correct': 0, 'incorrect': 0, 'partial': 0, 'missed': 0, 'spurious': 1, 'possible': 0, 'actual': 1, 'precision': 0.0, 'recall': 0, 'f1': 0}, 'partial': {'correct': 0, 'incorrect': 0, 'partial': 0, 'missed': 0, 'spurious': 1, 'possible': 0, 'actual': 1, 'precision': 0.0, 'recall': 0, 'f1': 0}, 'strict': {'correct': 0, 'incorrect': 0, 'partial': 0, 'missed': 0, 'spurious': 1, 'possible': 0, 'actual': 1, 'precision': 0.0, 'recall': 0, 'f1': 0}, 'exact': {'correct': 0, 'incorrect': 0, 'partial': 0, 'missed': 0, 'spurious': 1, 'possible': 0, 'actual': 1, 'precision': 0.0, 'recall': 0, 'f1': 0}}})
What's notable is that 'possible' is 0 for all evaluation schemas, while there is clearly an entity present in the true list. The same goes for 'actual': it fails to find the last entity, leading me to believe the problem is in the loading of the data. This problem comes up frequently.
Any ideas?
Pim
Thanks for the great library.
Just wondering about the logic for calculating the POSSIBLE count.
If my pred is
B-ORG, I-ORG, B-ORG, I-ORG
and my true label is
B-ORG, I-ORG, I-ORG, I-ORG
I think the current logic will calculate the POSSIBLE as 2? But there is only 1 gold-standard annotation.
If 2 is correct, that means POSSIBLE cannot be interpreted as the number of gold-standard entities in the data, am I right?
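For anyone who wants to reproduce this, here is the example above in the list loader format (I have not verified what the current release actually reports for 'possible'):
from nervaluate import Evaluator

true = [["B-ORG", "I-ORG", "I-ORG", "I-ORG"]]
pred = [["B-ORG", "I-ORG", "B-ORG", "I-ORG"]]

evaluator = Evaluator(true, pred, tags=["ORG"], loader="list")
results, results_by_tag = evaluator.evaluate()
print(results["strict"]["possible"])  # how many gold entities the scorer counts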
What other formats do we want the package to be able to use? For starters:
Switzerland LOCATION
, O
Davos PERSON
2018 O
: O
Soros PERSON
accuses O
Trump PERSON
of O
wanting O
a O
` O
mafia O
state O
' O
and O
blasts O
social O
media O
. O
Switzerland/LOCATION ,/O Davos/PERSON 2018/O :/O Soros/PERSON accuses/O Trump/PERSON of/O wanting/O a/O `/O mafia/O state/O '/O and/O blasts/O social/O media/O ./O
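For the slash-separated style, a rough conversion into the BIO lists the existing list loader already accepts might look like this (hypothetical helper, not part of the package; consecutive tokens with the same tag are merged into one entity):
def slashed_to_bio(line):
    """Convert 'token/TAG token/TAG ...' into a BIO tag list for nervaluate's
    list loader. Hypothetical helper; consecutive tokens sharing a non-O tag
    are treated as a single entity."""
    tags = []
    prev = "O"
    for token in line.split():
        _, _, tag = token.rpartition("/")
        if tag == "O":
            tags.append("O")
        elif tag == prev:
            tags.append(f"I-{tag}")
        else:
            tags.append(f"B-{tag}")
        prev = tag
    return tags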
On this blog post about NER evaluation, the author uses nervaluate and also shows, with a snippet of code, how to quickly flip the results dictionary so it can be packed into a DataFrame:
from collections import defaultdict

def flip_nested_dict(dd):
    result = defaultdict(dict)
    for k1, d in dd.items():
        for k2, v in d.items():
            result[k2][k1] = v
    return dict(result)
I will add this as yet another format to export the results from the Evaluator.
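For reference, with pandas installed the flipped dictionary drops straight into a DataFrame (the variable names are only illustrative):
import pandas as pd

results, results_by_tag = evaluator.evaluate()
df = pd.DataFrame(flip_nested_dict(results))  # rows: the four schemes, columns: counts and metrics
print(df)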
It'd be good to have an F1 score in the nervaluate output. I personally always have to calculate this afterwards anyway, e.g.
evaluator = Evaluator(true_doc_entities, pred_doc_entities, tags=tags)
results, _ = evaluator.evaluate()
f1 = f1_score(results['partial']['precision'], results['partial']['recall'])
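The f1_score helper above is presumably the issue author's own; a minimal version (harmonic mean, guarding against division by zero) would be:
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; returns 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)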
Hi,
Thanks a lot for maintaining this great tool!
A small note on the README: I think there is a typo when defining the Recall of the partial match:
Precision = (COR + 0.5 × PAR) / ACT = TP / (TP + FP)
Recall = (COR + 0.5 × PAR)/POS = COR / ACT = TP / (TP + FP) # here
I believe it should be
Recall = (COR + 0.5 × PAR)/POS = COR / ACT = TP / (TP + FN) # here
Hello guys,
Regarding the exact match measure: in some cases (nested entities?) the scorer seems to produce incorrect recall values. As an example, consider the results obtained by the scorer for the 'true' and 'pred' sequences below.
For this example I would expect TP = correct = 1, FN = 1, and recall = TP / (TP + FN) = 1/2 = 0.5, because we correctly extracted 1 entity (i.e., "start": 1, "end": 2) out of the 2 entities in the gold standard. However, the scorer reports 0.33.
true = [ [{"label": "PER", "start": 1, "end": 2}, {"label": "PER", "start": 3, "end": 10}] ]
pred = [ [{"label": "PER", "start": 1, "end": 2}, {"label": "PER", "start": 3, "end": 5}, {"label": "PER", "start": 6, "end": 10}] ]
from nervaluate import Evaluator
evaluator = Evaluator(true, pred, tags=['PER'])
results, results_per_tag = evaluator.evaluate()
print(results)
'exact': {'correct': 1, 'incorrect': 2, 'partial': 0, 'missed': 0, 'spurious': 0, 'possible': 3, 'actual': 3, 'precision': 0.3333333333333333, 'recall': 0.3333333333333333, 'f1': 0.3333333333333333}}
First of all, thank you for a very nice evaluation package!
When I was playing with it, my first instinct was to install it from PyPI (https://pypi.org/project/nervaluate/), but it seems the version there (0.1.8) is rather old (released in 2020) compared to the current version on the master branch (0.2.0).
Do you think it would be possible to update the PyPI package by any chance?
Thanks!
The current examples are in a notebook, which:
Let's create some example scripts which can be more easily read and executed.
As Prodigy does not offer a quick command to do predictions, it would be convenient to just be able to pass the data and model to Evaluator and get the metrics, instead of having to run a predict script to use the tool.
It would be great to be able to trigger nervaluate from the command line, something similar to spacy evaluate (https://spacy.io/api/cli#evaluate).
We can start with nervaluate model_path data_path.
The logger might not always be needed and can get annoying if you are evaluating lots of times in a loop, so it'd be nice to be able to turn it off.
I usually turn it off by:
import logging
logger = logging.getLogger()
logger.setLevel(logging.CRITICAL)
but an argument to Evaluator would be nicer.
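Until there is an Evaluator argument for this, a slightly less blunt workaround (assuming the package logs through a logger named after its module, which is the usual convention) is to silence only that logger:
import logging

# Assumes nervaluate uses logging.getLogger(__name__) internally (the common
# convention); this leaves the root logger and other libraries untouched.
logging.getLogger("nervaluate").setLevel(logging.CRITICAL)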
I'm looking at the implications of this line: nervaluate/src/nervaluate/evaluate.py, line 265 (commit 36fd20e).
In particular, I have a fair amount of data that's really unbalanced, so assigning a spurious prediction to every class completely washes out the precision metrics for rare classes. For example, if I end up with 100 spurious tags out of 2000 true entities, and one of my classes only has 20 examples, the precision for that class is now computed over a denominator of 120, regardless of which predicted classes the 100 spurious tags actually belong to.
Shouldn't the spurious tag just be a false positive for the predicted class, and a false negative for the "outside" class (which we kind of don't care about)? I could maybe be convinced otherwise, but wanted to suggest the change, because this is how I'm currently using this code privately.
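A sketch of the alternative accounting described above, using a hypothetical per-tag counter structure rather than the package's actual internals:
def record_spurious(per_tag_counts, predicted_tag):
    # Count the spurious prediction as a false positive for its predicted class only,
    # instead of adding it to the 'spurious' bucket of every tag.
    per_tag_counts[predicted_tag]["spurious"] += 1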
Is there a way to find out which instances were marked as 'correct', 'incorrect', 'spurious', etc. under a particular evaluation schema?
We should check that the tests run on a Windows machine, e.g. through a Windows container.
Hello,
First, thanks for the great lib!
I was wondering if you could confirm whether the metrics handle nested entities (i.e., entity spans inside one or more other entity spans) as well as flat ones? In the examples we only see flat entities. My guess is that the strict metric should hold for both cases.
Thanks in advance
The question might be stupid, but I see there's no actual word given as input in the Prodigy span style.
For an example:
Truth: "Paris" -> [[{"label": "PER", "start": 0, "end": 1}]]
Pred: "London" -> [[{"label": "PER", "start": 0, "end": 1}]]
The evaluation shouldn't report precision, recall and F1 of 1.0, but it does. Even changing the format to CoNLL style makes no difference.
Am I missing something?
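For what it's worth, this is everything the evaluator receives in the example above; the surface strings never appear in the call, only labels and offsets:
from nervaluate import Evaluator

# "Paris" and "London" are never passed in; the evaluator only compares
# the label and the start/end offsets below, which are identical.
true = [[{"label": "PER", "start": 0, "end": 1}]]
pred = [[{"label": "PER", "start": 0, "end": 1}]]

evaluator = Evaluator(true, pred, tags=["PER"])
results, _ = evaluator.evaluate()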
What is a good way to cite nervaluate if we use it in our work? I did not find a citation in the usual locations such as a README.
A quantitative method for evaluating the quality of partial matches is defined in this paper by Ilias Chalkidis: http://nlp.cs.aueb.gr/pubs/jurix2017.pdf.