
nervaluate's People

Contributors

aflueckiger, danshatford, davidsbatista, dependabot[bot], fgh95, infopz, ivyleavedtoadflax, jackboyla, lizgzil, pimmeerdink


nervaluate's Issues

Code coverage

With the move from Travis to GitHub Actions, code coverage reporting is broken.

Not enough values to unpack (Evaluator)

Hi all!

I tried applying the Evaluator to my nested list entities, but it only returns "results" and "results_by_tag".

evaluator = Evaluator(true, predicted, tags=['FAUNA', 'FLORA'], loader="list")
results, results_by_tag, result_indices, result_indices_by_tag = evaluator.evaluate()

Error:

ValueError: not enough values to unpack (expected 4, got 2)

Any idea how I can fix this? Or am I missing something? :)
Thank you!
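
(For reference: judging by the error message and the two-value calls elsewhere in these issues, evaluate() returns two values here, so a minimal sketch of the working unpacking would be the following; the toy sequences are for illustration only.)

from nervaluate import Evaluator

# Toy BIO-tagged sequences in the nested-list format, for illustration only
true = [['O', 'B-FAUNA', 'I-FAUNA', 'O']]
predicted = [['O', 'B-FLORA', 'I-FLORA', 'O']]

evaluator = Evaluator(true, predicted, tags=['FAUNA', 'FLORA'], loader="list")
results, results_by_tag = evaluator.evaluate()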

CLI and getting spans directly from a prodigy or a spaCy model

This is following on from our discussion in #48.

First off, I think the default CLI should be a very simple wrapper around the Evaluator class, so it doesn't need to be much more complicated than the following (note: I didn't test any of this):

./nervaluate/cli.py:

import json

import typer

from nervaluate import Evaluator

app = typer.Typer()


@app.command()
def evaluate(
    true_path: str = typer.Argument(..., help="Path to true entity labels"),
    pred_path: str = typer.Argument(..., help="Path to predicted entity labels"),
    tags: str = typer.Argument(
        None, help="Comma separated list of tags to include in the evaluation"
    ),
    loader: str = typer.Option(
        None,
        help="Optional loader when not using prodigy style spans. One of [list, conll]",
    ),
    by_tag: bool = typer.Option(
        False,
        help="If set, will return tag level results instead of aggregated results.",
    ),
    pretty: bool = typer.Option(
        False,
        help="If set, will print the results in a pretty format instead of the raw json",
    ),
):
    tags_list = tags.split(",") if tags else None

    # NOTE: the files at true_path/pred_path would need to be read and parsed
    # here (depending on the loader) before being handed to Evaluator.
    evaluator = Evaluator(true_path, pred_path, tags=tags_list, loader=loader)

    results, results_by_tag = evaluator.evaluate()

    output = results_by_tag if by_tag else results

    if pretty:
        # Some code from wasabi to print a pretty table https://pypi.org/project/wasabi/
        pass
    else:
        typer.echo(json.dumps(output, indent=2))


if __name__ == "__main__":
    app()
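
A hypothetical invocation of the above (file names and tag values are made up for illustration), assuming the module is saved as cli.py:

python cli.py evaluate true.conll pred.conll PER,ORG --loader conll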

For handling predictions directly from a spaCy/prodigy model, I think we should implement a typer command that does what @Eleni170 implemented in f2841e2. So it would be something like:

# In the same cli.py as above, so `app`, `json` and `Evaluator` are already in
# scope; additionally requires `import spacy` and the check_labels /
# create_prodigy_spans helpers from f2841e2.
@app.command()
def predict(
    model_path: str = typer.Argument(..., help="Path to spaCy model"),
    data_path: str = typer.Argument(
        ..., help="Path to data in prodigy format (including the raw text)"
    ),
    by_tag: bool = typer.Option(
        False,
        help="If set, will return tag level results instead of aggregated results.",
    ),
    pretty: bool = typer.Option(
        False,
        help="If set, will print the results in a pretty format instead of the raw json",
    ),
):
    spacy_model = spacy.load(model_path)

    true = []
    pred = []
    tags = {}  # used as an ordered set of the labels seen in the data

    with open(data_path) as f:
        for line in f:
            pattern = json.loads(line)
            text = pattern["text"]
            meta = pattern["meta"]
            labels = check_labels(meta)
            for label in labels:
                tags[label] = ''
            true.append(meta)
            doc = spacy_model(text)
            pred.append(create_prodigy_spans(doc))

    # Maybe we also want to pass tags to the CLI as above, but default to all
    # tags found in the data (as below) if nothing is passed.
    evaluator = Evaluator(true, pred, tags=list(tags.keys()))
    global_results, aggregation_results = evaluator.evaluate()

    # Similar logic as above to print the results to the console, either as
    # raw json or pretty printed.

Let me know what you think! @Eleni170 @nsorros

Inconsistent behavior when loading from list

Hey guys @davidsbatista @ivyleavedtoadflax ,

I'm using this package to evaluate a NER tagger I built. It only has to recognize one type of entity, so tags is a list containing an empty string. Basically, when I run the following:
t = [['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B', 'I']]
p = [['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B', 'I', 'O', 'O', 'O', 'O', 'B', 'I']]
evaluator = Evaluator(t, p, tags=[''], loader='list')
evaluator.evaluate()

I get

({'ent_type': {'correct': 0, 'incorrect': 0, 'partial': 0, 'missed': 0, 'spurious': 1, 'possible': 0, 'actual': 1, 'precision': 0.0, 'recall': 0, 'f1': 0}, 'partial': {'correct': 0, 'incorrect': 0, 'partial': 0, 'missed': 0, 'spurious': 1, 'possible': 0, 'actual': 1, 'precision': 0.0, 'recall': 0, 'f1': 0}, 'strict': {'correct': 0, 'incorrect': 0, 'partial': 0, 'missed': 0, 'spurious': 1, 'possible': 0, 'actual': 1, 'precision': 0.0, 'recall': 0, 'f1': 0}, 'exact': {'correct': 0, 'incorrect': 0, 'partial': 0, 'missed': 0, 'spurious': 1, 'possible': 0, 'actual': 1, 'precision': 0.0, 'recall': 0, 'f1': 0}}, {'': {'ent_type': {'correct': 0, 'incorrect': 0, 'partial': 0, 'missed': 0, 'spurious': 1, 'possible': 0, 'actual': 1, 'precision': 0.0, 'recall': 0, 'f1': 0}, 'partial': {'correct': 0, 'incorrect': 0, 'partial': 0, 'missed': 0, 'spurious': 1, 'possible': 0, 'actual': 1, 'precision': 0.0, 'recall': 0, 'f1': 0}, 'strict': {'correct': 0, 'incorrect': 0, 'partial': 0, 'missed': 0, 'spurious': 1, 'possible': 0, 'actual': 1, 'precision': 0.0, 'recall': 0, 'f1': 0}, 'exact': {'correct': 0, 'incorrect': 0, 'partial': 0, 'missed': 0, 'spurious': 1, 'possible': 0, 'actual': 1, 'precision': 0.0, 'recall': 0, 'f1': 0}}})
What's notable is that 'possible' is 0 for all evaluation methods, while there is clearly an entity present in the true list. The same goes for 'actual': it fails to count the last entity present, leading me to believe the problem is in the loading of the data. This problem occurs more frequently than just this example.

Any ideas?

Pim
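
(A possible diagnostic, assuming the list loader expects BIO tags with a type suffix (e.g. B-MISC/I-MISC rather than bare B/I): rerun the same sequences with a named tag and check whether 'possible' and 'actual' are then counted. MISC is just a placeholder name here.)

from nervaluate import Evaluator

# Same sequences as above, but with an explicit entity type on the B/I tags
t = [['O'] * 28 + ['B-MISC', 'I-MISC']]
p = [['O'] * 22 + ['B-MISC', 'I-MISC'] + ['O'] * 4 + ['B-MISC', 'I-MISC']]

evaluator = Evaluator(t, p, tags=['MISC'], loader='list')
results, results_by_tag = evaluator.evaluate()
print(results['strict'])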

Is the number of POSSIBLE different from the number of "B-" tokens?

Thanks for the great library.

Just wondering about the logic for calculating the number of POSSIBLE tokens.
If my pred is
B-ORG, I-ORG, B-ORG, I-ORG
and my true label is
B-ORG, I-ORG, I-ORG, I-ORG

I think the current logic will calculate POSSIBLE as 2, but there is only 1 gold-standard annotation.

If 2 is correct, that means POSSIBLE cannot be interpreted as the number of gold-standard entities in the data, am I right?
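
(A minimal sketch to check this directly with the list loader, without presuming what the counts will be:)

from nervaluate import Evaluator

true = [['B-ORG', 'I-ORG', 'I-ORG', 'I-ORG']]
pred = [['B-ORG', 'I-ORG', 'B-ORG', 'I-ORG']]

evaluator = Evaluator(true, pred, tags=['ORG'], loader='list')
results, results_by_tag = evaluator.evaluate()
print(results['strict']['possible'], results['strict']['actual'])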

List of possible formats

What other formats do we want the package to be able to use? For starters:

  • list of labels
  • Prodigy
  • spaCy
  • StanfordNER
  • CoNLL (see example a below)
  • Single-Line (see example b below)
  • xml: TODO (add example)
  • inlineXML: TODO (add example)
  • tsv: TODO (add example)
  • slashTags: TODO (add example)
  • Example a (CoNLL):
Switzerland	LOCATION
,	O
Davos	PERSON
2018	O
:	O
Soros	PERSON
accuses	O
Trump	PERSON
of	O
wanting	O
a	O
`	O
mafia	O
state	O
'	O
and	O
blasts	O
social	O
media	O
.	O
  • Example b (Single-Line):
Switzerland/LOCATION ,/O Davos/PERSON 2018/O :/O Soros/PERSON accuses/O Trump/PERSON of/O wanting/O a/O `/O mafia/O state/O '/O and/O blasts/O social/O media/O ./O

What CoNLL format can exactly be used?

Hi,
I'm trying to evaluate a NER model with annotations I made, but I can't seem to find the exact CoNLL format you used in the example "word\tO\nword\tO\nword\tB-PER\nword\tI-PER\n".
Here are the formats available for me to export.
Thank you for your help!!
[screenshot of the available export formats]
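
(For reference, a minimal sketch of how such data could be passed with the conll loader, assuming the loader accepts strings in the form quoted from the README, with one token<TAB>tag pair per line:)

from nervaluate import Evaluator

# Each document is a string of token<TAB>tag lines, as in the README example
true = "word\tO\nword\tO\nword\tB-PER\nword\tI-PER\n"
pred = "word\tO\nword\tO\nword\tB-PER\nword\tI-PER\n"

evaluator = Evaluator(true, pred, tags=['PER'], loader="conll")
results, results_by_tag = evaluator.evaluate()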

Export the results as DataFrame

In this blog post about NER evaluation, the author uses nervaluate and shows, with a short code snippet, how to quickly pack the results from a dictionary into a DataFrame:

from collections import defaultdict

def flip_nested_dict(dd):
    # Swap the two levels of nesting: {outer: {inner: value}} -> {inner: {outer: value}}
    result = defaultdict(dict)
    for k1, d in dd.items():
        for k2, v in d.items():
            result[k2][k1] = v
    return dict(result)

I will add this as yet another format to export the results from the Evaluator.
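
(For illustration, assuming pandas is installed and `results` is the aggregate dictionary returned by evaluator.evaluate(), the flipped dictionary turns into a DataFrame with the evaluation schemas as rows and the metrics as columns:)

import pandas as pd

# Schemas ('ent_type', 'partial', 'strict', 'exact') become the index,
# metrics ('correct', 'precision', ...) become the columns
df = pd.DataFrame(flip_nested_dict(results))
print(df)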

Add F1 to evaluate output

It'd be good to have an F1 score in the nervaluate output. I always have to calculate it afterwards anyway, e.g.

evaluator = Evaluator(true_doc_entities, pred_doc_entities, tags=tags)
results, _ = evaluator.evaluate()
f1 = f1_score(results['partial']['precision'], results['partial']['recall'])
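
(For reference, the f1_score helper used above is presumably just the harmonic mean of precision and recall; a minimal version would be:)

def f1_score(precision, recall):
    # Harmonic mean of precision and recall, with a guard against division by zero
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)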

README error

Hi,

Thanks a lot for maintaining this great tool!

A small note on the README: I think there is a typo in the definition of Recall for the partial match:

Precision = (COR + 0.5 × PAR) / ACT = TP / (TP + FP)
Recall = (COR + 0.5 × PAR)/POS = COR / ACT = TP / (TP + FP) # here

I believe it should be

Recall = (COR + 0.5 × PAR)/POS = COR / ACT = TP / (TP + FN) # here
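
(As a quick sanity check of the proposed fix, assuming COR = 3, PAR = 2, ACT = 6, POS = 8: Precision = (3 + 0.5 × 2) / 6 ≈ 0.667 and Recall = (3 + 0.5 × 2) / 8 = 0.5, which matches the usual reading of POS as TP + FN and ACT as TP + FP.)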

exact match and recall values

Hello guys,

Regarding the exact match measure: in some cases (nested entities?) the scorer seems to produce incorrect Recall values. As an example, consider the results obtained by the scorer for the 'true' and 'pred' sequences below.

For this example I would expect to have TP=Correct=1, FN=1, and Re=TP/(TP+FN)=1/2=0.5. That is because we were able to correctly extract 1 entity (i.e., "start": 1, "end": 2) among the 2 entities in the gold standard. However, the scorer obtains 0.33.

true = [ [{"label": "PER", "start": 1, "end": 2}, {"label": "PER", "start": 3, "end": 10}] ]

pred = [ [{"label": "PER", "start": 1, "end": 2}, {"label": "PER", "start": 3, "end": 5}, {"label": "PER", "start": 6, "end": 10}] ]

from nervaluate import Evaluator

evaluator = Evaluator(true, pred, tags=['PER'])

results, results_per_tag = evaluator.evaluate()

print(results)

'exact': {'correct': 1, 'incorrect': 2, 'partial': 0, 'missed': 0, 'spurious': 0, 'possible': 3, 'actual': 3, 'precision': 0.3333333333333333, 'recall': 0.3333333333333333, 'f1': 0.3333333333333333}}

Update PyPI package

First of all, thank you for a very nice evaluation package!

When I was playing with it, my first instinct was to install it from PyPI (https://pypi.org/project/nervaluate/) but it seems the version there (0.1.8) is rather old (released in 2020) compared to the current version in the master branch (0.2.0).

Do you think it would be possible to update the PyPI package by any chance?

Thanks!

Add examples that don't rely on notebooks

The current examples are in a notebook, which:

  • Doesn't render in GitHub (for some reason)
  • Requires additional dependencies such as jupyter

Let's create some example scripts which can be more easily read and executed.

Turn logger off with argument

The logger might not always be needed and can get annoying if you are evaluating lots of times in a loop, so it'd be nice to be able to turn it off.

I usually turn it off by:

import logging

# Silence everything below CRITICAL on the root logger
logger = logging.getLogger()
logger.setLevel(logging.CRITICAL)

but an argument to Evaluator would be nicer.
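
(A narrower workaround, under the assumption that nervaluate creates its logger via logging.getLogger(__name__), would be to silence only that logger rather than the root logger:)

import logging

# Assumes nervaluate's logger is named after its package; adjust if it isn't
logging.getLogger("nervaluate").setLevel(logging.CRITICAL)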

Does it really make sense to attribute spurious tags to all types?

I'm looking at the implications of this line:

for true in spurious_tags:

In particular, I have a fair amount of data that's really unbalanced, so assigning a spurious prediction to every class completely washes out the precision metrics for rare classes. For example, if I end up with 100 spurious tags out of 2000 true entities, and one of my classes only has 20 examples, the precision for that class is now computed with a denominator of 120, regardless of which predicted classes made up the 100 spurious tags.

Shouldn't the spurious tag just be a false positive for the predicted class, and a false negative for the "outside" class (which we kind of don't care about)? I could maybe be convinced otherwise, but wanted to suggest the change, because this is how I'm currently using this code privately.

More information about output?

Is there a way to find out which instances were marked as 'correct', 'incorrect', 'spurious', etc. for a particular evaluation schema during evaluation?

Handling Nested Entities

Hello,

First, thanks for the great lib!

I was wondering if you could confirm whether the metrics handle nested entities as well as flat ones (i.e. entity spans inside one or more other entity spans)? The examples only show flat entities. My guess is that the strict metric should hold for both cases.

Thanks in advance

Why aren't tokens matched?

The question might be stupid, but I see there's no actual word given as input in the Prodigy span style.
For an example:

Truth: "Paris" -> [[{"label": "PER", "start": 0, "end": 1}]]
Pred: "London" -> [[{"label": "PER", "start": 0, "end": 1}]]

The evaluation shouldn't report Precision, Recall and F1 of 1.0, but it does. Even if I change the format to CoNLL style, it makes no difference.

Am I missing something?


Citation?

What is a good way to cite nervaluate if we use it in our work? I did not find a citation in the usual locations such as a README.
