mantisai / nervaluate
Full named-entity (i.e., not tag/token) evaluation metrics based on SemEval’13
License: MIT License
With the move from Travis to GitHub Actions, code coverage is broken.
Hi all!
I tried applying the Evaluator to my nested list entities, but it only returns "results" and "results_by_tag".
evaluator = Evaluator(true, predicted, tags=['FAUNA', 'FLORA'], loader="list")
results, results_by_tag, result_indices, result_indices_by_tag = evaluator.evaluate()
Error:
ValueError: not enough values to unpack (expected 4, got 2)
Any idea how I can fix this? Or am I missing something? :)
Thank you!
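A possible workaround, assuming the installed release only returns the two aggregate dictionaries mentioned in the error, is to unpack just those (the per-instance indices would then not be available):
evaluator = Evaluator(true, predicted, tags=['FAUNA', 'FLORA'], loader="list")

# Unpack only the two values this release returns; the per-instance
# indices are not exposed here.
results, results_by_tag = evaluator.evaluate()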
It would be great for the tool to be able to accept spaCy data formats.
Needs to include:
@davidsbatista I see you changed the original code that this is based on to an MIT license. I propose doing the same here. Is that OK with everyone who has contributed?
@davidsbatista @LizGil @aflueckiger @pimmeerdink
cc @nsorros
This is following on from our discussion in #48.
First off, I think the default CLI should be a very simple wrapper around the Evaluator class, so it doesn't need to be much more complicated than (note I didn't test any of this):
./nervaluate/cli.py:
import typer

from nervaluate import Evaluator

app = typer.Typer()


@app.command()
def evaluate(
    true_path: str = typer.Argument(..., help="Path to true entity labels"),
    pred_path: str = typer.Argument(..., help="Path to predicted entity labels"),
    tags: str = typer.Argument(
        None, help="Comma separated list of tags to include in the evaluation"
    ),
    loader: str = typer.Option(
        None,
        help="Optional loader when not using prodigy style spans. One of [list, conll]",
    ),
    by_tag: bool = typer.Option(
        False,
        help="If set, will return tag level results instead of aggregated results.",
    ),
    pretty: bool = typer.Option(
        False,
        help="If set, will print the results in a pretty format instead of returning the raw json",
    ),
):
    tags_list = tags.split(",") if tags else None
    evaluator = Evaluator(true_path, pred_path, tags=tags_list, loader=loader)
    results, results_by_tag = evaluator.evaluate()

    if by_tag:
        output = results_by_tag
    else:
        output = results

    if pretty:
        # Some code from wasabi to print a pretty table https://pypi.org/project/wasabi/
        pass
    else:
        return output


if __name__ == "__main__":
    app()
For handling predictions directly from a spaCy/Prodigy model, I think we should implement a typer command that does what @Eleni170 implemented in f2841e2. So it would be something like:
import json

import spacy

# check_labels and create_prodigy_spans as implemented by @Eleni170 in f2841e2


@app.command()
def predict(
    model_path: str = typer.Argument(..., help="Path to spaCy model"),
    data_path: str = typer.Argument(
        ..., help="Path to data in prodigy format (including the raw text)"
    ),
    by_tag: bool = typer.Option(
        False,
        help="If set, will return tag level results instead of aggregated results.",
    ),
    pretty: bool = typer.Option(
        False,
        help="If set, will print the results in a pretty format instead of returning the raw json",
    ),
):
    spacy_model = spacy.load(model_path)

    true = []
    pred = []
    tags = {}

    with open(data_path) as f:
        for line in f:
            pattern = json.loads(line)
            text = pattern["text"]
            meta = pattern["meta"]

            labels = check_labels(meta)
            for label in labels:
                tags[label] = ''

            true.append(meta)

            doc = spacy_model(text)
            pred.append(create_prodigy_spans(doc))

    # Maybe we also want to pass tags to the CLI as above, but default to all tags as below if nothing is passed.
    evaluator = Evaluator(true, pred, tags=list(tags.keys()))
    global_results, aggregation_results = evaluator.evaluate()

    # Similar logic as above to print results to the console, either as raw json or pretty printed.
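Untested as well, but the wiring could be exercised with Typer's test runner; the file paths and tags below are just placeholders:
from typer.testing import CliRunner

runner = CliRunner()
# With two commands registered, the command name comes first.
result = runner.invoke(app, ["evaluate", "true.jsonl", "pred.jsonl", "PER,LOC", "--loader", "list"])
print(result.output)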
Hey guys @davidsbatista @ivyleavedtoadflax ,
I'm using this package to evaluate a NER tagger I built. It only has to recognize one type of entity, so tags is a list with an empty string. Basically, when I run the following:
t = [['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B', 'I']]
p = [['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B', 'I', 'O', 'O', 'O', 'O', 'B', 'I']]
evaluator = Evaluator(t,p, tags= [''], loader='list')
evaluator.evaluate()
I get
({'ent_type': {'correct': 0, 'incorrect': 0, 'partial': 0, 'missed': 0, 'spurious': 1, 'possible': 0, 'actual': 1, 'precision': 0.0, 'recall': 0, 'f1': 0}, 'partial': {'correct': 0, 'incorrect': 0, 'partial': 0, 'missed': 0, 'spurious': 1, 'possible': 0, 'actual': 1, 'precision': 0.0, 'recall': 0, 'f1': 0}, 'strict': {'correct': 0, 'incorrect': 0, 'partial': 0, 'missed': 0, 'spurious': 1, 'possible': 0, 'actual': 1, 'precision': 0.0, 'recall': 0, 'f1': 0}, 'exact': {'correct': 0, 'incorrect': 0, 'partial': 0, 'missed': 0, 'spurious': 1, 'possible': 0, 'actual': 1, 'precision': 0.0, 'recall': 0, 'f1': 0}}, {'': {'ent_type': {'correct': 0, 'incorrect': 0, 'partial': 0, 'missed': 0, 'spurious': 1, 'possible': 0, 'actual': 1, 'precision': 0.0, 'recall': 0, 'f1': 0}, 'partial': {'correct': 0, 'incorrect': 0, 'partial': 0, 'missed': 0, 'spurious': 1, 'possible': 0, 'actual': 1, 'precision': 0.0, 'recall': 0, 'f1': 0}, 'strict': {'correct': 0, 'incorrect': 0, 'partial': 0, 'missed': 0, 'spurious': 1, 'possible': 0, 'actual': 1, 'precision': 0.0, 'recall': 0, 'f1': 0}, 'exact': {'correct': 0, 'incorrect': 0, 'partial': 0, 'missed': 0, 'spurious': 1, 'possible': 0, 'actual': 1, 'precision': 0.0, 'recall': 0, 'f1': 0}}})
What's notable is that 'possible' is 0 for all evaluation schemas, while there is clearly an entity present in the true list. The same goes for 'actual': it fails to find the last entity, leading me to believe the problem is in the loading of the data. This problem comes up frequently.
Any ideas?
Pim
Thanks for the great library.
Just wondering about the logic for calculating the POSSIBLE count.
If my pred is
B-ORG, I-ORG, B-ORG, I-ORG
and my true label is
B-ORG, I-ORG, I-ORG, I-ORG
I think the current logic will calculate the POSSIBLE as 2? But there is only 1 gold-standard annotation.
If 2 is correct, that means POSSIBLE cannot be interpreted as the number of gold-standard entities in the data, am I right?
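For anyone who wants to reproduce this, here is the example above in the list loader format (I have not verified what the current release actually reports for 'possible'):
from nervaluate import Evaluator

true = [["B-ORG", "I-ORG", "I-ORG", "I-ORG"]]
pred = [["B-ORG", "I-ORG", "B-ORG", "I-ORG"]]

evaluator = Evaluator(true, pred, tags=["ORG"], loader="list")
results, results_by_tag = evaluator.evaluate()
print(results["strict"]["possible"])  # how many gold entities the scorer counts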
What other formats do we want the package to be able to use? For starters:
Switzerland LOCATION
, O
Davos PERSON
2018 O
: O
Soros PERSON
accuses O
Trump PERSON
of O
wanting O
a O
` O
mafia O
state O
' O
and O
blasts O
social O
media O
. O
Switzerland/LOCATION ,/O Davos/PERSON 2018/O :/O Soros/PERSON accuses/O Trump/PERSON of/O wanting/O a/O `/O mafia/O state/O '/O and/O blasts/O social/O media/O ./O
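For the slash-separated style, a rough conversion into the BIO lists the existing list loader already accepts might look like this (hypothetical helper, not part of the package; consecutive tokens with the same tag are merged into one entity):
def slashed_to_bio(line):
    """Convert 'token/TAG token/TAG ...' into a BIO tag list for nervaluate's
    list loader. Hypothetical helper; consecutive tokens sharing a non-O tag
    are treated as a single entity."""
    tags = []
    prev = "O"
    for token in line.split():
        _, _, tag = token.rpartition("/")
        if tag == "O":
            tags.append("O")
        elif tag == prev:
            tags.append(f"I-{tag}")
        else:
            tags.append(f"B-{tag}")
        prev = tag
    return tags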
On this blog post about NER evaluation, the author uses nervaluate and also shows, with a snippet of code, how to quickly flip the results dictionary so it can be packed into a DataFrame:
from collections import defaultdict

def flip_nested_dict(dd):
    result = defaultdict(dict)
    for k1, d in dd.items():
        for k2, v in d.items():
            result[k2][k1] = v
    return dict(result)
I will add this as yet another format to export the results from the Evaluator.
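For reference, with pandas installed the flipped dictionary drops straight into a DataFrame (the variable names are only illustrative):
import pandas as pd

results, results_by_tag = evaluator.evaluate()
df = pd.DataFrame(flip_nested_dict(results))  # rows: the four schemes, columns: counts and metrics
print(df)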
It'd be good to have an F1 score in the nervaluate output. I personally always have to calculate this afterwards anyway, e.g.
evaluator = Evaluator(true_doc_entities, pred_doc_entities, tags=tags)
results, _ = evaluator.evaluate()
f1 = f1_score(results['partial']['precision'], results['partial']['recall'])
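The f1_score helper above is presumably the issue author's own; a minimal version (harmonic mean, guarding against division by zero) would be:
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; returns 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)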
Hi,
Thanks a lot for maintaining this great tool!
A small note on the README: I think there is a typo when defining the Recall of the partial match:
Precision = (COR + 0.5 × PAR) / ACT = TP / (TP + FP)
Recall = (COR + 0.5 × PAR)/POS = COR / ACT = TP / (TP + FP) # here
I believe it should be
Recall = (COR + 0.5 × PAR)/POS = COR / ACT = TP / (TP + FN) # here
Hello guys,
Regarding the exact match measure: in some cases (nested entities?) the scorer seems to produce incorrect recall values. As an example, consider the results obtained by the scorer for the 'true' and 'pred' sequences below.
For this example I would expect TP = correct = 1, FN = 1, and recall = TP / (TP + FN) = 1/2 = 0.5, because we correctly extracted 1 entity (i.e., "start": 1, "end": 2) out of the 2 entities in the gold standard. However, the scorer reports 0.33.
true = [ [{"label": "PER", "start": 1, "end": 2}, {"label": "PER", "start": 3, "end": 10}] ]
pred = [ [{"label": "PER", "start": 1, "end": 2}, {"label": "PER", "start": 3, "end": 5}, {"label": "PER", "start": 6, "end": 10}] ]
from nervaluate import Evaluator
evaluator = Evaluator(true, pred, tags=['PER'])
results, results_per_tag = evaluator.evaluate()
print(results)
'exact': {'correct': 1, 'incorrect': 2, 'partial': 0, 'missed': 0, 'spurious': 0, 'possible': 3, 'actual': 3, 'precision': 0.3333333333333333, 'recall': 0.3333333333333333, 'f1': 0.3333333333333333}}
First of all, thank you for a very nice evaluation package!
When I was playing with it, my first instinct was to install it from PyPI (https://pypi.org/project/nervaluate/), but it seems the version there (0.1.8) is rather old (released in 2020) compared to the current version on the master branch (0.2.0).
Do you think it would be possible to update the PyPI package by any chance?
Thanks!
The current examples are in a notebook, which:
Let's create some example scripts which can be more easily read and executed.
As Prodigy does not offer a quick command to do predictions, it would be convenient to just be able to pass the data and model to Evaluator and get the metrics, instead of having to run a predict script to use the tool.
It would be great to be able to trigger nervaluate from the command line, something similar to spacy evaluate (https://spacy.io/api/cli#evaluate).
We can start with nervaluate model_path data_path.
The logger might not always be needed and can get annoying if you are evaluating lots of times in a loop, so it'd be nice to be able to turn it off.
I usually turn it off by:
import logging
logger = logging.getLogger()
logger.setLevel(logging.CRITICAL)
but an argument to Evaluator would be nicer.
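Until there is an Evaluator argument for this, a slightly less blunt workaround (assuming the package logs through a logger named after its module, which is the usual convention) is to silence only that logger:
import logging

# Assumes nervaluate uses logging.getLogger(__name__) internally (the common
# convention); this leaves the root logger and other libraries untouched.
logging.getLogger("nervaluate").setLevel(logging.CRITICAL)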
I'm looking at the implications of this line: nervaluate/src/nervaluate/evaluate.py, line 265 (commit 36fd20e).
In particular, I have a fair amount of data that's really unbalanced, so assigning a spurious prediction to every class completely washes out the precision metrics for rare classes. For example, if I end up with 100 spurious tags out of 2000 true entities, and one of my classes only has 20 examples, the precision for that class is now computed over a denominator of 120, regardless of which predicted classes the 100 spurious tags actually belong to.
Shouldn't the spurious tag just be a false positive for the predicted class, and a false negative for the "outside" class (which we kind of don't care about)? I could maybe be convinced otherwise, but wanted to suggest the change, because this is how I'm currently using this code privately.
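A sketch of the alternative accounting described above, using a hypothetical per-tag counter structure rather than the package's actual internals:
def record_spurious(per_tag_counts, predicted_tag):
    # Count the spurious prediction as a false positive for its predicted class only,
    # instead of adding it to the 'spurious' bucket of every tag.
    per_tag_counts[predicted_tag]["spurious"] += 1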
Is there a way to find out which instances were marked as 'correct', 'incorrect', 'spurious', etc. under a particular evaluation schema?
We should check that the tests run on a Windows machine, e.g. through a Windows container.
Hello,
First, thanks for the great lib!
I was wondering if you could confirm whether the metrics handle nested entities (i.e., entity spans inside one or more other entity spans) as well as flat ones? In the examples we only see flat entities. My guess is that the strict metric should hold for both cases.
Thanks in advance
The question might be stupid, but I see there's no actual word given as input in the Prodigy span style.
For an example:
Truth: "Paris" -> [[{"label": "PER", "start": 0, "end": 1}]]
Pred: "London" -> [[{"label": "PER", "start": 0, "end": 1}]]
The evaluation shouldn't report precision, recall and F1 of 1.0, but it does. Even changing the format to CoNLL style makes no difference.
Am I missing something?
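For what it's worth, this is everything the evaluator receives in the example above; the surface strings never appear in the call, only labels and offsets:
from nervaluate import Evaluator

# "Paris" and "London" are never passed in; the evaluator only compares
# the label and the start/end offsets below, which are identical.
true = [[{"label": "PER", "start": 0, "end": 1}]]
pred = [[{"label": "PER", "start": 0, "end": 1}]]

evaluator = Evaluator(true, pred, tags=["PER"])
results, _ = evaluator.evaluate()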
What is a good way to cite nervaluate if we use it in our work? I did not find a citation in the usual locations such as a README.
A quantitative method for evaluating the quality of partial matches is defined in this paper by Ilias Chalkidis: http://nlp.cs.aueb.gr/pubs/jurix2017.pdf.