neulab / interpreteval

Interpretable Evaluation for (Almost) All NLP Tasks

Python 39.82% TeX 2.19% CSS 0.57% HTML 50.39% Perl 5.66% Shell 1.37%

interpreteval's Issues

Provide more flexibility in specifying corpus types

This line here (1) limits the types of corpora that can be used, and (2) won't fail if you specify an illegal corpus type.
https://github.com/neulab/InterpretEval/blob/master/interpretEval/tensorEvaluation-ner.py#L2401

A couple of possible solutions:

allow the individual variables to be specified
auto-detect the file type by probing the first few lines
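
The second suggestion could look roughly like the sketch below. The issue does not spell out what "file type" covers, so this assumes it means the column count and tagging scheme (BIO vs. BIOES) of a whitespace-separated CoNLL-style file with the tag in the last column. `probe_format` and `CorpusFormat` are hypothetical names, not part of InterpretEval. Unlike the linked line, the sketch raises an error on input it does not recognise.

```python
from itertools import islice
from typing import Iterable, NamedTuple


class CorpusFormat(NamedTuple):
    """Hypothetical result type: what the probe learned about the file."""
    n_columns: int
    scheme: str  # "BIO" or "BIOES"


def probe_format(lines: Iterable[str], max_lines: int = 50) -> CorpusFormat:
    """Guess column count and tagging scheme from the first few lines.

    Assumes whitespace-separated columns with the tag in the last column.
    """
    rows = [ln.split() for ln in islice(lines, max_lines) if ln.strip()]
    if not rows:
        raise ValueError("no non-blank lines to probe")

    widths = {len(row) for row in rows}
    if len(widths) != 1:
        raise ValueError(f"inconsistent column counts: {sorted(widths)}")

    # "B-PER" -> "B", "O" -> "O"; anything else is rejected below.
    prefixes = {row[-1].split("-", 1)[0] for row in rows}
    unknown = prefixes - set("BIOES")
    if unknown:
        raise ValueError(f"unrecognised tag prefixes: {sorted(unknown)}")

    scheme = "BIOES" if prefixes & {"E", "S"} else "BIO"
    return CorpusFormat(widths.pop(), scheme)
```

It accepts any iterable of lines, so an open file handle can be passed directly, e.g. `probe_format(open(path, encoding="utf-8"))`.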

"sz" command not found

Upon running run_task_ner.sh I get the following error near the end of the script (https://github.com/neulab/InterpretEval/blob/master/interpretEval/run_task_ner.sh#L94):

run_task_ner.sh: line 96: sz: command not found

I'm not sure what the "sz" command does, or where to find it. Could you elaborate?

run_task_ner.sh overwrites git-committed files

When run_task_ner.sh is run, it overwrites many files that are committed to git, so when you then make changes and run git commit, a whole batch of files is listed as "need merge". The output files of run_task_ner.sh should probably be added to .gitignore, and perhaps written to a separate directory that is not committed.

M	interpretEval/analysis/ner-fig/Flair-ELMo/bucketInfo.pkl
M	interpretEval/analysis/ner-fig/Flair-ELMo/conll03-ELMo-breakdown-eCon.pdf
M	interpretEval/analysis/ner-fig/Flair-ELMo/conll03-ELMo-breakdown-eCon.png
M	interpretEval/analysis/ner-fig/Flair-ELMo/conll03-ELMo-breakdown-eDen.pdf
M	interpretEval/analysis/ner-fig/Flair-ELMo/conll03-ELMo-breakdown-eDen.png
M	interpretEval/analysis/ner-fig/Flair-ELMo/conll03-ELMo-breakdown-eFre.pdf
M	interpretEval/analysis/ner-fig/Flair-ELMo/conll03-ELMo-breakdown-eFre.png
M	interpretEval/analysis/ner-fig/Flair-ELMo/conll03-ELMo-breakdown-eLen.pdf
M	interpretEval/analysis/ner-fig/Flair-ELMo/conll03-ELMo-breakdown-eLen.png
M	interpretEval/analysis/ner-fig/Flair-ELMo/conll03-ELMo-breakdown-oDen.pdf
M	interpretEval/analysis/ner-fig/Flair-ELMo/conll03-ELMo-breakdown-oDen.png
M	interpretEval/analysis/ner-fig/Flair-ELMo/conll03-ELMo-breakdown-sLen.pdf
M	interpretEval/analysis/ner-fig/Flair-ELMo/conll03-ELMo-breakdown-sLen.png
M	interpretEval/analysis/ner-fig/Flair-ELMo/conll03-ELMo-breakdown-tCon.pdf
M	interpretEval/analysis/ner-fig/Flair-ELMo/conll03-ELMo-breakdown-tCon.png
M	interpretEval/analysis/ner-fig/Flair-ELMo/conll03-ELMo-breakdown-tFre.pdf
M	interpretEval/analysis/ner-fig/Flair-ELMo/conll03-ELMo-breakdown-tFre.png
M	interpretEval/analysis/ner-fig/Flair-ELMo/conll03-ELMo-breakdown-tag.pdf
M	interpretEval/analysis/ner-fig/Flair-ELMo/conll03-ELMo-breakdown-tag.png
M	interpretEval/analysis/ner-fig/Flair-ELMo/conll03-ELMo-selfdiag.pdf
M	interpretEval/analysis/ner-fig/Flair-ELMo/conll03-ELMo-selfdiag.png
M	interpretEval/analysis/ner-fig/Flair-ELMo/conll03-Flair-breakdown-eCon.pdf
M	interpretEval/analysis/ner-fig/Flair-ELMo/conll03-Flair-breakdown-eCon.png
M	interpretEval/analysis/ner-fig/Flair-ELMo/conll03-Flair-breakdown-eDen.pdf
M	interpretEval/analysis/ner-fig/Flair-ELMo/conll03-Flair-breakdown-eDen.png
M	interpretEval/analysis/ner-fig/Flair-ELMo/conll03-Flair-breakdown-eFre.pdf
M	interpretEval/analysis/ner-fig/Flair-ELMo/conll03-Flair-breakdown-eFre.png
M	interpretEval/analysis/ner-fig/Flair-ELMo/conll03-Flair-breakdown-eLen.pdf
M	interpretEval/analysis/ner-fig/Flair-ELMo/conll03-Flair-breakdown-eLen.png
M	interpretEval/analysis/ner-fig/Flair-ELMo/conll03-Flair-breakdown-oDen.pdf
M	interpretEval/analysis/ner-fig/Flair-ELMo/conll03-Flair-breakdown-oDen.png
M	interpretEval/analysis/ner-fig/Flair-ELMo/conll03-Flair-breakdown-sLen.pdf
M	interpretEval/analysis/ner-fig/Flair-ELMo/conll03-Flair-breakdown-sLen.png
M	interpretEval/analysis/ner-fig/Flair-ELMo/conll03-Flair-breakdown-tCon.pdf
M	interpretEval/analysis/ner-fig/Flair-ELMo/conll03-Flair-breakdown-tCon.png
M	interpretEval/analysis/ner-fig/Flair-ELMo/conll03-Flair-breakdown-tFre.pdf
M	interpretEval/analysis/ner-fig/Flair-ELMo/conll03-Flair-breakdown-tFre.png
M	interpretEval/analysis/ner-fig/Flair-ELMo/conll03-Flair-breakdown-tag.pdf
M	interpretEval/analysis/ner-fig/Flair-ELMo/conll03-Flair-breakdown-tag.png
M	interpretEval/analysis/ner-fig/Flair-ELMo/conll03-Flair-selfdiag.pdf
M	interpretEval/analysis/ner-fig/Flair-ELMo/conll03-Flair-selfdiag.png
M	interpretEval/analysis/ner-fig/Flair-ELMo/conll03-Flair_ELMo-aideddiag.pdf
M	interpretEval/analysis/ner-fig/Flair-ELMo/conll03-Flair_ELMo-aideddiag.png
M	interpretEval/analysis/ner-fig/Flair-ELMo/conll03-Flair_ELMo-heatmap.png
M	interpretEval/analysis/ner-fig/Flair-ELMo/log.latex
M	interpretEval/analysis/ner-fig/Flair-ELMo/wnut16-ELMo-breakdown-eCon.pdf
M	interpretEval/analysis/ner-fig/Flair-ELMo/wnut16-ELMo-breakdown-eCon.png
M	interpretEval/analysis/ner-fig/Flair-ELMo/wnut16-ELMo-breakdown-eDen.pdf
M	interpretEval/analysis/ner-fig/Flair-ELMo/wnut16-ELMo-breakdown-eDen.png
M	interpretEval/analysis/ner-fig/Flair-ELMo/wnut16-ELMo-breakdown-eFre.pdf
M	interpretEval/analysis/ner-fig/Flair-ELMo/wnut16-ELMo-breakdown-eFre.png
M	interpretEval/analysis/ner-fig/Flair-ELMo/wnut16-ELMo-breakdown-eLen.pdf
M	interpretEval/analysis/ner-fig/Flair-ELMo/wnut16-ELMo-breakdown-eLen.png
M	interpretEval/analysis/ner-fig/Flair-ELMo/wnut16-ELMo-breakdown-oDen.pdf
M	interpretEval/analysis/ner-fig/Flair-ELMo/wnut16-ELMo-breakdown-oDen.png
M	interpretEval/analysis/ner-fig/Flair-ELMo/wnut16-ELMo-breakdown-sLen.pdf
M	interpretEval/analysis/ner-fig/Flair-ELMo/wnut16-ELMo-breakdown-sLen.png
M	interpretEval/analysis/ner-fig/Flair-ELMo/wnut16-ELMo-breakdown-tCon.pdf
M	interpretEval/analysis/ner-fig/Flair-ELMo/wnut16-ELMo-breakdown-tCon.png
M	interpretEval/analysis/ner-fig/Flair-ELMo/wnut16-ELMo-breakdown-tFre.pdf
M	interpretEval/analysis/ner-fig/Flair-ELMo/wnut16-ELMo-breakdown-tFre.png
M	interpretEval/analysis/ner-fig/Flair-ELMo/wnut16-ELMo-breakdown-tag.pdf
M	interpretEval/analysis/ner-fig/Flair-ELMo/wnut16-ELMo-breakdown-tag.png
M	interpretEval/analysis/ner-fig/Flair-ELMo/wnut16-ELMo-selfdiag.pdf
M	interpretEval/analysis/ner-fig/Flair-ELMo/wnut16-ELMo-selfdiag.png
M	interpretEval/analysis/ner-fig/Flair-ELMo/wnut16-Flair-breakdown-eCon.pdf
M	interpretEval/analysis/ner-fig/Flair-ELMo/wnut16-Flair-breakdown-eCon.png
M	interpretEval/analysis/ner-fig/Flair-ELMo/wnut16-Flair-breakdown-eDen.pdf
M	interpretEval/analysis/ner-fig/Flair-ELMo/wnut16-Flair-breakdown-eDen.png
M	interpretEval/analysis/ner-fig/Flair-ELMo/wnut16-Flair-breakdown-eFre.pdf
M	interpretEval/analysis/ner-fig/Flair-ELMo/wnut16-Flair-breakdown-eFre.png
M	interpretEval/analysis/ner-fig/Flair-ELMo/wnut16-Flair-breakdown-eLen.pdf
M	interpretEval/analysis/ner-fig/Flair-ELMo/wnut16-Flair-breakdown-eLen.png
M	interpretEval/analysis/ner-fig/Flair-ELMo/wnut16-Flair-breakdown-oDen.pdf
M	interpretEval/analysis/ner-fig/Flair-ELMo/wnut16-Flair-breakdown-oDen.png
M	interpretEval/analysis/ner-fig/Flair-ELMo/wnut16-Flair-breakdown-sLen.pdf
M	interpretEval/analysis/ner-fig/Flair-ELMo/wnut16-Flair-breakdown-sLen.png
M	interpretEval/analysis/ner-fig/Flair-ELMo/wnut16-Flair-breakdown-tCon.pdf
M	interpretEval/analysis/ner-fig/Flair-ELMo/wnut16-Flair-breakdown-tCon.png
M	interpretEval/analysis/ner-fig/Flair-ELMo/wnut16-Flair-breakdown-tFre.pdf
M	interpretEval/analysis/ner-fig/Flair-ELMo/wnut16-Flair-breakdown-tFre.png
M	interpretEval/analysis/ner-fig/Flair-ELMo/wnut16-Flair-breakdown-tag.pdf
M	interpretEval/analysis/ner-fig/Flair-ELMo/wnut16-Flair-breakdown-tag.png
M	interpretEval/analysis/ner-fig/Flair-ELMo/wnut16-Flair-selfdiag.pdf
M	interpretEval/analysis/ner-fig/Flair-ELMo/wnut16-Flair-selfdiag.png
M	interpretEval/analysis/ner-fig/Flair-ELMo/wnut16-Flair_ELMo-aideddiag.pdf
M	interpretEval/analysis/ner-fig/Flair-ELMo/wnut16-Flair_ELMo-aideddiag.png
M	interpretEval/analysis/ner-fig/Flair-ELMo/wnut16-Flair_ELMo-heatmap.png
M	interpretEval/analysis/tEval-ner.html

pdftoppm: command not found

When running the NER evaluation, you get "pdftoppm: command not found" errors if poppler is not installed.
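
One way to surface missing external commands before the pipeline starts is a small pre-flight check. The sketch below is an illustration only, not something the repository provides: `missing_tools` is a hypothetical helper built on the standard library's `shutil.which`, and the package hints are assumptions that vary by platform (`pdftoppm` ships with poppler, and `sz` from the earlier issue is the ZMODEM send command from lrzsz).

```python
import shutil
from typing import Iterable, List


def missing_tools(tools: Iterable[str]) -> List[str]:
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    # Assumed hints; the exact package names depend on the platform.
    hints = {
        "pdftoppm": "part of poppler (poppler-utils on Debian/Ubuntu)",
        "sz": "part of lrzsz",
    }
    for tool in missing_tools(hints):
        print(f"missing: {tool} ({hints[tool]})")
```

Running such a check up front would turn a failure late in the script into a clear message at the start.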

Error when generating figures

Hi,

I'm currently running analysis and it seems to be kinda working but it's dying on generating figures.

Traceback (most recent call last):
  File "genFig.py", line 550, in <module>
    genHistTex(dict_breakdown_m2, path_template_breakdown, path_tex_base_breakdown, corpus_type, model_name2, data_mn_attr_range[corpus_type][model_name2], dict_data_bucketInfo[corpus_type])
  File "genFig.py", line 115, in genHistTex
    xticklabel = getTuple(eval(xticklabel_list[idx].split(":")[1]))

Here are scripts to completely reproduce the error, just check out the "analysis_scripts" branch of the repo and run run_all_analyses.sh: https://github.com/neubig/masakhane-ner/tree/analysis_scripts

Could you please help take a look?

Can this be used on a custom dataset?

Support for other NER datasets

Hi,

I really enjoyed reading the paper (great analysis and good attributes for NER tasks) and love the idea of both a leaderboard with detailed information and comparisons between other models ❤️

So I would like to ask whether you plan to add support for other NER datasets, e.g. CoNLL-2003 (German, both the revisited and the original dataset), CoNLL-2002 (Spanish and Dutch) or even GermEval 2014 (which is an awesome resource for German NER).

I would love to provide you with the output of the systems that we've trained for the CoNLL datasets (see paper here with Alan from Flair) and for GermEval (see paper here with the people from Deepset).

Additionally, there's a corrected version of the English CoNLL coming soon (see here) and adding support for it on the ExplainaBoard would be really awesome (to compare the "uncorrected" vs. corrected dataset).

Many thanks for the great work!

Stefan

Bug: Different models generate same breakdown performance pics.

Describe the bug

The generated HTML shows exactly the same results for different models (Flair and ELMo).

Something has clearly gone wrong, because Flair and ELMo show identical break-down performance.
The same bug affects the self-diagnosis figures.

Debug and Fix

At first I thought the problem might be in the main program, but the text output was correct:

# break-down performance
Flair
eCon	0:0.8938461538461538 1:0.8558758314855874 2:0.9645621181262729 3:0.9733250620347393
tCon	0:0.8864388092613011 1:0.8766129032258065 2:0.9627551020408164 3:0.985190670122177
eFre	0:0.8963153384747216 1:0.9464209172738963 2:0.9558373414954089 3:0.960822722820764
tFre	0:0.9345070422535211 1:0.9395325203252033 2:0.9451523545706371 3:0.9270331083252974
eLen	0:0.9355970253963799 1:0.9315525876460768 2:0.8631578947368422 3:0.8507462686567164
sLen	0:0.9378569029224051 1:0.9269841269841269 2:0.9219178082191781 3:0.9319938176197835
eDen	0:0.9208025343189017 1:0.9346981997882103 2:0.9467226348078787 3:0.9323786793953858
oDen	0:0.9352896914973664 1:0.9170383586083855 2:0.89103690685413 3:0.9480401093892433
tag	0:0.9374064091045223 1:0.8423295454545454 2:0.9177877428998505 3:0.974058060531192

ELMo
eCon	0:0.8814928818776453 1:0.8501118568232663 2:0.9590263691683569 3:0.9699624530663328
tCon	0:0.8745598591549296 1:0.872125857200484 2:0.958141909137315 3:0.9838380085454208
eFre	0:0.8852177644282343 1:0.9394589952769429 2:0.9527145359019265 3:0.9527559055118111
tFre	0:0.9295774647887324 1:0.9336721728081323 2:0.9434903047091413 3:0.9157792836398838
eLen	0:0.9279887482419129 1:0.9234209055338177 2:0.8482328482328482 3:0.8467153284671532
sLen	0:0.931986531986532 1:0.9090265486725665 2:0.9142661179698217 3:0.9323017408123792
eDen	0:0.9160789844851905 1:0.9270538243626062 2:0.9272080232934325 3:0.9330677290836653
oDen	0:0.9269641734758014 1:0.907953529937444 2:0.8804920913884007 3:0.9451553930530164
tag	0:0.9340956966596449 1:0.8133903133903134 2:0.9083308450283668 3:0.9715170278637771

# self-diagnosis 
Flair
eCon	1:0.8558758314855874 3:0.9733250620347393 0.1174492305491519
tCon	1:0.8766129032258065 3:0.985190670122177 0.10857776689637044
eFre	0:0.8963153384747216 3:0.960822722820764 0.06450738434604242
tFre	3:0.9270331083252974 2:0.9451523545706371 0.01811924624533967
eLen	3:0.8507462686567164 0:0.9355970253963799 0.08485075673966347
sLen	2:0.9219178082191781 0:0.9378569029224051 0.015939094703226964
eDen	0:0.9208025343189017 2:0.9467226348078787 0.02592010048897697
oDen	2:0.89103690685413 3:0.9480401093892433 0.05700320253511337
tag	1:0.8423295454545454 3:0.974058060531192 0.13172851507664662

ELMo
eCon	1:0.8501118568232663 3:0.9699624530663328 0.11985059624306649
tCon	1:0.872125857200484 3:0.9838380085454208 0.11171215134493684
eFre	0:0.8852177644282343 3:0.9527559055118111 0.06753814108357681
tFre	3:0.9157792836398838 2:0.9434903047091413 0.02771102106925749
eLen	3:0.8467153284671532 0:0.9279887482419129 0.08127341977475966
sLen	1:0.9090265486725665 3:0.9323017408123792 0.02327519213981266
eDen	0:0.9160789844851905 3:0.9330677290836653 0.01698874459847477
oDen	2:0.8804920913884007 3:0.9451553930530164 0.06466330166461576
tag	1:0.8133903133903134 3:0.9715170278637771 0.15812671447346371

So the math is correct. After some digging, I found the actual problem in genFig.py (lines 467-489):

    elif block.find("break-down performance") != -1:
        metaInfo_m1 = extValue(block, model_name1+":\n", "\n\n")
        metaInfo_m2 = extValue(block, model_name2+":\n", "\n\n")
        dict_breakdown_m1 = str2dict(metaInfo_m1)
        dict_breakdown_m2 = str2dict(metaInfo_m2)

    elif block.find("self-diagnosis") != -1:
        metaInfo_m1 = extValue(block, model_name1+":\n", "\n\n")
        metaInfo_m2 = extValue(block, model_name2+":\n", "\n\n")
        dict_self_diag_m1 = str2dict(metaInfo_m1)
        dict_self_diag_m2 = str2dict(metaInfo_m2)

    elif block.find("aided-diagnosis line-chart") != -1:
        metaInfo_m1_2 = extValue(block, model_name1+"_"+model_name2+ ":\n", "\n\n")
        dict_aided_diag_hist_m1_2 = str2dict(metaInfo_m1_2)

    elif block.find("aided-diagnosis heatmap") != -1:
        metaInfo_m1_2 = extValue(block, model_name1+"_"+model_name2+ ":\n", "\n\n")
        dict_aided_diag_heatmap_m1_2 = str2dict(metaInfo_m1_2)

extValue() is called with a start marker such as model_name1+":\n". However, if you look at the block (the first argument it takes), the model name is not followed by ':\n'; it is followed by just '\n'.
So the colons need to be removed from all of the markers:

    elif block.find("break-down performance") != -1:
        metaInfo_m1 = extValue(block, model_name1+"\n", "\n\n")
        metaInfo_m2 = extValue(block, model_name2+"\n", "\n\n")
        dict_breakdown_m1 = str2dict(metaInfo_m1)
        dict_breakdown_m2 = str2dict(metaInfo_m2)

    elif block.find("self-diagnosis") != -1:
        metaInfo_m1 = extValue(block, model_name1+"\n", "\n\n")
        metaInfo_m2 = extValue(block, model_name2+"\n", "\n\n")
        dict_self_diag_m1 = str2dict(metaInfo_m1)
        dict_self_diag_m2 = str2dict(metaInfo_m2)

    elif block.find("aided-diagnosis line-chart") != -1:
        metaInfo_m1_2 = extValue(block, model_name1+"_"+model_name2+ "\n", "\n\n")
        dict_aided_diag_hist_m1_2 = str2dict(metaInfo_m1_2)

    elif block.find("aided-diagnosis heatmap") != -1:
        metaInfo_m1_2 = extValue(block, model_name1+"_"+model_name2+ "\n", "\n\n")
        dict_aided_diag_heatmap_m1_2 = str2dict(metaInfo_m1_2)

With this change, the figures are generated correctly for each model.

Because Flair and ELMo perform almost the same, the effect of the fix is hard to see here, but with other models the difference is obvious.
I will submit a pull request for this fix. It's not a big deal, but I am really happy to be a part of this gorgeous project!
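
The marker mismatch described in this issue can be reproduced in isolation. The sketch below does not use the real genFig.py code: `ext_value` is a hypothetical stand-in that returns the text between a start and an end marker, `parse_breakdown` is a hypothetical parser for the break-down lines only, and the sample block is a shortened copy of the output quoted above. How the real `extValue` behaves when the start marker is missing is not shown in the issue, so here it is assumed to raise.

```python
def ext_value(block: str, start: str, end: str) -> str:
    """Hypothetical stand-in for extValue: text between `start` and `end`."""
    begin = block.find(start)
    if begin == -1:
        raise ValueError(f"start marker {start!r} not found")
    begin += len(start)
    stop = block.find(end, begin)
    return block[begin:] if stop == -1 else block[begin:stop]


def parse_breakdown(section: str) -> dict:
    """Parse lines like 'eCon<TAB>0:0.89 1:0.85' into {attr: {bucket: score}}."""
    result = {}
    for line in section.strip().splitlines():
        attr, values = line.split("\t", 1)
        pairs = (value.split(":") for value in values.split())
        result[attr] = {int(bucket): float(score) for bucket, score in pairs}
    return result


# Shortened copy of the text output quoted in the issue.
block = (
    "# break-down performance\n"
    "Flair\n"
    "eCon\t0:0.8938461538461538 1:0.8558758314855874\n"
    "\n"
    "ELMo\n"
    "eCon\t0:0.8814928818776453 1:0.8501118568232663\n"
    "\n"
)

# The original marker (with a colon) never occurs in the block ...
assert block.find("Flair:\n") == -1

# ... whereas the marker without the colon selects each model's own section.
flair = parse_breakdown(ext_value(block, "Flair\n", "\n\n"))
elmo = parse_breakdown(ext_value(block, "ELMo\n", "\n\n"))
```

With the colon-free markers, `flair` and `elmo` hold different scores, which is what the per-model figures should reflect.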

Proposal to make adding new datasets easier

Hi,

I have a proposal to make adding new datasets easier:

For each dataset, consolidate the location of all the related data into something like:

data/ner/conll03/conll03.conf
data/ner/conll03/data
data/ner/conll03/results
data/ner/conll03/precomputed

so it's clear where all the data is.

Show an example of going from just the "data" and "results" directory (i.e. stuff that's in easy-to-interpret formats such as CoNLL format) to a full report, including the precomputation.

What do you think? This would be very helpful and hopefully not too much work?
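
The per-dataset layout proposed above could be captured by a small path helper, sketched here under stated assumptions. `DatasetLayout` and its method names are hypothetical; only the directory and file names come from the proposal.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import List


@dataclass(frozen=True)
class DatasetLayout:
    """Hypothetical helper resolving the proposed <root>/<task>/<name>/ layout."""
    root: Path
    task: str
    name: str

    @property
    def base(self) -> Path:
        return self.root / self.task / self.name

    @property
    def config(self) -> Path:
        return self.base / f"{self.name}.conf"

    @property
    def data(self) -> Path:
        return self.base / "data"

    @property
    def results(self) -> Path:
        return self.base / "results"

    @property
    def precomputed(self) -> Path:
        return self.base / "precomputed"

    def missing_inputs(self) -> List[Path]:
        """Return the required inputs (config, data, results) that do not exist."""
        return [p for p in (self.config, self.data, self.results) if not p.exists()]


layout = DatasetLayout(Path("data"), "ner", "conll03")
print(layout.config.as_posix())  # data/ner/conll03/conll03.conf
```

Treating `precomputed` as an output rather than a required input matches the second part of the proposal, where the report is built from only the "data" and "results" directories.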

Some pre-computed features not available, causing crashes

Hi,

I'm trying to run on a new dataset, and some features like eCon seem to require precomputation that is not documented anywhere. Here is the line that's failing:
https://github.com/neulab/InterpretEval/blob/master/interpretEval/tensorEvaluation-ner.py#L294

Could you give some guidance?

How to upload the model predictions?

Hi,
I could not figure out how to upload the file containing model predictions to ExplainaBoard.
I have a NER model for the CoNLL-2003 dataset and also a file containing its dev and test predictions.
Is it possible to analyze our model by submitting the test predictions to ExplainaBoard?
Thanks.