flipz357 / smatchpp
A package for handy processing of semantic graphs such as AMR, with a special focus on standardized evaluation
License: GNU General Public License v3.0
It would be useful if we could retrieve all scores from the bootstrapping. This helps when we want to compare multiple systems for significance. Since you are using scipy bootstrap, I think you can just optionally also return the "bootstrap_distribution".
Secondly, for reproducibility, it might be a good idea to allow the option to provide a random state (fixed seed), which is then passed to scipy's bootstrap function (the random_state parameter).
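Both requests map directly onto `scipy.stats.bootstrap`; a minimal sketch (with made-up stand-in scores, not smatchpp output):

```python
import numpy as np
from scipy import stats

# Stand-in per-pair F1 scores; with smatchpp these would come from the corpus.
scores = np.array([0.70, 0.80, 0.65, 0.90, 0.75, 0.85, 0.60, 0.95, 0.72, 0.88])

# random_state makes the resamples reproducible; bootstrap_distribution
# (exposed on the result since scipy 1.10) holds every resampled statistic,
# which is exactly what a downstream significance test needs.
res = stats.bootstrap((scores,), np.mean, n_resamples=2000,
                      confidence_level=0.95,
                      random_state=np.random.default_rng(42))

print(res.confidence_interval)           # ConfidenceInterval(low=..., high=...)
print(res.bootstrap_distribution.shape)  # (2000,)
```

So returning the distribution would likely just mean passing `res.bootstrap_distribution` through instead of discarding it.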
I am running this command:
python -m smatchpp -a $1 \
-b $2 \
-solver ilp \
-edges dereify \
-score_dimension main \
-score_type micromacro \
-log_level 20 \
-output_format json \
--bootstrap \
--remove_duplicates
And getting this output:
-------------------------------
-------------------------------
---------Micro scores----------
-------------------------------
-------------------------------
{
"main": {
"F1": {
"result": 27.12,
"ci": [
24.08,
30.51
]
},
"Precision": {
"result": 26.44,
"ci": [
22.81,
30.18
]
},
"Recall": {
"result": 27.84,
"ci": [
24.29,
31.72
]
}
}
}
-------------------------------
-------------------------------
---------Macro scores----------
-------------------------------
-------------------------------
{
"main": {
"F1": {
"result": 27.24,
"ci": [
23.99,
30.41
]
},
"Precision": {
"result": 28.61,
"ci": [
24.83,
32.7
]
},
"Recall": {
"result": 28.84,
"ci": [
24.92,
32.74
]
}
}
}
This is not json :)
Hello
I've just reimplemented my neural network training pipeline and, instead of smatch, I am now using smatchpp. Overall this works great, so thank you for your work!
Unfortunately, however, I sometimes get a fatal error that disrupts the whole training loop and cannot be recovered from. I have also reported this here. I do not know how to debug this, so I am wondering/hoping that you have seen a similar issue with mip while testing your library.
This is the error trace, but I can't figure out how to read it. Is mip the trigger, or is torch? Does it have to do with distributed training? How can I debug this? A lot of questions... Any insights are very welcome, because this is stopping me from using smatchpp in my code, as it completely destroys the training progress. smatch does not rely on mip as far as I know.
ERROR while running Cbc. Signal SIGABRT caught. Getting stack trace.
/home/local/vanroy/multilingual-text-to-amr/.venv/lib/python3.11/site-packages/mip/libraries/cbc-c-linux-x86-64.so(_Z15CbcCrashHandleri+0x119) [0x7f5f955c3459]
/lib64/libc.so.6(+0x54df0) [0x7f6697654df0]
/lib64/libc.so.6(+0xa154c) [0x7f66976a154c]
/lib64/libc.so.6(raise+0x16) [0x7f6697654d46]
/lib64/libc.so.6(abort+0xd3) [0x7f66976287f3]
/lib64/libstdc++.so.6(+0xa1a01) [0x7f66938a1a01]
/lib64/libstdc++.so.6(+0xad37c) [0x7f66938ad37c]
/lib64/libstdc++.so.6(+0xad3e7) [0x7f66938ad3e7]
/lib64/libstdc++.so.6(+0xad36f) [0x7f66938ad36f]
/home/local/vanroy/multilingual-text-to-amr/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so(_ZN4c10d16ProcessGroupNCCL8WorkNCCL15handleNCCLGuardENS_17ErrorHandlingModeE+0x278) [0x7f64d9cbd4d8]
/home/local/vanroy/multilingual-text-to-amr/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so(_ZN4c10d16ProcessGroupNCCL15workCleanupLoopEv+0x19f) [0x7f64d9cc102f]
/lib64/libstdc++.so.6(+0xdb9d4) [0x7f66938db9d4]
/lib64/libc.so.6(+0x9f802) [0x7f669769f802]
/lib64/libc.so.6(+0x3f450) [0x7f669763f450]
Hi, many thanks for providing this convenient tool to calculate SMATCH++! I had a problem when running the script. When comparing the following two graphs, I got an error below. Could you please offer some information on the solution? Thanks!
(s10196 / time.n.08
:member-of (b10194 / box
:member (s10193 / "Oxford country"
:Part s10195)
:member (s10195 / city.n.01
:Name s10196)))
(b0 / "box"
:member (s0 / "city.n.01"
:Name "Bilbao"
:Part-of (s3 / "country.n.02"
:Name "Basque Country"))
:member (s1 / "be.v.01"
:Theme s0
:Time (s2 / "time.n.08"
:EQU "now"))
:member s2
:member s3)
File "MA_Thesis/SBN-evaluation-tool/2.evaluation-tool-detail/smatchpp/smatchpp/data_helpers.py", line 120, in _string2graph
triple = (tmpsrc[nested_level], tmprel[nested_level], tgt)
KeyError: 3
For meaningful evaluation of any kind of graph, without any pre-processing of the graphs, we can call:
python -m smatchpp -a <graphs1> \
-b <graphs2> \
-solver ilp \
-score_type micromacro \
--bootstrap
Currently, to achieve best eval practice for AMR, we call:
python -m smatchpp -a <graphs1> \
-b <graphs2> \
-solver ilp \
-syntactic_standardization dereify \
-score_type micromacro \
--bootstrap \
--remove_duplicates
So we note that there are some general options that may be useful for ALL graphs (optimizer, bootstrap), but also more or less AMR-specific arguments (dereify, maybe also remove duplicates).
The proposal would be to disallow the very AMR-specific arguments and instead pass an argument like -graph_type amr
that loads a best-practice AMR-specific standardization model that must be defined somewhere (e.g., maybe just in a standardizer object, similar to what is already the case). Then we can access best-practice AMR eval more simply as:
python -m smatchpp -a <graphs1> \
-b <graphs2> \
-solver ilp \
-graph_type amr \
-score_type micromacro \
--bootstrap
This would also highlight how to customize well for other potential kinds of graphs.
Thanks for this library!
It would be great if the API had some easy access point for scoring in Python, like
score(graph_a: str, graph_b: str): scoring two penman graphs against each other
score_all(graphs_a: List[str], graphs_b: List[str]): scoring two corpora against each other
That should make it easier to use the library during machine learning experiments, for instance. It is currently not clear how that can be done.
There are some Python examples here, but for someone who does not know the implementation details it is hard to know what to use. In other words, how can I run the same commands as written here, but directly in Python, and not with files but with in-memory objects (lists of penman strings)?
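For illustration, here is a toy sketch of the interface shape being requested. None of these functions exist in smatchpp; the scoring below is naive exact-triple overlap with no variable alignment, so it only demonstrates the signatures, not Smatch semantics:

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

def score(triples_a: List[Triple], triples_b: List[Triple]) -> Dict[str, float]:
    # Toy stand-in: F1 over exact triple overlap (no variable alignment).
    matches = len(set(triples_a) & set(triples_b))
    p = matches / len(triples_a) if triples_a else 0.0
    r = matches / len(triples_b) if triples_b else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return {"Precision": p, "Recall": r, "F1": f1}

def score_all(corpus_a: List[List[Triple]],
              corpus_b: List[List[Triple]]) -> List[Dict[str, float]]:
    # One result dict per graph pair; micro/macro aggregation left out.
    return [score(a, b) for a, b in zip(corpus_a, corpus_b)]

a = [("p", ":instance", "person"), ("TOP", ":top", "p")]
b = [("p", ":instance", "person")]
print(score(a, b))  # Precision 0.5, Recall 1.0, F1 ~0.667
```

The real thing would of course take penman strings and run the alignment solver, but the call shape above is what would be convenient in a training loop.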
Thanks
Hi
after creating an issue for https://github.com/snowblink14/smatch I found a case where Smatch and Smatchpp score very differently, and I'm not sure which is the ideal/correct score:
# ::snt The boy is a hard worker.
(p / person
:domain (b / boy)
:ARG0-of (w / work-01
:manner (h / hard)))
and
(w / worker
:mod (h / hard)
:domain (b / boy))
give Precision: 0.5000, Recall: 0.6667, F-score: 0.5714 with Smatch and
F1: 42.86 Precision: 37.5 Recall: 50.0 with Smatchpp (with hillclimber and ilp)
Which score is correct ?
Transformed in triples (S,P,O) the two graphs correspond to
TOP :top p
p :instance person
b :instance boy
w :instance work-01
h :instance hard
p :domain b
w :ARG0 p
w :manner h
(8 edges)
and
TOP :top w
w :instance worker
h :instance hard
b :instance boy
w :mod h
w :domain b
(6 edges)
Smatch (-v) aligns p(person)-w(worker) b(boy)-b(boy) w(work-01)-Null h(hard)-h(hard) (2 correct)
and aligns the incoming relations domain and top (another 2).
that is 4 matches; with 8 triples in the first graph and 6 in the second, it calculates precision 4/8 (0.5), recall 4/6 (0.66...) and F1 (2*P*R/(P+R) = 0.5714)
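That arithmetic can be checked directly (counts taken from the alignment described above):

```python
matches = 4   # triples matched under smatch's alignment
n_first = 8   # triples in the first graph
n_second = 6  # triples in the second graph

precision = matches / n_first                        # 0.5
recall = matches / n_second                          # 0.666...
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean
print(round(precision, 4), round(recall, 4), round(f1, 4))  # 0.5 0.6667 0.5714
```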
Who is right in your opinion, smatch or smatchpp ? Evaluation of AMR predictions depend on this, for instance a prediction on the AMR3.0 test file scores P: 0.8071, R: 0.8371, F1: 0.8218 with Smatch (micro) and P: 80.5, R: 83.44, F1: 81.94 with SmatchPP (micro).
As also discussed in this issue in the penman library by @goodmami, the license situation of the predicate file is not 100% clear (folks suspect it's under a public license, but nobody seems to know for sure).
The file is only relevant for advanced finer semantic scoring, so I propose to download it on demand, and remove it from the repository assets.
I think it would be interesting to have the option for some additional useful but optional AMR graph transformations:
They could make sense for some applications or for parsing diagnostics.
To start with some examples:
Sense2Node
E.g.,
(j / jump-01
:arg0 (f / frog))
would map to
(j / jump
:sense 01
:arg0 (f / frog))
This transformation may be useful for some applications (and something like this could also give us a balanced mix of the concept-as-root and anonymous root issue @goodmami @jheinecke)
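A minimal sketch of what such a Sense2Node transform could look like on triples (the function name and triple format are assumptions for illustration, not smatchpp code):

```python
import re

def sense2node(triples):
    # Split a PropBank-style concept like "jump-01" into the bare concept
    # plus an explicit :sense triple; all other triples pass through.
    out = []
    for (src, rel, tgt) in triples:
        m = re.match(r"^(.*)-(\d\d)$", tgt) if rel == ":instance" else None
        if m:
            out.append((src, ":instance", m.group(1)))
            out.append((src, ":sense", m.group(2)))
        else:
            out.append((src, rel, tgt))
    return out

triples = [("j", ":instance", "jump-01"),
           ("j", ":arg0", "f"),
           ("f", ":instance", "frog")]
print(sense2node(triples))
```

With this, a sense mismatch costs one triple rather than the whole concept match, which is the kind of softer diagnostic the proposal is after.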
AbstractifyNode
E.g.,
(c / city
:name (n / name
:op1 "Berlin"))
would map to
(l / location
:name (n / name
:op1 "Berlin"))
This transformation may be useful for an additional informative parsing evaluation score. We could use the concept groups that I started building here
And so on...
I tried this and it crashed on one dataset but worked on another. Here's the test set that crashed:
test-gold.txt and test-pred.txt.
python3 -m smatchpp -a $GOLD \
-b $PRED \
-solver ilp \
-syntactic_standardization dereify \
-score_dimension main \
-score_type micromacro \
-log_level 20 \
--bootstrap
With the traceback:
...
2023-11-28 16:16:00,233 - __main__ - INFO - bindings - graph pairs processed: 1500; time for last 100 pairs: 2.1990966796875
2023-11-28 16:16:03,175 - __main__ - INFO - bindings - graph pairs processed: 1600; time for last 100 pairs: 2.9419069290161133
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/bjascob/.local/lib/python3.10/site-packages/smatchpp/__main__.py", line 162, in <module>
match_dict, status = SMATCHPP.process_corpus(amrs, amrs2)
File "/home/bjascob/.local/lib/python3.10/site-packages/smatchpp/bindings.py", line 128, in process_corpus
match, tmpstatus, _ = self.process_pair(a, amrs2[i])
File "/home/bjascob/.local/lib/python3.10/site-packages/smatchpp/bindings.py", line 71, in process_pair
g2 = self.graph_reader.string2graph(string_g2)
File "/home/bjascob/.local/lib/python3.10/site-packages/smatchpp/interfaces.py", line 60, in string2graph
triples = self._string2graph(string)
File "/home/bjascob/.local/lib/python3.10/site-packages/smatchpp/data_helpers.py", line 85, in _string2graph
triple = (tmpsrc[nested_level], tmprel[nested_level], stringtok)
KeyError: 5
Looks like you've got an array indexing error.
I don't need this fixed, but since I spotted the issue I thought I'd pass it along.
Smatchpp already can read tsv, but it cannot write tsv yet:
Expected actions
triples = [("ROOT_OF_GRAPH", ":root", "t"),
("t", ":instance", "test")]
from smatchpp.preprocess import TSVWriter
TSVWriter.graph2string(triples)
Expected output of graph2string:
ROOT_OF_GRAPH t :root
t test :instance
Note that any root should be explicit, since tsv is more general than penman (which always has an implied root).
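As a starting point, the writer could be as simple as this sketch (a standalone function, not the proposed `TSVWriter` API, and using tabs where the example above shows spaces):

```python
def graph2string(triples):
    # Emit one line per (source, relation, target) triple in the
    # source <TAB> target <TAB> relation column order shown above.
    return "\n".join(f"{src}\t{tgt}\t{rel}" for (src, rel, tgt) in triples)

triples = [("ROOT_OF_GRAPH", ":root", "t"),
           ("t", ":instance", "test")]
print(graph2string(triples))
```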
If I understand the output correctly, when we are bootstrapping we get results like this:
"result": 81.3,
"ci": [
80.67,
81.89
]
I think that the result is calculated independently, on the full corpus, and ci is the 95% CI min/max. It would be useful to also include the estimated mean based on the bootstrap. As far as I can tell, this is common in research papers too, where you report "85 +- 1.2": 85 is the estimated mean and 1.2 the CI half-width at 95% confidence.
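A sketch of how the bootstrap mean and the "mean +- half-width" style could be derived from the resampled statistics (the scores here are synthetic stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.uniform(0.75, 0.90, size=200)   # stand-in per-pair F1 scores

# Plain percentile bootstrap over the corpus mean.
boot = np.array([rng.choice(scores, size=scores.size, replace=True).mean()
                 for _ in range(2000)])
mean = boot.mean()                            # the requested bootstrap mean
lo, hi = np.percentile(boot, [2.5, 97.5])     # 95% CI endpoints

print(f"{100 * mean:.2f} +- {100 * (hi - lo) / 2:.2f}")
```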