nert-nlp / streusle
STREUSLE: a corpus with comprehensive lexical semantic annotation (multiword expressions, supersenses)
License: Creative Commons Attribution Share Alike 4.0 International
In our annotation for the LREC paper @ablodge and I disagreed on some tokens. The guidelines have since been revised. We should go through the disagreements and adjudicate them.
https://github.com/nert-gu/streusle/blob/4079879416f066e5faa1887682dc888c56d861cc/munge.py#L29
This explains why there is a token tagged 'Reciprocation', which is not a v2 label.
Use this thread to make note of interesting constructions where a words-with-spaces MWE analysis is unsatisfying because there is constrained productivity in certain parts of the expression.
Searching for `$` makes it easy to find idioms with an opaque possessive slot (e.g. "quick on X's feet").
These are tokens with a null-valued function or role. We should figure out what to do with these on the Xposition site.
reviews-008635-0002 with
reviews-010820-0003 in
reviews-010820-0011 on
reviews-017235-0005 for
reviews-021370-0001 for
reviews-026641-0003 with
reviews-029870-0002 for
reviews-030430-0002 from
reviews-034320-0003 to
reviews-035726-0002 in
reviews-039383-0004 on
reviews-042416-0005 to
reviews-045753-0001 in
reviews-053248-0017 my
reviews-081796-0005 for
reviews-081934-0003 since
reviews-081934-0003 to
reviews-081934-0004 as
reviews-081934-0006 to
reviews-088954-0001 of
reviews-093655-0007 as
reviews-121651-0003 on
reviews-158740-0002 of
reviews-160073-0001 on
reviews-163250-0004 for
reviews-192713-0008 like
reviews-193257-0003 to
reviews-207629-0003 for
reviews-211797-0004 like
reviews-217359-0006 for
reviews-225632-0005 to
reviews-228731-0001 with
reviews-228944-0003 on
reviews-311138-0003 to
reviews-323449-0007 to
reviews-326649-0006 like
reviews-329692-0011 on
reviews-332068-0002 for
reviews-333672-0006 as
reviews-336049-0004 out
reviews-339176-0006 with
reviews-348247-0006 for
reviews-372665-0004 with
reviews-376503-0004 with
reviews-377347-0005 like
reviews-382257-0002 in
reviews-391012-0006 through
conllulex2json.py warns about these. Some others may be reflected in !!@ tokens (#15).
"I've been here [3 to 4] times" (325292.2)
"to" currently has gov "4" and obj null, but should have gov "3" and obj "4".
related to #38
LEXLEMMA should be empty for non-initial tokens of a strong MWE. But this is not the case for 2 tokens ("appointment", "in").
There is also 1 token that is not part of a weak MWE yet has a WLEMMA ("all").
Probably a trivial question, but when would you use .goldid or .autoid in the predictions file for psseval.py? What do they each mean? Thanks!
"as soon as" currently has "soon" as its obj, but since it is a strong MWE, its obj should be the obj of the second "as".
reviews-338429-0004
We annotate MWEs where there are superfluous authorial spaces ("miss informed", "mean time"), but the lemma retains the space. The UD relation goeswith should be exploited to delete the space. We should also consider enforcing consistency between the MWE annotation and goeswith; right now there are goeswith annotations without corresponding MWE annotations.
Can this line be removed?
https://github.com/nert-gu/streusle/blob/951a25933d43581982471f6afb4e02e1f5b6c5d2/munge.py#L2
I think the UD_English repository may contain some recent syntactic fixes (lemmas, tags, trees) that have not been incorporated into streusle.conllulex. Need a script to take the not-to-release/sources/reviews/*.conllu files and streusle.conllulex and simply replace the first 10 columns of the latter with the former.
After running the script, be sure to examine the diff to ensure there weren't local fixes that got clobbered. Some may be due to outstanding pull requests: https://github.com/UniversalDependencies/UD_English/pulls
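A minimal sketch of the column-replacement step, assuming the .conllu sources and streusle.conllulex present the same sentences and tokens in the same order (the function name and the lines-in/lines-out interface are illustrative):

```python
def merge_ud_columns(conllu_lines, conllulex_lines):
    """Replace columns 1-10 of each token line in conllulex_lines with
    the corresponding columns from conllu_lines, keeping columns 11+."""
    out = []
    conllu_iter = iter(conllu_lines)
    for lex_line in conllulex_lines:
        # comment and blank lines pass through unchanged
        if not lex_line or lex_line.startswith('#'):
            out.append(lex_line)
            continue
        # advance the .conllu side to its next token line
        ud_line = next(conllu_iter)
        while not ud_line or ud_line.startswith('#'):
            ud_line = next(conllu_iter)
        ud_cols = ud_line.split('\t')[:10]
        lex_cols = lex_line.split('\t')
        # word forms should normally match; a mismatch signals misalignment
        # (or an upstream form fix worth inspecting in the diff)
        assert ud_cols[1] == lex_cols[1], 'token mismatch: ' + ud_cols[1]
        out.append('\t'.join(ud_cols + lex_cols[10:]))
    return out
```

The assert doubles as a coarse guard against clobbering local fixes: any realignment or form change surfaces immediately instead of silently corrupting the merged file.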
Hi!
I'd like to build a system to predict each token's lextag. I think the evaluation script for this is streusleval.py?
If so, it doesn't seem like it's part of the latest release? Also, is the data the same between 4.1 and the master ref? Not sure what the release cycle looks like for STREUSLE, but could be nice to have a minor release with all the improvements since last July :)
Should probably be so_that when used as SCONJ
For adposition coordination like "up and down the street", results of this query have the second preposition attached as conj to the object rather than to the first preposition.
Mentioned in UniversalDependencies/UD_English-EWT#49
govobj.py can be improved to deal with various syntactic edge cases, some of which result in the undesirable property of listing a SNACS-tagged unit as governed by another SNACS-tagged unit:
- AS1(gov=much, obj=null), AS2(gov=as, obj=tell). Instead, do AS1:p.Cost~>p.Extent(gov=pay, obj=much), AS2:p.ComparisonRef(gov=much, obj=tell).
- advmod(there, out), "back home".
Note that a rare legitimate case of a SNACS expression governing another SNACS expression is in a predicate complement: "they were OUT(g) FROM surgery", "I was IN(g) two weeks AGO".
@ablodge Thanks for uploading the files. I've moved them to the prepare-4.0 branch. Can you also add the spreadsheets that the script uses (since you changed the column names)?
BTW I've changed your script so that the input is streusle_v3.sst and the output is streusle_v4.sst. (v1 of preposition supersenses corresponds to v3 of the STREUSLE corpus.)
In some sentences all strong MWEs are numbered before weak ones; in others the numbering is by token offset.
This does not matter for the semantics, but it means that equivalent files will be superficially different. So perhaps we should enforce a normal form for numbering MWEs.
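A sketch of one possible normal form, numbering groups by the offset of their first token. It assumes group numbers map to token-offset lists, as in the smwes/wmwes objects of the JSON format; the function name is illustrative:

```python
def renumber_mwes(groups):
    """groups: dict of MWE group number -> list of 1-based token offsets.
    Returns an equivalent dict renumbered 1..N by first-token offset."""
    # order group numbers by the position of each group's earliest token
    order = sorted(groups, key=lambda g: min(groups[g]))
    mapping = {old: new for new, old in enumerate(order, start=1)}
    return {mapping[g]: sorted(toks) for g, toks in groups.items()}
```

Applying this when serializing would make equivalent analyses byte-identical, so diffs show only real annotation changes.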
In the script for #41:
Lines 53 to 62 in 09014b4
Lines 124 to 129 in 09014b4
Should be the_best (when predicative) and the_best_of_the_best? Lexcat=ADJ?
Re: #40, we need a script that takes lextags (full tags, one per token) output by a system and parses them to extract MWE groupings.
Lextags are the 19th and final column in the .conllulex format. Columns 1-10 are UD. Columns 11-18 can be filled in based on UD+lextags.
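A sketch of the MWE-grouping half of that script, reading only the positional part of each lextag (the part before the first hyphen). The tag inventory assumed here is O/B/I_/I~ plus lowercase o/b/i_/i~ for tokens inside another MWE's gap; weak (~) links are not chained into weak groups in this sketch:

```python
def strong_groups(mwe_tags):
    """mwe_tags: per-token MWE positional tags, e.g. ['O','B','I_','o','I_'].
    Returns strong MWE groups as lists of 1-based token offsets."""
    groups = []
    current = {0: None, 1: None}  # one in-progress unit per gap level
    for i, tag in enumerate(mwe_tags, start=1):
        level = 1 if tag[0].islower() else 0  # lowercase = inside a gap
        t = tag.upper()
        if t in ('B', 'I~'):   # B starts an MWE; I~ starts a new strong unit
            current[level] = [i]
            groups.append(current[level])
        elif t == 'I_':        # strong continuation of the current unit
            current[level].append(i)
        else:                  # 'O': not part of any MWE
            current[level] = None
    # length-1 units are single tokens weakly linked to others, not strong MWEs
    return [g for g in groups if len(g) > 1]
```

Columns 11-18 could then be filled by pairing these groups with the lexcat/supersense parts of the same lextags.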
rather than sorting the sentence IDs lexicographically. This will make it easier to do a diff between the two versions.
When creating the CSV files I neglected to realize that my spreadsheet editor was exporting hidden rows and columns.
(and maybe the number of gaps, strong vs. weak gaps, number of tokens in gaps, etc.)
67051:5 4 4 SCONJ IN _ 6 mark 6:mark _ _ P 4 p.Explanation p.Explanation _ _ _ O-P-p.Explanation
Ken Litkowski noticed that PP is erroneously the lexcat for in hope to, just about, and nothing but, which should be P. This is because the UPOS of the last word in the MWE is PART, ADV, or CCONJ. Under the current heuristics in lexcatter.py, an MWE is treated as P only if the last word is tagged as ADP or SCONJ.
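A hypothetical patch, in the spirit of the lexcatter.py heuristic rather than its actual code: fall back on the annotated supersense when the last word's UPOS is uninformative, so units like in hope to still come out as P:

```python
def is_prepositional_mwe(last_upos, supersense):
    """last_upos: UPOS of the MWE's final word; supersense: the unit's
    annotated supersense label (or None). Interface is illustrative."""
    # current heuristic: last word tagged as an adposition/subordinator
    if last_upos in ('ADP', 'SCONJ'):
        return True
    # fallback: the unit carries a SNACS (p.*) supersense despite its POS
    return supersense is not None and supersense.startswith('p.')
```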
For revising certain classes of annotations (e.g., P supersenses where the scene role is Manner) it would be useful to have a concordance view. This would put a token's context on the same line for easy sorting and batch editing, giving a more human-readable view of the lexical annotation.
Does tquery.py already do this? Should it be run when building the release to produce a row for every supersense-annotated strong lexical expression, within the train/dev/test subdirectories? This would make diffs in commit history easier to read. (Not having this in the root directory would make it clear that .conllulex is the canonical data file.)
There would need to be a script to apply supersense edits made in the concordance view back to the original. untquery.py? tupdate.py?
Is there a natural way to specify MWE edits in the concordance view, also? Currently, adding an MWE or changing the strength of an existing MWE is painful to implement by hand in .conllulex.
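A sketch of what one concordance row could look like; the column layout is invented, and toknums are the 1-based offsets of the expression's tokens:

```python
def concordance_row(sent_id, tokens, toknums, role, function, width=5):
    """One tab-separated line per supersense-annotated expression, with
    left/right context on the same line for sorting and batch editing."""
    first, last = min(toknums) - 1, max(toknums)
    left = ' '.join(tokens[max(0, first - width):first])
    expr = ' '.join(tokens[i - 1] for i in sorted(toknums))
    right = ' '.join(tokens[last:last + width])
    return '\t'.join([sent_id, left, expr, right, role, function])
```

Sorting such rows by the expression or supersense columns would cluster comparable usages for review.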
The version of STREUSLE in Xposition contains some annotator notes on P tokens that are not included in the official release. The notes can help clarify the interpretation of the text, provide the annotator's rationale, or help cluster different usages at a finer level of granularity than the supersenses.
Should the .conllulex format have a place for these? An extra column? Or maybe a sentence header row, as they are rare?
Should there also be a standard for releasing rich annotation history metadata (such as who annotated which token, original vs. adjudicated annotations, timestamps, ...)?
21638:15 a a ADV RB _ 16 case 16:case _ 1:1 PP a least p.Approximator p.Approximator _ _ _ B-PP-p.Approximator
p.Possessor is only used as the function if the scene role is also p.Possessor. But there are a few exceptions which may be inconsistencies: p.Originator~>p.Gestalt. Some look like they should be p.Possessor because the governor is the entity, not the transfer event.
These are sometimes analyzed as MWEs, but the annotations are inconsistent.
Note that "not even if" occurs, which would be problematic if not_even and even_if are both treated as MWEs.
We have psseval.py for SNACS (preposition/possessive) supersenses, but we should have a script that does full evaluation of MWEs and all kinds of supersenses. Cf. https://github.com/dimsum16/dimsum-data/blob/master/scripts/dimsumeval.py
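The core of such a script could start from dimsumeval-style set comparison over MWE groupings; supersense scoring and partial credit for links and gaps are omitted in this sketch:

```python
def prf(gold_groups, pred_groups):
    """Exact-match precision/recall/F1 over MWE groupings, each grouping
    a list of token offsets. Order within a grouping is ignored."""
    gold = {tuple(sorted(g)) for g in gold_groups}
    pred = {tuple(sorted(g)) for g in pred_groups}
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 1.0
    r = tp / len(gold) if gold else 1.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```

A full evaluation would report this alongside link-based MWE scores and per-class supersense accuracy, as dimsumeval.py does.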
The following 30 construals are attested between 1 and 3 times in the data, but not documented in the current guidelines. Let's look at them to see which are worth documenting, which are borderline but worth keeping in the data, and which should be reannotated.
1 p.Agent p.Locus
3 p.Beneficiary p.Gestalt
1 p.Characteristic p.Manner
1 p.ComparisonRef p.Beneficiary
1 p.Cost p.Extent
1 p.Direction p.Goal
1 p.Experiencer p.Agent
1 p.Explanation p.Manner
1 p.Extent p.Whole
2 p.Gestalt p.Purpose
1 p.Gestalt p.Source
2 p.Gestalt p.Topic
1 p.Goal p.Whole
3 p.Instrument p.Manner
1 p.Instrument p.Theme
3 p.Manner p.Source
1 p.Manner p.Topic
1 p.Means p.Path
1 p.Originator p.Instrument
1 p.Possession p.PartPortion
3 p.Possession p.Theme
2 p.Purpose p.Goal
2 p.Purpose p.Locus
1 p.Purpose p.Theme
1 p.SocialRel p.Source
3 p.Stimulus p.Source
1 p.Stimulus p.Theme
1 p.Theme p.Accompanier
1 p.Theme p.Characteristic
2 p.Time p.Extent
This is used as a stopgap lexcat for non-prepositional tokens that need to be revised, typically ones that need an N or V supersense.
Currently we can convert from .conllulex to .json, but not the reverse.
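A sketch of the reverse direction at the token level. The JSON field names and the 19-column order here are my best understanding of the format and should be checked against the CONLLULEX spec before use:

```python
def tok_to_conllulex(tok):
    """Render one token dict (as produced by conllulex2json.py, field
    names assumed) back into a 19-column .conllulex line."""
    def c(key):
        v = tok.get(key)
        return str(v) if v not in (None, '') else '_'
    # smwe/wmwe assumed to be [group, position] pairs in the JSON
    smwe = '%d:%d' % tuple(tok['smwe']) if tok.get('smwe') else '_'
    wmwe = '%d:%d' % tuple(tok['wmwe']) if tok.get('wmwe') else '_'
    return '\t'.join([c('#'), c('word'), c('lemma'), c('upos'), c('xpos'),
                      c('feats'), c('head'), c('deprel'), c('edeps'), c('misc'),
                      smwe, c('lexcat'), c('lexlemma'), c('ss'), c('ss2'),
                      wmwe, c('wlemma'), c('wcat'), c('lextag')])
```

A round-trip test (.conllulex -> .json -> .conllulex) would be the natural way to validate the full converter.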
Mentioned in #41 (comment)
https://github.com/UniversalDependencies/UD_English-EWT master branch
he is AS(g=he, c=pred, o=tall) tall AS(g=tall, o=house) a horse
he is AS(g=he, c=pred, o=tall) tall AS(g=tall, o=wide, c=subord) a horse is wide
Currently, each sentence is rendered separately with MWEs and supersenses, and color is added post hoc to annotations based on a regex.
Given the gold and predicted sentences below:
No more having_to drive|v.motion to|p.Goal San_Francisco|n.LOCATION for|p.Purpose a great mani_pedi|n.ACT .
No more having to drive to|p.Goal San Francisco for|p.Purpose a great mani pedi .
It would be nice to:
- highlight where the prediction was incorrect (maybe with a red background and white text for a missing or extra label and red text for an incorrect label, or maybe just by making the word red if either the MWE analysis or the supersense was incorrect)
- align the tokens, i.e.
No more having_to drive|v.motion to|p.Goal San_Francisco|n.LOCATION for|p.Purpose a great mani_pedi|n.ACT .
No more having to drive to|p.Goal San_Francisco for|p.Purpose a great mani pedi .
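A sketch of the token alignment, assuming gold and pred differ only in '_'-joined MWEs and trailing |label annotations, so both lines expand to the same underlying token sequence:

```python
def render_aligned(gold_line, pred_line):
    """Pad both lines so pieces covering the same underlying token start
    at the same column. MWE members stay joined by '_'."""
    def pieces(units):
        out = []  # (rendered text, is-last-piece-of-unit) per underlying token
        for unit in units:
            word, sep, label = unit.partition('|')
            parts = word.split('_')
            for j, part in enumerate(parts):
                last = j == len(parts) - 1
                # the |label, if any, attaches to the unit's final piece
                out.append((part + sep + label if last else part, last))
        return out
    g = pieces(gold_line.split())
    p = pieces(pred_line.split())
    assert len(g) == len(p), 'underlying token sequences differ in length'
    gl = pl = ''
    for (gt, gend), (pt, pend) in zip(g, p):
        w = max(len(gt), len(pt))
        gl += gt.ljust(w) + (' ' if gend else '_')
        pl += pt.ljust(w) + (' ' if pend else '_')
    return gl.rstrip(), pl.rstrip()
```

Mismatch highlighting could then compare the two rendered pieces column by column.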
egrep -v '^$' streusle.conllulex | egrep -v '^#' | cut -f13 | egrep '^be ' | sort | uniq -c
1 be a big baby
3 be a joke
1 be a nice touch
1 be a no brainer
1 be a pain
1 be a plus
1 be first call
1 be happy camper
1 be in
2 be in for a treat
2 be in hand
1 be inclined
1 be make to
1 be no more
1 be out of this world
1 be rude
1 be say and do
8 be suppose to
1 be sure
1 be sure to
1 be there
1 be there / do that
1 be through
16 be to
1 be up
"they seemed more interested~in helping me find the right car rather_then just make_ a _sale" (245160.4)
"rather then" has gov "make" and obj null, but in fact "make" should be the obj (and maybe "helping" should be the gov).
In UD, "rather" is a cc of "make", so I can see where this comes from, but it would be nice to have govobj.py handle this.
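A hypothetical post-processing rule in govobj.py terms: when a P-tagged unit is attached as cc, take the conjunct it marks as the obj and that conjunct's own governor as the gov:

```python
def fix_cc_attachment(tok, heads, deprels):
    """tok: token offset of the P unit; heads/deprels: maps from token
    offset to UD head offset / deprel (interface is illustrative)."""
    if deprels.get(tok) == 'cc':
        obj = heads[tok]   # the conjunct that the cc marks ("make")
        gov = heads[obj]   # that conjunct's own governor ("helping")
        return gov, obj
    return None
```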
Probably should be Gestalt, not Possessor
We need a way to get these variables from the json format of streusle.
E.g. "get/have something fixed", "get my hair done". For "get my hair done", should "get" be v.change and "done" be v.body?
76 instances of VBN.*xcomp, most of which are this construction. (This doesn't count resultative PPs: "I got her on the phone".)
These might qualify as LVC.cause under the PARSEME 1.1 guidelines, though it's such a productive construction that I'd be reluctant to call these MWEs.