nert-nlp / streusle
STREUSLE: a corpus with comprehensive lexical semantic annotation (multiword expressions, supersenses)
License: Creative Commons Attribution Share Alike 4.0 International
In our annotation for the LREC paper @ablodge and I disagreed on some tokens. The guidelines have since been revised. We should go through the disagreements and adjudicate them.
https://github.com/nert-gu/streusle/blob/4079879416f066e5faa1887682dc888c56d861cc/munge.py#L29
This explains why there is a token tagged 'Reciprocation', which is not a v2 label.
Use this thread to make note of interesting constructions where a words-with-spaces MWE analysis is unsatisfying because there is constrained productivity in certain parts of the expression.
Searching for `$` makes it easy to find idioms with an opaque possessive slot (e.g. "quick on X's feet").
These are tokens with a null-valued function or role. We should figure out what to do with these on the Xposition site.
reviews-008635-0002 with
reviews-010820-0003 in
reviews-010820-0011 on
reviews-017235-0005 for
reviews-021370-0001 for
reviews-026641-0003 with
reviews-029870-0002 for
reviews-030430-0002 from
reviews-034320-0003 to
reviews-035726-0002 in
reviews-039383-0004 on
reviews-042416-0005 to
reviews-045753-0001 in
reviews-053248-0017 my
reviews-081796-0005 for
reviews-081934-0003 since
reviews-081934-0003 to
reviews-081934-0004 as
reviews-081934-0006 to
reviews-088954-0001 of
reviews-093655-0007 as
reviews-121651-0003 on
reviews-158740-0002 of
reviews-160073-0001 on
reviews-163250-0004 for
reviews-192713-0008 like
reviews-193257-0003 to
reviews-207629-0003 for
reviews-211797-0004 like
reviews-217359-0006 for
reviews-225632-0005 to
reviews-228731-0001 with
reviews-228944-0003 on
reviews-311138-0003 to
reviews-323449-0007 to
reviews-326649-0006 like
reviews-329692-0011 on
reviews-332068-0002 for
reviews-333672-0006 as
reviews-336049-0004 out
reviews-339176-0006 with
reviews-348247-0006 for
reviews-372665-0004 with
reviews-376503-0004 with
reviews-377347-0005 like
reviews-382257-0002 in
reviews-391012-0006 through
conllulex2json.py warns about these. Some others may be reflected in !!@ tokens (#15).
"I've been here [3 to 4] times" (325292.2)
"to" currently has gov "4" and obj null, but should have gov "3" and obj "4".
related to #38
LEXLEMMA should be empty for non-initial tokens of a strong MWE. But this is not the case for 2 tokens ("appointment", "in").
There is also 1 token that is not part of a weak MWE yet has a WLEMMA ("all").
Probably a trivial question, but when would you use .goldid or .autoid in the predictions file for psseval.py? What do they each mean? Thanks!
"as soon as" currently has "soon" as its obj, but since it is a strong MWE, its obj should be the obj of the second "as".
reviews-338429-0004
We annotate MWEs where there are superfluous authorial spaces ("miss informed", "mean time"), but the lemma retains the space. The UD relation goeswith should be exploited to delete the space. We should also consider enforcing consistency between the MWE annotation and goeswith; right now there are goeswith annotations without corresponding MWE annotations.
Can this line be removed?
https://github.com/nert-gu/streusle/blob/951a25933d43581982471f6afb4e02e1f5b6c5d2/munge.py#L2
I think the UD_English repository may contain some recent syntactic fixes (lemmas, tags, trees) that have not been incorporated into streusle.conllulex. Need a script to take the not-to-release/sources/reviews/*.conllu files and streusle.conllulex and simply replace the first 10 columns of the latter with the former.
After running the script, be sure to examine the diff to ensure there weren't local fixes that got clobbered. Some may be due to outstanding pull requests: https://github.com/UniversalDependencies/UD_English/pulls
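A minimal sketch of the column-replacement step, assuming the .conllu sources and streusle.conllulex present the same sentences and tokens in the same order (the function name and the lines-in/lines-out interface are illustrative):

```python
def merge_ud_columns(conllu_lines, conllulex_lines):
    """Replace columns 1-10 of each token line in conllulex_lines with
    the corresponding columns from conllu_lines, keeping columns 11+."""
    out = []
    conllu_iter = iter(conllu_lines)
    for lex_line in conllulex_lines:
        # comment and blank lines pass through unchanged
        if not lex_line or lex_line.startswith('#'):
            out.append(lex_line)
            continue
        # advance the .conllu side to its next token line
        ud_line = next(conllu_iter)
        while not ud_line or ud_line.startswith('#'):
            ud_line = next(conllu_iter)
        ud_cols = ud_line.split('\t')[:10]
        lex_cols = lex_line.split('\t')
        # word forms should normally match; a mismatch signals misalignment
        # (or an upstream form fix worth inspecting in the diff)
        assert ud_cols[1] == lex_cols[1], 'token mismatch: ' + ud_cols[1]
        out.append('\t'.join(ud_cols + lex_cols[10:]))
    return out
```

The assert doubles as a coarse guard against clobbering local fixes: any realignment or form change surfaces immediately instead of silently corrupting the merged file.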
Hi!
I'd like to build a system to predict each token's lextag. I think the evaluation script for this is streusleval.py?
If so, it doesn't seem like it's part of the latest release? Also, is the data the same between 4.1 and the master ref? Not sure what the release cycle looks like for STREUSLE, but could be nice to have a minor release with all the improvements since last July :)
Should probably be so_that when used as SCONJ
For adposition coordination like "up and down the street", results of this query have the second preposition attached as conj to the object rather than to the first preposition.
Mentioned in UniversalDependencies/UD_English-EWT#49
govobj.py can be improved to deal with various syntactic edge cases, some of which result in the undesirable property of listing a SNACS-tagged unit as governed by another SNACS-tagged unit:
- AS1(gov=much, obj=null), AS2(gov=as, obj=tell). Instead, do AS1:p.Cost~>p.Extent(gov=pay, obj=much), AS2:p.ComparisonRef(gov=much, obj=tell).
- advmod(there, out), "back home".
Note that a rare legitimate case of a SNACS expression governing another SNACS expression is in a predicate complement: "they were OUT(g) FROM surgery", "I was IN(g) two weeks AGO".
@ablodge Thanks for uploading the files. I've moved them to the prepare-4.0 branch. Can you also add the spreadsheets that the script uses (since you changed the column names)?
BTW I've changed your script so that the input is streusle_v3.sst and the output is streusle_v4.sst. (v1 of preposition supersenses corresponds to v3 of the STREUSLE corpus.)
In some sentences all strong MWEs are numbered before weak ones; in others the numbering is by token offset.
This does not matter for the semantics, but it means that equivalent files will be superficially different. So perhaps we should enforce a normal form for numbering MWEs.
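A sketch of one possible normal form, numbering groups by the offset of their first token. It assumes group numbers map to token-offset lists, as in the smwes/wmwes objects of the JSON format; the function name is illustrative:

```python
def renumber_mwes(groups):
    """groups: dict of MWE group number -> list of 1-based token offsets.
    Returns an equivalent dict renumbered 1..N by first-token offset."""
    # order group numbers by the position of each group's earliest token
    order = sorted(groups, key=lambda g: min(groups[g]))
    mapping = {old: new for new, old in enumerate(order, start=1)}
    return {mapping[g]: sorted(toks) for g, toks in groups.items()}
```

Applying this when serializing would make equivalent analyses byte-identical, so diffs show only real annotation changes.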
In the script for #41:
Lines 53 to 62 in 09014b4
Lines 124 to 129 in 09014b4
Should be the_best (when predicative) and the_best_of_the_best? Lexcat=ADJ?
Re: #40, we need a script that takes lextags (full tags, one per token) output by a system and parses them to extract MWE groupings.
Lextags are the 19th and final column in the .conllulex format. Columns 1-10 are UD. Columns 11-18 can be filled in based on UD+lextags.
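A sketch of the MWE-grouping half of that script, reading only the positional part of each lextag (the part before the first hyphen). The tag inventory assumed here is O/B/I_/I~ plus lowercase o/b/i_/i~ for tokens inside another MWE's gap; weak (~) links are not chained into weak groups in this sketch:

```python
def strong_groups(mwe_tags):
    """mwe_tags: per-token MWE positional tags, e.g. ['O','B','I_','o','I_'].
    Returns strong MWE groups as lists of 1-based token offsets."""
    groups = []
    current = {0: None, 1: None}  # one in-progress unit per gap level
    for i, tag in enumerate(mwe_tags, start=1):
        level = 1 if tag[0].islower() else 0  # lowercase = inside a gap
        t = tag.upper()
        if t in ('B', 'I~'):   # B starts an MWE; I~ starts a new strong unit
            current[level] = [i]
            groups.append(current[level])
        elif t == 'I_':        # strong continuation of the current unit
            current[level].append(i)
        else:                  # 'O': not part of any MWE
            current[level] = None
    # length-1 units are single tokens weakly linked to others, not strong MWEs
    return [g for g in groups if len(g) > 1]
```

Columns 11-18 could then be filled by pairing these groups with the lexcat/supersense parts of the same lextags.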
rather than sorting the sentence IDs lexicographically. This will make it easier to do a diff between the two versions.
When creating the CSV files I neglected to realize that my spreadsheet editor was exporting hidden rows and columns.
(and maybe the number of gaps, strong vs. weak gaps, number of tokens in gaps, etc.)
67051:5 4 4 SCONJ IN _ 6 mark 6:mark _ _ P 4 p.Explanation p.Explanation _ _ _ O-P-p.Explanation
Ken Litkowski noticed that PP is erroneously the lexcat for in hope to, just about, and nothing but, which should be P. This is because the UPOS of the last word in the MWE is PART, ADV, or CCONJ. Under the current heuristics in lexcatter.py, an MWE is treated as P only if the last word is tagged as ADP or SCONJ.
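A hypothetical patch, in the spirit of the lexcatter.py heuristic rather than its actual code: fall back on the annotated supersense when the last word's UPOS is uninformative, so units like in hope to still come out as P:

```python
def is_prepositional_mwe(last_upos, supersense):
    """last_upos: UPOS of the MWE's final word; supersense: the unit's
    annotated supersense label (or None). Interface is illustrative."""
    # current heuristic: last word tagged as an adposition/subordinator
    if last_upos in ('ADP', 'SCONJ'):
        return True
    # fallback: the unit carries a SNACS (p.*) supersense despite its POS
    return supersense is not None and supersense.startswith('p.')
```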
For revising certain classes of annotations (e.g., P supersenses where the scene role is Manner) it would be useful to have a concordance view. This would put a token's context on the same line for easy sorting and batch editing, giving a more human-readable view of the lexical annotation.
Does tquery.py already do this? Should it be run when building the release to produce a row for every supersense-annotated strong lexical expression, within the train/dev/test subdirectories? This would make diffs in commit history easier to read. (Not having this in the root directory would make it clear that .conllulex is the canonical data file.)
There would need to be a script to apply supersense edits made in the concordance view back to the original. untquery.py? tupdate.py?
Is there a natural way to specify MWE edits in the concordance view, also? Currently, adding an MWE or changing the strength of an existing MWE is painful to implement by hand in .conllulex.
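A sketch of what one concordance row could look like; the column layout is invented, and toknums are the 1-based offsets of the expression's tokens:

```python
def concordance_row(sent_id, tokens, toknums, role, function, width=5):
    """One tab-separated line per supersense-annotated expression, with
    left/right context on the same line for sorting and batch editing."""
    first, last = min(toknums) - 1, max(toknums)
    left = ' '.join(tokens[max(0, first - width):first])
    expr = ' '.join(tokens[i - 1] for i in sorted(toknums))
    right = ' '.join(tokens[last:last + width])
    return '\t'.join([sent_id, left, expr, right, role, function])
```

Sorting such rows by the expression or supersense columns would cluster comparable usages for review.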
The version of STREUSLE in Xposition contains some annotator notes on P tokens that are not included in the official release. The notes can help clarify the interpretation of the text, provide the annotator's rationale, or help cluster different usages at a finer level of granularity than the supersenses.
Should the .conllulex format have a place for these? An extra column? Or maybe a sentence header row, as they are rare?
Should there also be a standard for releasing rich annotation history metadata (such as who annotated which token, original vs. adjudicated annotations, timestamps, ...)?
21638:15 a a ADV RB _ 16 case 16:case _ 1:1 PP a least p.Approximator p.Approximator _ _ _ B-PP-p.Approximator
p.Possessor is only used as the function if the scene role is also p.Possessor. But there are a few exceptions which may be inconsistencies: p.Originator~>p.Gestalt. Some look like they should be p.Possessor because the governor is the entity, not the transfer event.
These are sometimes analyzed as MWEs, but the annotations are inconsistent.
Note that "not even if" occurs, which would be problematic if not_even and even_if are both treated as MWEs.
We have psseval.py for SNACS (preposition/possessive) supersenses, but we should have a script that does full evaluation of MWEs and all kinds of supersenses. Cf. https://github.com/dimsum16/dimsum-data/blob/master/scripts/dimsumeval.py
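The core of such a script could start from dimsumeval-style set comparison over MWE groupings; supersense scoring and partial credit for links and gaps are omitted in this sketch:

```python
def prf(gold_groups, pred_groups):
    """Exact-match precision/recall/F1 over MWE groupings, each grouping
    a list of token offsets. Order within a grouping is ignored."""
    gold = {tuple(sorted(g)) for g in gold_groups}
    pred = {tuple(sorted(g)) for g in pred_groups}
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 1.0
    r = tp / len(gold) if gold else 1.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```

A full evaluation would report this alongside link-based MWE scores and per-class supersense accuracy, as dimsumeval.py does.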
The following 30 construals are attested between 1 and 3 times in the data, but not documented in the current guidelines. Let's look at them to see which are worth documenting, which are borderline but worth keeping in the data, and which should be reannotated.
1 p.Agent p.Locus
3 p.Beneficiary p.Gestalt
1 p.Characteristic p.Manner
1 p.ComparisonRef p.Beneficiary
1 p.Cost p.Extent
1 p.Direction p.Goal
1 p.Experiencer p.Agent
1 p.Explanation p.Manner
1 p.Extent p.Whole
2 p.Gestalt p.Purpose
1 p.Gestalt p.Source
2 p.Gestalt p.Topic
1 p.Goal p.Whole
3 p.Instrument p.Manner
1 p.Instrument p.Theme
3 p.Manner p.Source
1 p.Manner p.Topic
1 p.Means p.Path
1 p.Originator p.Instrument
1 p.Possession p.PartPortion
3 p.Possession p.Theme
2 p.Purpose p.Goal
2 p.Purpose p.Locus
1 p.Purpose p.Theme
1 p.SocialRel p.Source
3 p.Stimulus p.Source
1 p.Stimulus p.Theme
1 p.Theme p.Accompanier
1 p.Theme p.Characteristic
2 p.Time p.Extent
This is used as a stopgap lexcat for non-prepositional tokens that need to be revised, typically ones that need an N or V supersense.
Currently we can convert from .conllulex to .json, but not the reverse.
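A sketch of the reverse direction at the token level. The JSON field names and the 19-column order here are my best understanding of the format and should be checked against the CONLLULEX spec before use:

```python
def tok_to_conllulex(tok):
    """Render one token dict (as produced by conllulex2json.py, field
    names assumed) back into a 19-column .conllulex line."""
    def c(key):
        v = tok.get(key)
        return str(v) if v not in (None, '') else '_'
    # smwe/wmwe assumed to be [group, position] pairs in the JSON
    smwe = '%d:%d' % tuple(tok['smwe']) if tok.get('smwe') else '_'
    wmwe = '%d:%d' % tuple(tok['wmwe']) if tok.get('wmwe') else '_'
    return '\t'.join([c('#'), c('word'), c('lemma'), c('upos'), c('xpos'),
                      c('feats'), c('head'), c('deprel'), c('edeps'), c('misc'),
                      smwe, c('lexcat'), c('lexlemma'), c('ss'), c('ss2'),
                      wmwe, c('wlemma'), c('wcat'), c('lextag')])
```

A round-trip test (.conllulex -> .json -> .conllulex) would be the natural way to validate the full converter.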
Mentioned in #41 (comment)
https://github.com/UniversalDependencies/UD_English-EWT master branch
he is AS(g=he, c=pred, o=tall) tall AS(g=tall, o=house) a horse
he is AS(g=he, c=pred, o=tall) tall AS(g=tall, o=wide, c=subord) a horse is wide
Currently, each sentence is rendered separately with MWEs and supersenses, and color is added post hoc to annotations based on a regex.
Given the gold and predicted sentences below:
No more having_to drive|v.motion to|p.Goal San_Francisco|n.LOCATION for|p.Purpose a great mani_pedi|n.ACT .
No more having to drive to|p.Goal San Francisco for|p.Purpose a great mani pedi .
It would be nice to:
- highlight where the prediction was incorrect (maybe with a red background and white text for a missing or extra label and red text for an incorrect label, or maybe just by making the word red if either the MWE analysis or the supersense was incorrect)
- align the tokens, i.e.
No more having_to drive|v.motion to|p.Goal San_Francisco|n.LOCATION for|p.Purpose a great mani_pedi|n.ACT .
No more having to drive to|p.Goal San_Francisco for|p.Purpose a great mani pedi .
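A sketch of the token alignment, assuming gold and pred differ only in '_'-joined MWEs and trailing |label annotations, so both lines expand to the same underlying token sequence:

```python
def render_aligned(gold_line, pred_line):
    """Pad both lines so pieces covering the same underlying token start
    at the same column. MWE members stay joined by '_'."""
    def pieces(units):
        out = []  # (rendered text, is-last-piece-of-unit) per underlying token
        for unit in units:
            word, sep, label = unit.partition('|')
            parts = word.split('_')
            for j, part in enumerate(parts):
                last = j == len(parts) - 1
                # the |label, if any, attaches to the unit's final piece
                out.append((part + sep + label if last else part, last))
        return out
    g = pieces(gold_line.split())
    p = pieces(pred_line.split())
    assert len(g) == len(p), 'underlying token sequences differ in length'
    gl = pl = ''
    for (gt, gend), (pt, pend) in zip(g, p):
        w = max(len(gt), len(pt))
        gl += gt.ljust(w) + (' ' if gend else '_')
        pl += pt.ljust(w) + (' ' if pend else '_')
    return gl.rstrip(), pl.rstrip()
```

Mismatch highlighting could then compare the two rendered pieces column by column.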
egrep -v '^$' streusle.conllulex | egrep -v '^#' | cut -f13 | egrep '^be ' | sort | uniq -c
1 be a big baby
3 be a joke
1 be a nice touch
1 be a no brainer
1 be a pain
1 be a plus
1 be first call
1 be happy camper
1 be in
2 be in for a treat
2 be in hand
1 be inclined
1 be make to
1 be no more
1 be out of this world
1 be rude
1 be say and do
8 be suppose to
1 be sure
1 be sure to
1 be there
1 be there / do that
1 be through
16 be to
1 be up
"they seemed more interested~in helping me find the right car rather_then just make_ a _sale" (245160.4)
"rather then" has gov "make" and obj null, but in fact "make" should be the obj (and maybe "helping" should be the gov).
In UD, "rather" is a cc of "make", so I can see where this comes from, but it would be nice to have govobj.py handle this.
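A hypothetical post-processing rule in govobj.py terms: when a P-tagged unit is attached as cc, take the conjunct it marks as the obj and that conjunct's own governor as the gov:

```python
def fix_cc_attachment(tok, heads, deprels):
    """tok: token offset of the P unit; heads/deprels: maps from token
    offset to UD head offset / deprel (interface is illustrative)."""
    if deprels.get(tok) == 'cc':
        obj = heads[tok]   # the conjunct that the cc marks ("make")
        gov = heads[obj]   # that conjunct's own governor ("helping")
        return gov, obj
    return None
```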
Probably should be Gestalt, not Possessor
We need a way to get these variables from the json format of streusle.
E.g. "get/have something fixed", "get my hair done". For "get my hair done", should "get" be v.change and "done" be v.body?
76 instances of VBN.*xcomp, most of which are this construction. (This doesn't count resultative PPs: "I got her on the phone".)
These might qualify as LVC.cause under the PARSEME 1.1 guidelines, though it's such a productive construction that I'd be reluctant to call these MWEs.