streusle's People

Contributors

ablodge, csome, danielhers, jakpra, lgessler, nschneid, ryanamannion


streusle's Issues

Double-check s-genitives

In our annotation for the LREC paper @ablodge and I disagreed on some tokens. The guidelines have since been revised. We should go through the disagreements and adjudicate them.

Constructions in STREUSLE

Use this thread to make note of interesting constructions where a words-with-spaces MWE analysis is unsatisfying because there is constrained productivity in certain parts of the expression.

Searching for `$ makes it easy to find idioms with an opaque possessive slot (e.g. "quick on X's feet").

Unannotated Tokens

These are tokens with a null-valued function or role. We should figure out what to do with them on the Xposition site. (A sketch for listing them follows the token list below.)

reviews-008635-0002 with
reviews-010820-0003 in
reviews-010820-0011 on
reviews-017235-0005 for
reviews-021370-0001 for
reviews-026641-0003 with
reviews-029870-0002 for
reviews-030430-0002 from
reviews-034320-0003 to
reviews-035726-0002 in
reviews-039383-0004 on
reviews-042416-0005 to
reviews-045753-0001 in
reviews-053248-0017 my
reviews-081796-0005 for
reviews-081934-0003 since
reviews-081934-0003 to
reviews-081934-0004 as
reviews-081934-0006 to
reviews-088954-0001 of
reviews-093655-0007 as
reviews-121651-0003 on
reviews-158740-0002 of
reviews-160073-0001 on
reviews-163250-0004 for
reviews-192713-0008 like
reviews-193257-0003 to
reviews-207629-0003 for
reviews-211797-0004 like
reviews-217359-0006 for
reviews-225632-0005 to
reviews-228731-0001 with
reviews-228944-0003 on
reviews-311138-0003 to
reviews-323449-0007 to
reviews-326649-0006 like
reviews-329692-0011 on
reviews-332068-0002 for
reviews-333672-0006 as
reviews-336049-0004 out
reviews-339176-0006 with
reviews-348247-0006 for
reviews-372665-0004 with
reviews-376503-0004 with
reviews-377347-0005 like
reviews-382257-0002 in
reviews-391012-0006 through
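
For reference, a minimal sketch of how these could be enumerated from streusle.json, under the assumption that the JSON produced by conllulex2json.py stores the scene role and function as "ss" and "ss2" on each single-word and strong multiword expression; the set of prepositional lexcats below is also an assumption.

import json

P_LEXCATS = {'P', 'PP', 'INF.P', 'POSS', 'PRON.POSS'}

with open('streusle.json', encoding='utf-8') as f:
    sents = json.load(f)

for sent in sents:
    for group in ('swes', 'smwes'):
        for expr in sent[group].values():
            # print expressions whose scene role or function is missing
            if expr['lexcat'] in P_LEXCATS and (expr['ss'] is None or expr['ss2'] is None):
                print(sent['sent_id'], expr['lexlemma'])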

govobj of "to" in "3 to 4"

"I've been here [3 to 4] times" ( 325292.2)

"to" currently has gov "4" and obj null, but should have gov "3" and obj "4".

related to #38

govobj for AS-AS SMWEs

"as soon as" currently has "soon" as its obj, but since it is a strong MWE, its obj should be the obj of the second "as".

MWE lemma heuristics: use goeswith

We annotate MWEs where there are superfluous authorial spaces ("miss informed", "mean time"), but the lemma retains the space. The UD relation goeswith should be exploited to delete the space. We should also consider enforcing consistency between the MWE annotation and goeswith; right now there are goeswith annotations without corresponding MWE annotations.
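
A minimal sketch of the lemma fix, assuming token dicts with 'lemma', 'head', and 'deprel' fields and 1-based MWE token numbers, as in the streusle.json layout; the function name is illustrative, not the repo's API.

def mwe_lexlemma(sent, toknums):
    """Join the lemmas of an MWE's tokens, but glue any token that attaches to
    the immediately preceding token via the UD relation goeswith onto that
    token's lemma instead of inserting a space."""
    pieces = []
    for n in toknums:
        tok = sent['toks'][n - 1]
        if pieces and tok['deprel'] == 'goeswith' and tok['head'] == n - 1:
            pieces[-1] += tok['lemma']  # delete the superfluous authorial space
        else:
            pieces.append(tok['lemma'])
    return ' '.join(pieces)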

Script to sync .conllu portion of .conllulex with UD_English dev branch

I think the UD_English repository may contain some recent syntactic fixes (lemmas, tags, trees) that have not been incorporated into streusle.conllulex. Need a script to take the not-to-release/sources/reviews/*.conllu files and streusle.conllulex and simply replace the first 10 columns of the latter with the former.

After running the script, be sure to examine the diff to ensure there weren't local fixes that got clobbered. Some may be due to outstanding pull requests: https://github.com/UniversalDependencies/UD_English/pulls
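
A sketch of such a script, under the assumptions that the source files and streusle.conllulex contain the same sentences in the same order with identical tokenization, and that sorted filename order matches the .conllulex order (both would need verifying):

import glob

# Overwrite the CoNLL-U portion (columns 1-10) of streusle.conllulex with the
# corresponding token lines from the UD_English source .conllu files, keeping
# the lexical-semantic columns (11-19) untouched.

def conllu_token_lines(paths):
    """Yield token lines only, skipping comments and blank lines."""
    for path in sorted(paths):
        with open(path, encoding='utf-8') as f:
            for line in f:
                line = line.rstrip('\n')
                if line and not line.startswith('#'):
                    yield line

ud = conllu_token_lines(glob.glob('not-to-release/sources/reviews/*.conllu'))

with open('streusle.conllulex', encoding='utf-8') as fin, \
        open('streusle.conllulex.synced', 'w', encoding='utf-8') as fout:
    for line in fin:
        line = line.rstrip('\n')
        if not line or line.startswith('#'):
            print(line, file=fout)  # keep .conllulex comments and blank lines
            continue
        cols = line.split('\t')
        udcols = next(ud).split('\t')
        assert cols[0] == udcols[0], (cols[0], udcols[0])  # token IDs must line up
        print('\t'.join(udcols[:10] + cols[10:]), file=fout)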

Evaluating lextag tagging performance

Hi!

I'd like to build a system to predict each token's lextag. I think the evaluation script for this is streusleval.py?

If so, it doesn't seem to be part of the latest release? Also, is the data the same between 4.1 and the master ref? Not sure what the release cycle looks like for STREUSLE, but it could be nice to have a minor release with all the improvements since last July :)

so that

Should probably be so_that when used as SCONJ

Govobj extraction: edge cases involving coordination, approximators, directional particle + PP combinations, etc.

govobj.py can be improved to deal with various syntactic edge cases, some of which result in the undesirable property of listing a SNACS-tagged unit as governed by another SNACS-tagged unit:

  1. Currently, governors of coordinated Ps or PPs are misleading. Better to use Enhanced Dependencies to get the more semantic governor.
  2. Approximators ("ABOUT/AROUND/LESS_THAN/OVER 5 minutes", "BETWEEN 5 and 10 minutes") are treated as having a governor but no object. Though these constructions are syntactically weird, because these prepositions are rarely intransitive in other contexts, better to treat these as having an object but no governor.
  3. Comparative AS-AS: In UD, the second AS-phrase is treated as a dependent of the first AS. Thus, "pay twice AS1 much AS2 they tell you" is currently analyzed as AS1(gov=much, obj=null), AS2(gov=as, obj=tell). Instead, do AS1:p.Cost~>p.Extent(gov=pay, obj=much), AS2:p.ComparisonRef(gov=much, obj=tell).
  4. In some cases, a directional adverb/particle (AWAY/BACK/DOWN/HOME/OUT/OVER) is treated in UD as having a PP complement even though the adverb can be omitted (UniversalDependencies/docs#570): "got BACK FROM france", "OVER BY 16th and 15th". In extracting governors, better to treat these as sisters.
  5. In some cases, the governor is not being retrieved correctly for SNACS expressions that are syntactically analyzed as adverb-modifying-adverb, e.g. "out there" advmod(there, out), "back home".
  6. AGO is analyzed as a postposition in SNACS, but an adverb in UD, hence there needs to be a special rule to extract the object.

Note that a rare legitimate case of a SNACS expression governing another SNACS expression is in a predicate complement: "they were OUT(g) FROM surgery", "I was IN(g) two weeks AGO".
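
As a starting point for auditing these cases, here is a sketch of a diagnostic that flags SNACS expressions whose extracted governor falls inside another SNACS-tagged unit. It assumes the streusle.json layout where govobj.py adds a "heuristic_relation" dict with a "gov" token number (field names should be checked against the current output); the legitimate predicate-complement cases noted above will also be flagged and need to be filtered by hand.

import json

with open('streusle.json', encoding='utf-8') as f:
    sents = json.load(f)

for sent in sents:
    exprs = [e for grp in ('swes', 'smwes') for e in sent[grp].values()]
    # token numbers covered by any expression bearing an adposition supersense
    snacs_toks = {n for e in exprs
                  if (e.get('ss') or '').startswith('p.')
                  for n in e['toknums']}
    for e in exprs:
        rel = e.get('heuristic_relation') or {}
        if (e.get('ss') or '').startswith('p.') and rel.get('gov') in snacs_toks:
            print(sent['sent_id'], e['lexlemma'], 'governed by another SNACS unit')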

Add tsv files to prepare-4.0 branch

@ablodge Thanks for uploading the files. I've moved them to the prepare-4.0 branch. Can you also add the spreadsheets that the script uses (since you changed the column names)?

BTW I've changed your script so that the input is streusle_v3.sst and the output is streusle_v4.sst. (v1 of preposition supersenses corresponds to v3 of the STREUSLE corpus.)

MWE numbering within sentence is inconsistent

In some sentences all strong MWEs are numbered before weak ones; in others the numbering is by token offset.

This does not matter for the semantics, but it means that equivalent files will be superficially different. So perhaps we should enforce a normal form for numbering MWEs.

In the script for #41:

# Note that numbering of strong+weak MWEs doesn't follow a consistent order in the data!
# Ordering by first token offset (tiebreaker to strong MWE):
#xgroups = [(min(sg),'s',sg) for sg in sgroups] + [(min(wg),'w',wg) for wg in wgroups]
# Putting all strong expressions before any weak expressions:
xgroups = [(None,'s',sg) for sg in sgroups] + [(None,'w',wg) for wg in wgroups]
# This means that the MWE columns are not *completely* determined by
# the lextag in a way that matches the original data, but different MWE
# orders does not matter semantically.
# See also check in _postproc_sent(), which ensures that the MWE numbers
# count from 1, but does not mandate an order.

From streusle/UDlextag2json.py, lines 124 to 129 at commit 09014b4:

# check that MWEs are numbered from 1
# fix_mwe_numbering.py was written to correct this
# However, this does NOT require a particular sort order of the MWEs in the sentence.
# It just requires that they have unique numbers 1, ..., N if there are N MWEs.
for i,(k,mwe) in enumerate(sorted(chain(sent['smwes'].items(), sent['wmwes'].items()), key=lambda x: int(x[0])), 1):
assert int(k)==i,(sent['sent_id'],i,k,mwe)
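
A sketch of one possible normal form (order by first token offset, strong before weak on ties), assuming the streusle.json layout with 'toknums' lists and string keys for the MWE numbers; applying the mapping back to the per-token MWE references and the .conllulex columns is left out here.

def mwe_renumbering(sent):
    """Return {old_mwe_number: new_mwe_number}, ordering all strong and weak
    MWEs in the sentence by first token offset, with strong MWEs winning ties.
    Strong and weak MWEs share a single numbering space per sentence."""
    keyed = ([(min(m['toknums']), 0, old) for old, m in sent['smwes'].items()] +
             [(min(m['toknums']), 1, old) for old, m in sent['wmwes'].items()])
    return {old: new for new, (_offset, _strength, old) in enumerate(sorted(keyed), 1)}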

Lexcat heuristics: PP false positives

Ken Litkowski noticed that PP is erroneously the lexcat for "in hope to", "just about", and "nothing but", which should be P. This is because the UPOS of the last word in the MWE is PART, ADV, or CCONJ. Under the current heuristics in lexcatter.py, an MWE is treated as P only if the last word is tagged as ADP or SCONJ.
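
One possible adjustment, shown only as a sketch (not lexcatter.py's actual code): fall back to P when a multiword expression bears an adposition supersense even though its last word is not ADP or SCONJ. Whether genuine PP idioms could also match this condition would need to be checked first.

def mwe_p_or_pp(upos_tags, scene_role):
    """Sketch: decide between the P and PP lexcats for an idiomatic MWE, given
    the UPOS tags of its words and its scene-role supersense (or None).
    Only this branch of the heuristic is shown."""
    if upos_tags[-1] in ('ADP', 'SCONJ'):
        return 'P'
    # e.g. "in hope to" (last word PART), "just about" (ADV), "nothing but" (CCONJ)
    if scene_role is not None and scene_role.startswith('p.'):
        return 'P'
    return 'PP'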

User-friendly concordance format and token update script

For revising certain classes of annotations (e.g., P supersenses where the scene role is Manner) it would be useful to have a concordance view. This would put a token's context on the same line for easy sorting and batch editing. So it would be a more human-readable view of the lexical annotation.

Does tquery.py already do this? Should it be run when building the release to produce a row for every supersense-annotated strong lexical expression, within the train/dev/test subdirectories? This would make diffs in commit history easier to read. (Not having this in the root directory would make it clear that .conllulex is the canonical data file.)

There would need to be a script to apply supersense edits made in the concordance view back to the original. untquery.py? tupdate.py?

Is there a natural way to specify MWE edits in the concordance view, also? Currently, adding an MWE or changing the strength of an existing MWE is painful to implement by hand in .conllulex.
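
A rough sketch of what producing such rows could look like, assuming the streusle.json field names from conllulex2json.py; the column layout is illustrative, not a spec for tquery.py or a hypothetical tupdate.py.

import json

with open('streusle.json', encoding='utf-8') as f:
    sents = json.load(f)

for sent in sents:
    words = [t['word'] for t in sent['toks']]
    for grp in ('swes', 'smwes'):
        for key, e in sent[grp].items():
            if not e.get('ss'):
                continue  # only supersense-annotated expressions
            first, last = min(e['toknums']), max(e['toknums'])
            row = [sent['sent_id'], key,
                   ' '.join(words[:first - 1])[-40:],   # left context
                   ' '.join(words[first - 1:last]),     # the expression (gappy MWEs shown with their gaps)
                   ' '.join(words[last:])[:40],         # right context
                   e['lexcat'], str(e['ss']), str(e['ss2'])]
            print('\t'.join(row))

Sorting the output by the expression or supersense columns with standard command-line tools then gives the concordance view; an update script would match edited rows back to the source by sent_id and expression number.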

Format extension: incorporating annotator notes?

The version of STREUSLE in Xposition contains some annotator notes on P tokens that are not included in the official release. The notes can help clarify the interpretation of the text, provide the annotator's rationale, or help cluster different usages at a finer level of granularity than the supersenses.

Should the .conllulex format have a place for these? An extra column? Or maybe a sentence header row, as they are rare?

Should there also be a standard for releasing rich annotation history metadata (such as who annotated which token, original vs. adjudicated annotations, timestamps, ...)?

Possession-related data review

  • In general, p.Possessor is only used as function if the scene role is also p.Possessor. But there are a few exceptions which may be inconsistencies.
  • Review p.Originator~>p.Gestalt. Some look like they should be p.Possessor because the governor is the entity, not the transfer event.

even if, even though, not even

These are sometimes analyzed as MWEs, but the annotations are inconsistent.

Note that "not even if" occurs, which would be problematic if not_even and even_if are both treated as MWEs.

Attested but undocumented construals

The following 30 construals are attested between 1 and 3 times in the data, but not documented in the current guidelines. Let's look at them to see which are worth documenting, which are borderline but worth keeping in the data, and which should be reannotated.

   1 p.Agent	p.Locus
   3 p.Beneficiary	p.Gestalt
   1 p.Characteristic	p.Manner
   1 p.ComparisonRef	p.Beneficiary
   1 p.Cost	p.Extent
   1 p.Direction	p.Goal
   1 p.Experiencer	p.Agent
   1 p.Explanation	p.Manner
   1 p.Extent	p.Whole
   2 p.Gestalt	p.Purpose
   1 p.Gestalt	p.Source
   2 p.Gestalt	p.Topic
   1 p.Goal	p.Whole
   3 p.Instrument	p.Manner
   1 p.Instrument	p.Theme
   3 p.Manner	p.Source
   1 p.Manner	p.Topic
   1 p.Means	p.Path
   1 p.Originator	p.Instrument
   1 p.Possession	p.PartPortion
   3 p.Possession	p.Theme
   2 p.Purpose	p.Goal
   2 p.Purpose	p.Locus
   1 p.Purpose	p.Theme
   1 p.SocialRel	p.Source
   3 p.Stimulus	p.Source
   1 p.Stimulus	p.Theme
   1 p.Theme	p.Accompanier
   1 p.Theme	p.Characteristic
   2 p.Time	p.Extent

Address all !!@ and !@ tokens

These markers are used as a stopgap lexcat for non-prepositional tokens that need to be revised, typically ones that need an N or V supersense.
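
A quick way to list the remaining placeholders, assuming the 19-column .conllulex layout with LEXCAT in column 12:

# Print sentence ID, word form, and placeholder lexcat for every token still
# tagged !!@ or !@ in streusle.conllulex.
sent_id = None
with open('streusle.conllulex', encoding='utf-8') as f:
    for line in f:
        line = line.rstrip('\n')
        if line.startswith('# sent_id'):
            sent_id = line.split('=', 1)[1].strip()
        elif line and not line.startswith('#'):
            cols = line.split('\t')
            if cols[11] in ('!!@', '!@'):
                print(sent_id, cols[1], cols[11])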

streusvis.py: consider flagging errors, adding space to align tokens across sentences

Currently, each sentence is rendered separately with MWEs and supersenses, and color is added post hoc to annotations based on a regex.

Given the gold and predicted sentences below:

No more having_to drive|v.motion to|p.Goal San_Francisco|n.LOCATION for|p.Purpose a great mani_pedi|n.ACT .
No more having to drive to|p.Goal San Francisco for|p.Purpose a great mani pedi .

It would be nice to

  • highlight where the prediction was incorrect (maybe with a red background and white text for a missing or extra label and red text for an incorrect label, or maybe just by making the word red if either the MWE analysis or the supersense was incorrect)

  • align the tokens, i.e.

      No more having_to drive|v.motion to|p.Goal San_Francisco|n.LOCATION for|p.Purpose a great mani_pedi|n.ACT .
      No more having to drive          to|p.Goal San_Francisco            for|p.Purpose a great mani pedi       .
    

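A minimal sketch of the alignment step (error highlighting would follow from the same per-token comparison), assuming both renderings are first expanded to one string per underlying token, with each MWE's label attached to its first token and continuation tokens left empty; this is illustrative and not streusvis.py's interface.

def align_rows(*rows):
    """Pad each per-token column to a common width so that the gold and
    predicted renderings line up vertically. Every row must have the same
    number of entries (one per underlying token)."""
    widths = [max(len(row[i]) for row in rows) for i in range(len(rows[0]))]
    return ['  '.join(chunk.ljust(w) for chunk, w in zip(row, widths)).rstrip()
            for row in rows]

# Hypothetical per-token renderings for the example above:
gold = ['No', 'more', 'having_to', '', 'drive|v.motion', 'to|p.Goal',
        'San_Francisco|n.LOCATION', '', 'for|p.Purpose', 'a', 'great',
        'mani_pedi|n.ACT', '', '.']
pred = ['No', 'more', 'having', 'to', 'drive', 'to|p.Goal',
        'San', 'Francisco', 'for|p.Purpose', 'a', 'great',
        'mani', 'pedi', '.']
print('\n'.join(align_rows(gold, pred)))
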
Revisit predicative/copular MWEs

egrep -v '^$' streusle.conllulex | egrep -v '^#' | cut -f13 | egrep '^be ' | sort | uniq -c
   1 be a big baby
   3 be a joke
   1 be a nice touch
   1 be a no brainer
   1 be a pain
   1 be a plus
   1 be first call
   1 be happy camper
   1 be in
   2 be in for a treat
   2 be in hand
   1 be inclined
   1 be make to
   1 be no more
   1 be out of this world
   1 be rude
   1 be say and do
   8 be suppose to
   1 be sure
   1 be sure to
   1 be there
   1 be there / do that
   1 be through
  16 be to
   1 be up

govobj of "rather than"

"they seemed more interested~in helping me find the right car rather_then just make_ a _sale" (245160.4)

"rather then" has gov "make" and obj null, but in fact "make" should be the obj (and maybe "helping" should be the gov).
In UD, "rather" is a cc of "make", so I can see where this comes from, but it would be nice to have govobj.py handle this.
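
A hedged sketch of one way govobj.py's output could be post-corrected for this pattern; the streusle.json field names ('toks' with 'head'/'deprel', a 'heuristic_relation' dict with 'gov'/'govlemma'/'obj'/'objlemma') are assumptions here and would need checking against the actual data.

def fix_cc_attached_p(sent, expr):
    """If a P expression attaches as `cc` and was extracted with a governor but
    no object (as with "rather_then" above), reinterpret the old governor as
    the object and promote that word's own head to be the governor."""
    rel = expr.get('heuristic_relation') or {}
    first_tok = sent['toks'][expr['toknums'][0] - 1]
    if first_tok['deprel'] == 'cc' and rel.get('gov') and not rel.get('obj'):
        old_gov = sent['toks'][rel['gov'] - 1]
        rel['obj'], rel['objlemma'] = rel['gov'], old_gov['lemma']
        if old_gov['head']:  # 0 = root, i.e. no governor left
            rel['gov'] = old_gov['head']
            rel['govlemma'] = sent['toks'][old_gov['head'] - 1]['lemma']
        else:
            rel['gov'] = rel['govlemma'] = None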

Causative get/have: supersense

E.g. "get/have something fixed", "get my hair done". For "get my hair done", should "get" be v.change and "done" be v.body?

76 instances of VBN.*xcomp, most of which are this construction. (This doesn't count resultative PPs: "I got her on the phone".)

These might qualify as LVC.cause under the PARSEME 1.1 guidelines, though it's such a productive construction that I'd be reluctant to call these MWEs.
