For revising certain classes of annotations (e.g., P supersenses where the scene role

User-friendly concordance format and token update script

nert-nlp commented on June 30, 2024

User-friendly concordance format and token update script

Comments (4)

nschneid commented on June 30, 2024

Need to determine whether tquery.py can print token IDs (for updateability). Rather than displaying the full sentence as a string with the target token highlighted, maybe display the left and right context in separate columns, and cap their length. Consider showing MWE markup in the context.
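A minimal sketch of the column layout suggested above, not tquery.py's actual output: the function name `concordance_row`, the 1-based offset convention, and the `width` parameter are all assumptions made for illustration. It emits the sentence ID and token offset as separate columns so a row can later be matched back to its token, and caps both context columns.

```python
def concordance_row(sent_id, toks, offset, width=30):
    """Return (sent_id, offset, left, target, right) for one target token.

    Assumes `offset` is 1-based and `width` is greater than 3.
    """
    i = offset - 1
    left = " ".join(toks[:i])
    right = " ".join(toks[i + 1:])
    # Keep the END of the left context and the START of the right context,
    # since those are the parts adjacent to the target token.
    if len(left) > width:
        left = "..." + left[-(width - 3):]
    if len(right) > width:
        right = right[:width - 3] + "..."
    return (sent_id, str(offset), left, toks[i], right)


toks = "My 8 year old daughter loves this place .".split()
row = concordance_row("sent-001", toks, 5, width=12)
print("\t".join(row))
```

Showing MWE markup in the context would mean joining rendered tokens instead of raw ones; the truncation logic stays the same.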

nschneid commented on June 30, 2024

Option to show full tagging in context, like in streusvis.py?

nschneid commented on June 30, 2024

Spec for tupdate.py:

INPUT: streusle.json edited_tquery_output_tsv

- Make sure the 2 header rows are present in edited_tquery_output_tsv.
  - The first header row of edited_tquery_output_tsv contains the commit hash. Warn if that does not match the current git commit hash (this could indicate that the data has been modified since tquery.py was run).
  - The second header row of edited_tquery_output_tsv specifies the column headers. Ensure _sentid and _tokoffset are present, and at least one of {ss, ss2, lexcat}.
- Check for edits to prohibited fields.
  - For each row and field: compare against the JSON (or original tquery output file?) to see if changes have been made. If the field is anything other than ss, ss2, or lexcat, throw an error.
- Implement token edits to ss, ss2, and lexcat by updating the JSON data structure. Do not validate these fields as other scripts will do that. Regenerate lextag accordingly.
- Print an updated JSON (to be converted back to conllulex).
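The checks in this spec could be sketched roughly as follows. This is illustrative only and not the real tupdate.py: the function names are hypothetical, the first header row is assumed to contain the commit hash somewhere in its text, and a flat dict keyed by (_sentid, _tokoffset) stands in for the STREUSLE JSON, whose actual structure is not reproduced here. Regenerating lextag is left out.

```python
EDITABLE = {"ss", "ss2", "lexcat"}
REQUIRED = {"_sentid", "_tokoffset"}


def parse_edited_tsv(tsv_text, current_commit):
    """Validate the two header rows; return (warnings, rows)."""
    lines = tsv_text.rstrip("\n").split("\n")
    if len(lines) < 2:
        raise ValueError("expected 2 header rows")
    warnings = []
    # Assumption: the commit hash appears verbatim in the first header row.
    if current_commit not in lines[0]:
        warnings.append("commit hash mismatch: data may have changed "
                        "since tquery.py was run")
    header = lines[1].split("\t")
    missing = REQUIRED - set(header)
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")
    if not EDITABLE & set(header):
        raise ValueError("need at least one of ss, ss2, lexcat")
    rows = [dict(zip(header, ln.split("\t"))) for ln in lines[2:] if ln]
    return warnings, rows


def apply_edits(rows, tokens):
    """Apply edits in place; `tokens` maps (sentid, tokoffset) -> field dict.

    Assumes every non-key TSV column also exists in the token's field dict.
    """
    for row in rows:
        tok = tokens[(row["_sentid"], row["_tokoffset"])]
        for field, new in row.items():
            if field in REQUIRED or new == tok[field]:
                continue
            if field not in EDITABLE:
                raise ValueError(f"edit to prohibited field {field!r}")
            tok[field] = new  # no validation here; other scripts do that
    return tokens


tsv = "# commit abc123\n_sentid\t_tokoffset\tlemma\tss\ns1\t2\tin\tp.Locus\n"
tokens = {("s1", "2"): {"lemma": "in", "ss": "p.Time"}}
warnings, rows = parse_edited_tsv(tsv, "abc123")
apply_edits(rows, tokens)
```

Comparing against the data rather than the original tquery output, as here, avoids needing a second file, at the cost of depending on the commit check to catch stale edits.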

nschneid commented on June 30, 2024

For modifying MWEs as well as tags, it would be nice to be able to edit an inline format of the kind produced by streusvis.py, but enhanced to include lexcats.

For that, we need to be able to parse the format. mwerender.py defines render(); we need to add unrender(). It could work as follows:

Signature:

import re

def unrender(rendered, toks):
    # Tokens must be nonempty and must not contain spaces
    assert not any((not t) or ' ' in t for t in toks)
    ...
    return bio_tagging

Given the sentence with MWE and tag markup, construct a regex to identify which characters belong to tokens and which are markup. As we know the tokens, we can avoid assumptions about their characters (they may contain _, ~, and |).

if len(toks)==1:
    reMarkup = rf'^(?P<t0>{re.escape(toks[0])})((?P<T0>\|[^ _~]+)?)$'
elif len(toks)==2: # no gaps allowed
    # Parentheses are needed so the adjacent string literals are concatenated
    reMarkup = (rf'^(?P<t0>{re.escape(toks[0])})((?P<T0>\|[^ _~]+)?[ ~]|_)'
                rf'(?P<t{len(toks)-1}>{re.escape(toks[-1])})(?P<T{len(toks)-1}>\|[^ _~]+)?$')
else:
    reMarkup = rf'^(?P<t0>{re.escape(toks[0])})((?P<T0>\|[^ _~]+)?( |~ ?)|_ ?)'
    for i in range(1,len(toks)-2):
        reMarkup += rf'(?P<t{i}>{re.escape(toks[i])})((?P<T{i}>\|[^ _~]+)?( |~ ?| [~_])|_ ?)'
    # Parentheses are needed so the adjacent string literals are concatenated
    reMarkup += (rf'(?P<t{len(toks)-2}>{re.escape(toks[-2])})'
                 rf'((?P<T{len(toks)-2}>\|[^ _~]+)?( | ?~| _)|_)'
                 rf'(?P<t{len(toks)-1}>{re.escape(toks[-1])})(?P<T{len(toks)-1}>\|[^ _~]+)?$')
matches = re.match(reMarkup, rendered)
if not matches:
    raise ValueError(f'Invalid markup: {rendered}')
groups = matches.groupdict()   # regex named groups, not MWE groups
# Groups t0, t1, ..., tn match the tokens
# Groups T0, T1, ..., Tn match the supersense/lexcat tags where present
# Everything else is markup. Note that this does not fully validate the markup; unclosed gaps are allowed, and labels on strong MWEs are optional.
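One detail of Python's `re` module matters when using these groups: `groupdict()` always contains a key for every named group, with value `None` when an optional group did not participate in the match, and `Match.end()` returns -1 for such a group. The sketch below shows this with a deliberately simplified two-token pattern (the `sep` group and the example tokens are assumptions for illustration, not the pattern from this thread).

```python
import re

toks = ["a_b", "c"]  # a token may itself contain a markup character
pattern = (rf'^(?P<t0>{re.escape(toks[0])})(?P<T0>\|[^ _~]+)?'
           rf'(?P<sep>[ _~])'
           rf'(?P<t1>{re.escape(toks[1])})(?P<T1>\|[^ _~]+)?$')

m = re.match(pattern, "a_b_c|N")
groups = m.groupdict()

# The key for an unmatched optional group is present, but its value is None.
has_key = "T0" in groups
tag0 = groups["T0"]
end_T0 = m.end("T0")  # -1 because T0 did not participate in the match

# So the end of token 0 plus its optional tag has to be chosen by value:
last = "T0" if groups["T0"] is not None else "t0"
separator = "a_b_c|N"[m.end(last):m.start("t1")]
print(has_key, tag0, end_T0, repr(separator))
```

Because the token text is matched literally via `re.escape`, the underscore inside `a_b` is not mistaken for MWE markup; only the underscore between the two tokens is.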

For each token as it occurs in the rendered string, look at the characters immediately left and right (ignoring the tag if present) to determine the appropriate BIO tag:

bio_tagging = []
ingap = False   # '_' or '~' while inside a gap; must persist across tokens
for i in range(len(toks)):
    # l, r = MWE markup/spaces on left and right
    if i==0: l = '^'
    else:
        # groupdict() has a key for every named group; an unmatched tag is None
        prev = f'T{i-1}' if groups[f'T{i-1}'] is not None else f't{i-1}'
        l = rendered[matches.end(prev):matches.start(f't{i}')]

    if i==len(toks)-1: r = '$'
    else:
        cur = f'T{i}' if groups[f'T{i}'] is not None else f't{i}'
        r = rendered[matches.end(cur):matches.start(f't{i+1}')]

    assert l in {' ', '_', '~', '_ ', '~ ', ' _', ' ~', '^'}
    assert r in {' ', '_', '~', '_ ', '~ ', ' _', ' ~', '$'}
    if i>0 and l=='_': tag = 'i_' if ingap else 'I_'
    elif i>0 and l=='~': tag= 'i~' if ingap else 'I~'
    elif i>0 and l==' _': assert ingap=='_'; ingap = False; tag = 'I_'
    elif i>0 and l==' ~': assert ingap=='~'; ingap = False; tag = 'I~'
    # ' _' and ' ~' on the right close a gap, so they do not join this token to the next
    elif r in {' ', ' _', ' ~', '$'}: tag = 'o' if ingap else 'O'
    else: tag = 'b' if ingap else 'B'

    bio_tagging.append(tag)
    if r=='_ ' or r=='~ ': assert not ingap; ingap = r.strip()
assert not ingap

Generate the labeled BIO tags and infer the rest of the JSON from that. For any sentence where the lextags have changed:
1. Convert the JSON for the sentence to conllulex. Requires updating json2conllulex.py so the conversion can happen without an actual file.
2. Strip out the lexical semantic analyses. Requires an update to conllulex2UDlextag.py.
3. Substitute the modified lextags in the last column.
4. Convert the modified UDlextag to JSON. Requires an update to UDlextag2json.py.
5. Re-render the sentence to make sure it matches what the user specified.
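Step 3 could look roughly like the sketch below. It assumes a CoNLL-U-style sentence block in which comment lines start with `#`, every other nonempty line is one token with tab-separated columns, and the lextag is the last column. The function name `substitute_lextags` and the three-column toy data are hypothetical; real UDlextag lines have more columns.

```python
def substitute_lextags(sent_block, new_lextags):
    """Replace the last column of each token line with the given lextag."""
    lines = sent_block.split("\n")
    # Assumption: every nonempty, non-comment line is exactly one token.
    tok_idx = [k for k, ln in enumerate(lines)
               if ln and not ln.startswith("#")]
    if len(tok_idx) != len(new_lextags):
        raise ValueError(f"{len(tok_idx)} token lines but "
                         f"{len(new_lextags)} lextags")
    for k, tag in zip(tok_idx, new_lextags):
        cols = lines[k].split("\t")
        cols[-1] = tag
        lines[k] = "\t".join(cols)
    return "\n".join(lines)


# Toy block: ID, FORM, and a placeholder lextag column.
block = "# sent_id = s1\n1\tgive\t_\n2\tup\t_"
updated = substitute_lextags(block, ["B-X", "I_"])
print(updated)
```

Raising on a length mismatch keeps a misaligned edit from silently shifting tags onto the wrong tokens before the re-rendering check in step 5.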
