Comments (4)

nschneid commented on June 30, 2024

Need to determine whether tquery.py can print token IDs (for updateability). Rather than displaying the full sentence as a string with the target token highlighted, maybe display the left and right context in separate columns, and cap their length. Consider showing MWE markup in the context.
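A rough sketch of the capped left/right context columns, assuming whitespace-joined tokens and a character cap. The names `context_columns` and `max_chars` are hypothetical and not part of tquery.py; token IDs and MWE markup are left out here.

```python
def context_columns(toks, i, max_chars=40):
    """Split a sentence into (left context, target token, right context).

    toks: list of token strings; i: 0-based index of the target token.
    Contexts longer than max_chars are truncated on the side farthest
    from the target and marked with an ellipsis.
    """
    left = ' '.join(toks[:i])
    right = ' '.join(toks[i + 1:])
    if len(left) > max_chars:
        # keep the characters nearest the target token
        left = '…' + left[len(left) - (max_chars - 1):]
    if len(right) > max_chars:
        right = right[:max_chars - 1] + '…'
    return left, toks[i], right


toks = 'The staff went out of their way to help us'.split()
print('\t'.join(context_columns(toks, 3, max_chars=10)))
```

Truncating on the far side keeps the words adjacent to the target visible, which is usually what matters when scanning a column of hits.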

from streusle.

nschneid commented on June 30, 2024

Option to show full tagging in context, like in streusvis.py?

nschneid commented on June 30, 2024

Spec for tupdate.py:

INPUT: streusle.json edited_tquery_output_tsv

  1. Make sure the 2 header rows are present in edited_tquery_output_tsv
    • The first header row of edited_tquery_output_tsv contains the commit hash. Warn if that does not match the current git commit hash (this could indicate that the data has been modified since tquery.py was run).
    • The second header row of edited_tquery_output_tsv specifies the column headers. Ensure _sentid and _tokoffset are present, and at least one of {ss, ss2, lexcat}.
  2. Check for edits to prohibited fields
    • For each row and field: compare against the JSON (or original tquery output file?) to see if changes have been made. If the field is anything other than ss, ss2, or lexcat, throw an error.
  3. Implement token edits to ss, ss2, and lexcat by updating the JSON data structure. Do not validate these fields as other scripts will do that. Regenerate lextag accordingly.
  4. Print an updated JSON (to be converted back to conllulex).
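Steps 1 and 2 could look roughly like the sketch below. Several things are assumptions for illustration: the commit hash is taken to appear as a cell of the first header row, the comparison is made against a hypothetical lookup `original` keyed by (`_sentid`, `_tokoffset`) as strings rather than against the JSON directly, and `check_edited_tsv` is an invented name.

```python
import csv
import warnings

EDITABLE = {'ss', 'ss2', 'lexcat'}
REQUIRED = {'_sentid', '_tokoffset'}


def check_edited_tsv(lines, current_commit, original):
    """Validate an edited TSV and return the permitted edits.

    lines: iterable of TSV lines (2 header rows, then data rows).
    current_commit: hash of the current git commit.
    original: dict mapping (sentid, tokoffset) -> {field: value}
              (hypothetical lookup built from the unedited data).
    Returns a list of ((sentid, tokoffset), field, new_value).
    """
    rows = list(csv.reader(lines, delimiter='\t', quoting=csv.QUOTE_NONE))
    if len(rows) < 2:
        raise ValueError('expected 2 header rows')
    # Assumption: the hash is one of the cells of the first header row
    if current_commit not in rows[0]:
        warnings.warn('commit hash differs; data may have changed since tquery.py was run')
    header = rows[1]
    missing = REQUIRED - set(header)
    if missing:
        raise ValueError(f'missing columns: {sorted(missing)}')
    if not EDITABLE & set(header):
        raise ValueError('need at least one of ss, ss2, lexcat')

    edits = []
    for row in rows[2:]:
        rec = dict(zip(header, row))
        key = (rec['_sentid'], rec['_tokoffset'])
        for field, value in rec.items():
            if field in REQUIRED:
                continue
            if value != original[key].get(field, ''):
                if field not in EDITABLE:
                    raise ValueError(f'{key}: field {field!r} may not be edited')
                edits.append((key, field, value))
    return edits
```

The returned list would then drive step 3, with each entry applied to the matching token in the JSON before lextag is regenerated.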

nschneid commented on June 30, 2024

For modifying MWEs as well as tags, it would be nice to be able to edit an inline format of the kind produced by streusvis.py, but enhanced to include lexcats.

For that, we need to be able to parse the format. mwerender.py defines render(); we need to add unrender(). It could work as follows:

Signature:

import re

def unrender(rendered, toks):
    # tokens must be nonempty and must not contain spaces
    assert not any((not t) or ' ' in t for t in toks)
    ...
    return bio_tagging

  1. Given the sentence with MWE and tag markup, construct a regex to identify which characters belong to tokens and which are markup. As we know the tokens, we can avoid assumptions about their characters (they may contain _, ~, and |).

    if len(toks)==1:
        reMarkup = rf'^(?P<t0>{re.escape(toks[0])})((?P<T0>\|[^ _~]+)?)$'
    elif len(toks)==2: # no gaps allowed
        reMarkup = (rf'^(?P<t0>{re.escape(toks[0])})((?P<T0>\|[^ _~]+)?[ ~]|_)'
                    rf'(?P<t1>{re.escape(toks[1])})(?P<T1>\|[^ _~]+)?$')
    else:
        n = len(toks)
        # first token: may open an MWE or a gap, but cannot close a gap
        reMarkup = rf'^(?P<t0>{re.escape(toks[0])})((?P<T0>\|[^ _~]+)?( |~ ?)|_ ?)'
        # middle tokens: may open or close a gap
        for i in range(1, n-2):
            reMarkup += rf'(?P<t{i}>{re.escape(toks[i])})((?P<T{i}>\|[^ _~]+)?( |~ ?| [~_])|_ ?)'
        # penultimate token may close a gap but not open one; last token ends the string
        reMarkup += (rf'(?P<t{n-2}>{re.escape(toks[-2])})'
                     rf'((?P<T{n-2}>\|[^ _~]+)?( | ?~| _)|_)'
                     rf'(?P<t{n-1}>{re.escape(toks[-1])})(?P<T{n-1}>\|[^ _~]+)?$')
    matches = re.match(reMarkup, rendered)
    if not matches:
        raise ValueError(f'Invalid markup: {rendered}')
    groups = matches.groupdict()   # regex named groups, not MWE groups
    # Groups t0, t1, ..., tn match the tokens
    # Groups T0, T1, ..., Tn match the supersense/lexcat tags where present
    # (groupdict() maps a Tn group that did not participate to None)
    # Everything else is markup. Note that this does not fully validate the markup;
    # unclosed gaps are allowed, and labels on strong MWEs are optional.
  2. For each token as it occurs in the rendered string, look at the characters immediately left and right (ignoring the tag if present) to determine the appropriate BIO tag:

    bio_tagging = []
    ingap = False   # False outside a gap; '_' or '~' while inside a strong/weak gap
    for i in range(len(toks)):
        # l, r = MWE markup/spaces on left and right
        if i==0: l = '^'
        else:
            # groupdict() contains every named group, so test for None rather than membership
            prev = f'T{i-1}' if groups[f'T{i-1}'] is not None else f't{i-1}'
            l = rendered[matches.end(prev):matches.start(f't{i}')]

        if i==len(toks)-1: r = '$'
        else:
            cur = f'T{i}' if groups[f'T{i}'] is not None else f't{i}'
            r = rendered[matches.end(cur):matches.start(f't{i+1}')]

        assert l in {' ', '_', '~', '_ ', '~ ', ' _', ' ~', '^'}
        assert r in {' ', '_', '~', '_ ', '~ ', ' _', ' ~', '$'}
        if i>0 and l=='_': tag = 'i_' if ingap else 'I_'
        elif i>0 and l=='~': tag = 'i~' if ingap else 'I~'
        elif i>0 and l==' _': assert ingap=='_'; ingap = False; tag = 'I_'
        elif i>0 and l==' ~': assert ingap=='~'; ingap = False; tag = 'I~'
        # a following gap closer (' _' or ' ~') does not attach to this token
        elif r in {' ', '$', ' _', ' ~'}: tag = 'o' if ingap else 'O'
        else: tag = 'b' if ingap else 'B'

        bio_tagging.append(tag)
        if r=='_ ' or r=='~ ': assert not ingap; ingap = r.strip()
    assert not ingap
  3. Generate the labeled BIO tags and infer the rest of the JSON from that. For any sentence where the lextags have changed:

    1. Convert the JSON for the sentence to conllulex. Requires updating json2conllulex.py so the conversion can happen without an actual file.
    2. Strip out the lexical semantic analyses. Requires an update to conllulex2UDlextag.py.
    3. Substitute the modified lextags in the last column.
    4. Convert the modified UDlextag to JSON. Requires an update to UDlextag2json.py.
    5. Re-render the sentence to make sure it matches what the user specified.
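To exercise the tagging rules from step 2 in isolation, here is a self-contained sketch that takes the markup strings between consecutive tokens directly, skipping the regex. `tags_from_joiners` is a hypothetical helper, and it raises `ValueError` where the code above uses assertions.

```python
def tags_from_joiners(joiners):
    """Map the markup between consecutive tokens to BIO-style MWE tags.

    joiners[i] is the markup between token i and token i+1, one of
    ' ', '_', '~', '_ ', '~ ', ' _', ' ~'.
    """
    n = len(joiners) + 1
    tags = []
    ingap = ''  # '' outside a gap, else '_' or '~' (strength of the gappy MWE)
    for i in range(n):
        l = joiners[i - 1] if i > 0 else '^'
        r = joiners[i] if i < n - 1 else '$'
        if l in ('_', '~'):
            # continues the MWE of the previous token
            tag = ('i' if ingap else 'I') + l
        elif l in (' _', ' ~'):
            # first token after a gap resumes the gappy MWE
            if ingap != l.strip():
                raise ValueError('gap closer without matching opener')
            ingap = ''
            tag = 'I' + l.strip()
        elif r in (' ', '$', ' _', ' ~'):
            # nothing attaches on either side
            tag = 'o' if ingap else 'O'
        else:
            # r is '_', '~', '_ ' or '~ ': this token begins an MWE
            tag = 'b' if ingap else 'B'
        tags.append(tag)
        if r in ('_ ', '~ '):
            if ingap:
                raise ValueError('nested gaps are not allowed')
            ingap = r.strip()
    if ingap:
        raise ValueError('unclosed gap')
    return tags


# "put_ it _off": strong MWE put..off with "it" in the gap
print(tags_from_joiners(['_ ', ' _']))   # ['B', 'o', 'I_']
```

Note that a token inside a gap whose right-hand markup is a gap closer (' _' or ' ~') is tagged `o`, since the closer belongs to the token that follows it.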
