Comments (4)
Need to determine whether tquery.py can print token IDs (for updateability). Rather than displaying the full sentence as a string with the target token highlighted, maybe display the left and right context in separate columns, and cap their length. Consider showing MWE markup in the context.
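A minimal sketch of the left/right context idea, assuming the sentence is available as a list of token strings. The function name `context_columns` and the `max_chars` parameter are hypothetical and not part of tquery.py:

```python
def context_columns(toks, i, max_chars=40):
    """Split a sentence into (left context, target token, right context).

    Hypothetical helper: each context is capped at max_chars characters,
    keeping the text nearest the target and marking truncation with '...'.
    Assumes max_chars >= 4.
    """
    left = ' '.join(toks[:i])
    right = ' '.join(toks[i + 1:])
    if len(left) > max_chars:
        # keep the END of the left context (closest to the target)
        left = '...' + left[-(max_chars - 3):]
    if len(right) > max_chars:
        # keep the START of the right context (closest to the target)
        right = right[:max_chars - 3] + '...'
    return left, toks[i], right


if __name__ == '__main__':
    toks = 'the quick brown fox jumps over the lazy dog'.split()
    # one TSV row: left context, target, right context
    print('\t'.join(context_columns(toks, 4, max_chars=10)))
```

Truncating from the outside in keeps the words adjacent to the target visible, which is usually what matters when reviewing an annotation.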
from streusle.
Option to show full tagging in context, like in streusvis.py?
Spec for tupdate.py:
INPUT: streusle.json edited_tquery_output_tsv
- Make sure the 2 header rows are present in edited_tquery_output_tsv
- The first header row of edited_tquery_output_tsv contains the commit hash. Warn if that does not match the current git commit hash (this could indicate that the data has been modified since tquery.py was run).
- The second header row of edited_tquery_output_tsv specifies the column headers. Ensure _sentid and _tokoffset are present, and at least one of {ss, ss2, lexcat}.
- Check for edits to prohibited fields
  - For each row and field: compare against the JSON (or original tquery output file?) to see if changes have been made. If a changed field is anything other than ss, ss2, or lexcat, throw an error.
- Implement token edits to ss, ss2, and lexcat by updating the JSON data structure. Do not validate these fields as other scripts will do that. Regenerate lextag accordingly.
- Print an updated JSON (to be converted back to conllulex).
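The header and prohibited-field checks in this spec could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names `check_headers` and `prohibited_edits` are hypothetical, the exact layout of the first header row is not specified above (so the sketch only tests whether the current commit hash occurs somewhere in it), and rows are assumed to be available as dicts keyed by column name.

```python
EDITABLE = {'ss', 'ss2', 'lexcat'}
REQUIRED = {'_sentid', '_tokoffset'}


def check_headers(lines, current_commit):
    """Validate the two header rows of an edited tquery TSV (hypothetical helper).

    lines: the file's lines, without trailing newlines.
    current_commit: hash of the current git commit, e.g. the output of
        `git rev-parse HEAD`, obtained by the caller.
    Returns (column names, list of warning strings).
    """
    if len(lines) < 2:
        raise ValueError('expected 2 header rows')
    warnings = []
    # ASSUMPTION: the hash appears verbatim somewhere in the first row
    if current_commit not in lines[0]:
        warnings.append('commit hash in file does not match current commit; '
                        'data may have changed since tquery.py was run')
    columns = lines[1].split('\t')
    missing = REQUIRED - set(columns)
    if missing:
        raise ValueError(f'missing required columns: {sorted(missing)}')
    if not EDITABLE & set(columns):
        raise ValueError('need at least one of ss, ss2, lexcat')
    return columns, warnings


def prohibited_edits(original, edited):
    """Names of columns whose value changed but which may not be edited."""
    return sorted(k for k in edited
                  if edited[k] != original.get(k) and k not in EDITABLE)
```

A caller would raise an error whenever `prohibited_edits` returns a non-empty list for some row, and otherwise apply the changed ss, ss2, and lexcat values to the JSON.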
For modifying MWEs as well as tags, it would be nice to be able to edit an inline format of the kind produced by streusvis.py, but enhanced to include lexcats.
For that, we need to be able to parse the format. mwerender.py defines render(); we need to add unrender(). It could work as follows:
Signature:

    def unrender(rendered, toks):
        assert not any((not t) or ' ' in t for t in toks)
        ...
        return bio_tagging
- Given the sentence with MWE and tag markup, construct a regex to identify which characters belong to tokens and which are markup. As we know the tokens, we can avoid assumptions about their characters (they may contain _, ~, and |).

    import re  # at module level

    if len(toks)==1:
        # anchored with $ like the other cases, so trailing junk is rejected
        reMarkup = rf'^(?P<t0>{re.escape(toks[0])})((?P<T0>\|[^ _~]+)?)$'
    elif len(toks)==2:  # no gaps allowed
        reMarkup = (rf'^(?P<t0>{re.escape(toks[0])})((?P<T0>\|[^ _~]+)?[ ~]|_)'
                    rf'(?P<t{len(toks)-1}>{re.escape(toks[-1])})(?P<T{len(toks)-1}>\|[^ _~]+)?$')
    else:
        reMarkup = rf'^(?P<t0>{re.escape(toks[0])})((?P<T0>\|[^ _~]+)?( |~ ?)|_ ?)'
        for i in range(1,len(toks)-2):
            reMarkup += rf'(?P<t{i}>{re.escape(toks[i])})((?P<T{i}>\|[^ _~]+)?( |~ ?| [~_])|_ ?)'
        reMarkup += (rf'(?P<t{len(toks)-2}>{re.escape(toks[-2])})'
                     rf'((?P<T{len(toks)-2}>\|[^ _~]+)?( | ?~| _)|_)'
                     rf'(?P<t{len(toks)-1}>{re.escape(toks[-1])})(?P<T{len(toks)-1}>\|[^ _~]+)?$')
    matches = re.match(reMarkup, rendered)
    if not matches:
        raise ValueError(f'Invalid markup: {rendered}')
    groups = matches.groupdict()  # regex named groups, not MWE groups
    # Groups t0, t1, ..., tn match the tokens
    # Groups T0, T1, ..., Tn match the supersense/lexcat tags where present
    # Everything else is markup. Note that this does not fully validate the markup;
    # unclosed gaps are allowed, and labels on strong MWEs are optional.
- For each token as it occurs in the rendered string, look at the characters immediately left and right (ignoring the tag if present) to determine the appropriate BIO tag:

    bio_tagging = []
    ingap = False  # '_' or '~' while inside a gap; set once, before the loop
    for i in range(len(toks)):
        # l, r = MWE markup/spaces on left and right
        if i==0:
            l = '^'
        else:
            # groupdict() holds every named group (None if it did not match),
            # so test the value rather than key membership
            prev = f'T{i-1}' if groups[f'T{i-1}'] is not None else f't{i-1}'
            l = rendered[matches.end(prev):matches.start(f't{i}')]
        if i==len(toks)-1:
            r = '$'
        else:
            cur = f'T{i}' if groups[f'T{i}'] is not None else f't{i}'
            r = rendered[matches.end(cur):matches.start(f't{i+1}')]
        assert l in {' ', '_', '~', '_ ', '~ ', ' _', ' ~', '^'}
        assert r in {' ', '_', '~', '_ ', '~ ', ' _', ' ~', '$'}
        if i>0 and l=='_':
            tag = 'i_' if ingap else 'I_'
        elif i>0 and l=='~':
            tag = 'i~' if ingap else 'I~'
        elif i>0 and l==' _':
            assert ingap=='_'; ingap = False; tag = 'I_'
        elif i>0 and l==' ~':
            assert ingap=='~'; ingap = False; tag = 'I~'
        elif r in {' ', '$', ' _', ' ~'}:
            # ' _' or ' ~' on the right closes the gap; this token starts no MWE
            tag = 'o' if ingap else 'O'
        else:
            tag = 'b' if ingap else 'B'
        bio_tagging.append(tag)
        if r=='_ ' or r=='~ ':
            assert not ingap; ingap = r.strip()
    assert not ingap
- Generate the labeled BIO tags and infer the rest of the JSON from that. For any sentence where the lextags have changed:
  - Convert the JSON for the sentence to conllulex. Requires updating json2conllulex.py so the conversion can happen without an actual file.
  - Strip out the lexical semantic analyses. Requires an update to conllulex2UDlextag.py.
  - Substitute the modified lextags in the last column.
  - Convert the modified UDlextag to JSON. Requires an update to UDlextag2json.py.
  - Re-render the sentence to make sure it matches what the user specified.
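As a reduced illustration of the regex approach proposed for unrender(), here is a self-contained sketch that handles only contiguous strong MWEs (tokens joined by `_`), with no gaps and no weak `~` links. The name `unrender_simple` and the `j{i}` joiner groups are hypothetical; the point is only to show how escaping the known tokens lets a token that itself contains `_` be told apart from markup:

```python
import re


def unrender_simple(rendered, toks):
    """Recover O/B/I_ tags and optional |labels from a simplified rendering.

    Simplifying ASSUMPTIONS (not the full proposal): strong MWEs only,
    no gaps, no weak links. Returns (bio_tags, labels).
    """
    assert toks and not any((not t) or ' ' in t for t in toks)
    parts = []
    for i, t in enumerate(toks):
        # t{i} = the token itself, T{i} = its optional |label
        parts.append(rf'(?P<t{i}>{re.escape(t)})(?P<T{i}>\|[^ _~]+)?')
        if i < len(toks) - 1:
            # j{i} = markup between token i and i+1: space or strong joiner
            parts.append(rf'(?P<j{i}>[ _])')
    m = re.match('^' + ''.join(parts) + '$', rendered)
    if not m:
        raise ValueError(f'Invalid markup: {rendered}')
    tags = []
    for i in range(len(toks)):
        left = m.group(f'j{i-1}') if i > 0 else None
        right = m.group(f'j{i}') if i < len(toks) - 1 else None
        if left == '_':
            tags.append('I_')   # continues a strong MWE
        elif right == '_':
            tags.append('B')    # begins a strong MWE
        else:
            tags.append('O')    # not part of any MWE
    labels = [m.group(f'T{i}') for i in range(len(toks))]
    return tags, labels


if __name__ == '__main__':
    # the first token contains '_' itself; it is still matched as one token
    print(unrender_simple('a_b_c|X d|Y', ['a_b', 'c', 'd']))
```

Because each token is matched literally, the underscore inside `a_b` is consumed as token text and only the second underscore is read as an MWE joiner.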