Comments (12)
Hi @aggarwalpiush :)
The issue is that cassis is meant to be a generic, type-system-agnostic library, i.e. it should support any UIMA type system. In fact, we have users who use e.g. the cTAKES type system and may not work with DKPro Core at all. So we would need some way of
@jcklie and I have been throwing around a number of ideas, e.g.
- passing a strategy to the CAS constructor which would monkey-patch the CAS instance and add convenience methods:
  cas = CAS(DKPro_Core); cas.get_tokens()
  - but there would be no IDE auto-completion support
- using some kind of generic typing, e.g.
  cas = CAS(); cas.$(DKPro_Core).get_tokens()
  - where $ would be a method returning the type passed to it as an argument; but apparently Python doesn't support this kind of trick (Java does) and there would be no auto-completion in the IDE
- an extension mechanism like the one Pandas has; but again no auto-complete support
- simply using static functions:
  from dkpro_core.accessors import get_tokens; get_tokens(cas)
  - at least some IDE auto-complete support, but not necessarily a nice API
- subclassing the CAS:
  cas = DKProCoreCAS()
  - has IDE auto-complete support, but honestly I don't like it because IMHO it doesn't separate concerns sufficiently. E.g. what if you want to use a CAS object with different type systems, e.g. DKPro Core plus your own type system? Nah...
- wrapping the CAS with an accessor which implements the same interface as the CAS:
  cas = DKPro_Core(CAS()); cas.get_tokens()
  - has IDE auto-completion support and also you could wrap the same CAS object with different accessors if you wanted to work with multiple type systems
... so the wrapper approach seems to us the most promising one for the moment. Also, cassis doesn't need to be extended to support it.
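A minimal sketch of what such a wrapper could look like, purely as an illustration. `MiniCas`, `Annotation` and `DKProCore` are hypothetical stand-ins invented for this example; they are not part of cassis, and the only thing assumed about a real CAS is a `select(type_name)` method and a `sofa_string` attribute.

```python
from dataclasses import dataclass
from typing import Iterator, List

# Fully qualified name of the DKPro Core token type
TOKEN = "de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token"


@dataclass
class Annotation:
    """Stand-in for a feature structure with begin/end offsets (hypothetical)."""
    type_name: str
    begin: int
    end: int


class MiniCas:
    """Tiny stand-in for a CAS with only what this sketch needs (hypothetical)."""

    def __init__(self, text: str, annotations: List[Annotation]):
        self.sofa_string = text
        self._annotations = annotations

    def select(self, type_name: str) -> Iterator[Annotation]:
        # Lazily yield the annotations of the requested type
        return (a for a in self._annotations if a.type_name == type_name)


class DKProCore:
    """Hypothetical accessor: wraps a CAS and adds type-system-specific helpers."""

    def __init__(self, cas):
        self._cas = cas

    def get_tokens(self) -> List[Annotation]:
        # Explicitly defined, so an IDE can offer it for auto-completion
        return list(self._cas.select(TOKEN))

    def __getattr__(self, name):
        # Only called for attributes not found on the wrapper itself:
        # delegate them to the wrapped CAS.
        return getattr(self._cas, name)


cas = MiniCas("Hello world", [Annotation(TOKEN, 0, 5), Annotation(TOKEN, 6, 11)])
dkpro = DKProCore(cas)
tokens = dkpro.get_tokens()
token_texts = [dkpro.sofa_string[t.begin:t.end] for t in tokens]
```

The same `cas` object could be wrapped by a second accessor class for another type system without touching the CAS itself. Note that attributes reached through `__getattr__` delegation would not be auto-completed; only the explicitly defined helpers would.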
That said ...
cas.select(TOKEN).as_text()
This is something which I think would be really nice to have.
from dkpro-cassis.
Wouldn't that also be type system specific?
cas.select(TOKEN).as_text() # token.getCoveredText()
cas.select(LEMMA).as_text() # lemma.getValue()
from dkpro-cassis.
If we imagine TOKEN and LEMMA to be type name string constants - no.
from dkpro-cassis.
How would cassis know what feature to use for as_text()?
from dkpro-cassis.
In Python, one would normally just use a list comprehension for that, e.g.
values = [x.value for x in cas.select(LEMMA)]
from dkpro-cassis.
For as_text(), we would use get_covered_text(), not a feature value.
from dkpro-cassis.
This would somewhat diminish the usefulness, as many types beyond token would not return useful results. If we use an accessor, couldn't it decide to return different feature values depending on the type?
from dkpro-cassis.
It probably could, but it could be confusing. E.g. if as_text() returns the covered text for tokens but, say, the entity type for entities, I would find that confusing. How would I get the covered text of an entity? If you wanted to introduce a convenience accessor for "the most commonly used feature value", I would find it sensible for it to have a different name, e.g. as_value(). This could e.g. return the "value" feature for named entities (instead of the "identifier" feature) or the "PosValue" feature for POS tags (instead of the "CoarseValue" feature).
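To make the as_value() idea concrete, here is a rough sketch under stated assumptions: `DEFAULT_FEATURE` and the free function `as_value` are hypothetical names made up for this example, and the annotations are plain `SimpleNamespace` objects rather than real feature structures.

```python
from types import SimpleNamespace

# Hypothetical mapping: type name -> its "most commonly used" feature
DEFAULT_FEATURE = {
    "NamedEntity": "value",  # rather than "identifier"
    "POS": "PosValue",       # rather than "CoarseValue"
}


def as_value(annotations, type_name):
    """Return the default feature value of each annotation of the given type."""
    feature = DEFAULT_FEATURE[type_name]
    return [getattr(a, feature) for a in annotations]


# Stand-in annotations; real feature structures would come from cas.select(...)
pos_tags = [
    SimpleNamespace(PosValue="NN", CoarseValue="NOUN"),
    SimpleNamespace(PosValue="VBZ", CoarseValue="VERB"),
]
entities = [SimpleNamespace(value="PER", identifier="id-1")]

pos_values = as_value(pos_tags, "POS")
entity_values = as_value(entities, "NamedEntity")
```

Because the mapping is type-system specific, it would most naturally live in an accessor rather than in cassis itself, and as_text() could stay reserved for the covered text.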
from dkpro-cassis.
- There should be a way to access feature values of annotations.
- I would find it confusing if cas.select(TOKEN).as_text() and cas.select(POS).as_text() would return the same values (as they would do now, right?)
from dkpro-cassis.
There is a way to access feature values, e.g. as @jcklie illustrated:
values = [x.value for x in cas.select(LEMMA)]
x.value reads the feature "value" on the feature structure x. You can also write to the feature: x.value = "value".
Right now, as_text() does not exist. cas.select(XXX) returns a generator, i.e. not a list, so evaluation is lazy. That is why we currently cannot easily add methods to it; we also cannot easily figure out whether the result is non-empty. We have been looking e.g. at https://more-itertools.readthedocs.io/en/stable/api.html#more_itertools.peekable or considered returning a list; no final decision for the time being. I think it would be good if cas.select(xxx) returned something we can define methods on, maybe some kind of lazily evaluated iterable, to eventually allow mirroring the UIMAv3 select API, or at least a Pythonic version of it.
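One possible shape for such a result object, as a sketch only and using nothing beyond the standard library. `SelectResult`, `is_empty` and `as_text` are hypothetical names for illustration; none of this exists in cassis. The emptiness check peeks at the first element and pushes it back, similar in spirit to what a peekable iterator offers.

```python
from itertools import chain
from typing import Callable, Generic, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


class SelectResult(Generic[T]):
    """Hypothetical lazily evaluated iterable that can carry extra methods."""

    def __init__(self, source: Iterable[T], covered_text: Callable[[T], str]):
        self._iterator: Iterator[T] = iter(source)
        self._covered_text = covered_text

    def __iter__(self) -> "SelectResult[T]":
        return self

    def __next__(self) -> T:
        return next(self._iterator)

    def is_empty(self) -> bool:
        # Peek at the first element without losing it
        try:
            first = next(self._iterator)
        except StopIteration:
            return True
        self._iterator = chain([first], self._iterator)
        return False

    def as_text(self) -> List[str]:
        # Consumes the remaining elements and maps each to its covered text
        return [self._covered_text(x) for x in self]


text = "Hello world"
spans = ((begin, end) for begin, end in [(0, 5), (6, 11)])  # a lazy generator
result = SelectResult(spans, lambda span: text[span[0]:span[1]])

empty_before = result.is_empty()
texts = result.as_text()
empty_result = SelectResult(iter([]), str).is_empty()
```

Like a generator, the object can only be consumed once; returning a list instead would trade laziness for repeatable iteration.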
from dkpro-cassis.
I will track the extension mechanism in #83 and the extension methods you want here so that we do not mix up the issues.
from dkpro-cassis.