Comments (37)
Do you have a typesystem xml which I can use for that?
from dkpro-cassis.
You mean for testing purposes?
I think it would be the job of the other library to provide the initializer. Although for DKPro, Cassis could also provide it directly :)
from dkpro-cassis.
The idea with the initializer is a bit more than just loading type system X.
The idea is that the initializer patches the CAS instance with additional methods, e.g.
cas.get_tokens()
cas.get_tokens_as_text()
cas.get_sentences()
cas.get_sentences_as_text()
cas.get_pos_tags()
cas.get_named_entities()
...
... and that we could e.g. have one for DKPro Core and another one for, say, cTAKES, and both would patch the CAS with the same convenience methods but internally resort to different select statements.
The initializer would work like a visitor, e.g.
Cas(DKProCoreTypeSystem()) triggers a call to DKProCoreTypeSystem.apply(cas).
@jcklie such a thing works with Python, right?
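On whether this works in Python: a minimal sketch suggests it does. Everything below is an assumption for illustration. The Cas class is a tiny stand-in rather than the cassis one, the apply() hook and both initializer classes are hypothetical, and the cTAKES type name is only illustrative.

```python
import types


class Cas:
    """Tiny stand-in for a generic CAS (hypothetical, not the cassis class)."""

    def __init__(self, initializer=None):
        self._annotations = []  # list of (type_name, covered_text) pairs
        if initializer is not None:
            # Visitor-style hook: the initializer patches this instance.
            initializer.apply(self)

    def add(self, type_name, text):
        self._annotations.append((type_name, text))

    def select(self, type_name):
        return [text for name, text in self._annotations if name == type_name]


class _TokenInitializer:
    """Shared patching logic; subclasses only name their token type."""

    TOKEN = None

    def apply(self, cas):
        token_type = self.TOKEN

        def get_tokens_as_text(self):
            return self.select(token_type)

        # Bind the function to this one instance, not to the Cas class.
        cas.get_tokens_as_text = types.MethodType(get_tokens_as_text, cas)


class DKProCoreTypeSystem(_TokenInitializer):
    TOKEN = "de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token"


class CTakesTypeSystem(_TokenInitializer):
    # Illustrative type name only.
    TOKEN = "org.apache.ctakes.typesystem.type.syntax.WordToken"


dkpro_cas = Cas(DKProCoreTypeSystem())
dkpro_cas.add(DKProCoreTypeSystem.TOKEN, "Hello")

ctakes_cas = Cas(CTakesTypeSystem())
ctakes_cas.add(CTakesTypeSystem.TOKEN, "World")
```

Both instances expose the same get_tokens_as_text() method but select different types internally. A CAS created without an initializer stays unpatched.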
from dkpro-cassis.
We can provide these methods, though I am not sure about the implementation. My question was: Is there some official DKPro type system XML which I can use, or can you provide me with some Java code to generate it, to keep it in sync with DKPro?
from dkpro-cassis.
The "best" solution for this would probably be to use DKPro Meta :)
from dkpro-cassis.
Well, basically what you do is create a Maven project which has a dependency on all dkpro-core-api-** modules and then call

    TypeSystemDescription dkproCoreTS = TypeSystemDescriptionFactory
            .createTypeSystemDescription();
    try (FileOutputStream out = new FileOutputStream("target/dkpro-core-aggregated-ts.xml")) {
        dkproCoreTS.toXML(out);
    }
from dkpro-cassis.
I would implement it by extending Cas; the constructor then loads the DKPro type system. Simple and not so magic.
from dkpro-cassis.
But then we'd end up having to import CAS from different libraries...
from dkpro-cassis.
I would add it to cassis, so it is from cassis import DKProCas. DKPro to me is an important enough part of the UIMA world to add it to cassis itself.
from dkpro-cassis.
For the moment, I don't feel very comfortable with this. I don't like the idea of the CAS becoming something new just because it contains certain types. The idea of the CAS is that it is a generic data structure. If we subclass it for a particular framework, I feel it goes against this idea.
Actually, the strategy you have shown me OTR for the Pandas accessors looked nice. It makes very clear that there is one generic data structure and there are separately different ways of accessing it.
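A rough sketch of what a pandas-style accessor could look like for a CAS. This is an assumption-laden illustration: the Cas class is a stand-in, and register_cas_accessor, _CachedAccessor and the "dkpro" namespace are hypothetical names modelled on the accessor registration that pandas offers for DataFrames.

```python
class Cas:
    """Stand-in for the generic data structure (not the cassis class)."""

    def __init__(self):
        self._annotations = []  # list of (type_name, covered_text) pairs

    def add(self, type_name, text):
        self._annotations.append((type_name, text))

    def select(self, type_name):
        return [text for name, text in self._annotations if name == type_name]


class _CachedAccessor:
    """Non-data descriptor that builds the accessor on first access."""

    def __init__(self, name, accessor_cls):
        self._name = name
        self._accessor_cls = accessor_cls

    def __get__(self, obj, owner):
        if obj is None:
            # Accessed on the class itself: expose the accessor class.
            return self._accessor_cls
        accessor = self._accessor_cls(obj)
        # Cache on the instance; later lookups hit the instance dict first.
        obj.__dict__[self._name] = accessor
        return accessor


def register_cas_accessor(name):
    """Register an accessor class under cas.<name> (hypothetical helper)."""

    def decorator(accessor_cls):
        setattr(Cas, name, _CachedAccessor(name, accessor_cls))
        return accessor_cls

    return decorator


@register_cas_accessor("dkpro")
class DKProAccessor:
    TOKEN = "de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token"

    def __init__(self, cas):
        self._cas = cas

    def get_tokens_as_text(self):
        return self._cas.select(self.TOKEN)


cas = Cas()
cas.add(DKProAccessor.TOKEN, "Hello")
tokens = cas.dkpro.get_tokens_as_text()
```

The CAS stays one generic data structure; the framework-specific view lives behind the cas.dkpro namespace and could be registered by a separate package.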
from dkpro-cassis.
Is it ok if we implement this in cassis or should it be part of pydkpro?
from dkpro-cassis.
I understand that IDEs may not support auto-complete for such extensions. But I wonder if IDEs like PyCharm really only do static code analysis or also consider whether a method has actually been called somewhere before. E.g. if I call method x.foo() once and later I type y.f... (where y is of the same type as x), then it would be reasonable to offer foo() in the auto-complete (without documentation at least). I wonder if there are hints one can provide to the IDEs to fine-tune the auto-complete, e.g. for scenarios like the extension methods suggested here.
from dkpro-cassis.
Pycharm offers some auto completion based on what was called before (the typing is limited then) and there are stub files where you can maybe add more information: https://mypy.readthedocs.io/en/latest/stubs.html . But it does not know that there is an extension, as it is added at run time (except when I just add it as a field to cassis and throw an error if it is not compatible).
from dkpro-cassis.
The idea of involving cassis came to me because I thought we should/could pass the type system "strategy" to the constructor, i.e. cassis would somehow have to understand the strategy and react to it. If we use a completely different mechanism which does not require cassis to be aware of it, it could be done elsewhere.
A compromise between subtyping and adding dynamically might be a generic type (if such a thing is possible?), e.g.
cas = CAS[DKPro_Core]()
cas.access <= must return an instance of the generic type, e.g. DKPro_Core
cas.access.XXX <= IDE could theoretically know which methods the generic type provides
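One way this could be approximated with typing.Generic is sketched below. It is a variation, not the exact syntax above: the accessor class is passed to the constructor instead of being given in square brackets, so that a static type checker can infer the type parameter. The Cas stand-in and the DKProCore accessor are hypothetical.

```python
from typing import Generic, List, Type, TypeVar

A = TypeVar("A")


class Cas(Generic[A]):
    """Stand-in CAS, parameterized by the type of its accessor."""

    def __init__(self, accessor_cls: Type[A]):
        self._tokens: List[str] = []
        # `access` is declared as A, so tools can infer its methods.
        self.access: A = accessor_cls(self)


class DKProCore:
    """Hypothetical framework-specific accessor."""

    def __init__(self, cas: "Cas[DKProCore]"):
        self._cas = cas

    def get_tokens_as_text(self) -> List[str]:
        return list(self._cas._tokens)


cas = Cas(DKProCore)  # a type checker can infer Cas[DKProCore] here
cas._tokens.append("Hello")
tokens = cas.access.get_tokens_as_text()
```

Whether a given IDE actually offers completion on cas.access depends on how well it resolves the type variable, which is not verified here.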
from dkpro-cassis.
I think we need features from Python 3.8 for that and even then I am unsure. So what we have now is:
- Use the pandas extensions style and have no type hints, let pydkpro implement this. Other people can add nice cas extensions
- Hardcode dkpro, ctakes and more as cas extensions into cassis so that we have type support. Throw an error if the Cas does not conform when using these
- Why not both
from dkpro-cassis.
I think this issue contains two things, the DKPro type system and extension. I will track the type system stuff in #9.
from dkpro-cassis.
I wrote a quick and dirty script to convert a typesystem XMI to Python classes for the DKPro Core type system. One gets type hints for the wrapped CAS and for the accessor, and does not need to redefine all Cas methods.
The code basically is:
    from typing import Iterator, Union

    from cassis import Cas

    # Token and NamedEntity are the classes generated from the type system;
    # load_dkpro_core_typesystem is assumed to be defined elsewhere.

    class DKProAccessor:
        def __init__(self, cas: Cas):
            self._cas = cas

        def __getattr__(self, name: str):
            """If the method is not found on the accessor, then we just delegate to the cas."""
            return getattr(self._cas, name)

        def get_tokens(self) -> Iterator[Token]:
            return self._cas.select("de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token")

        def get_named_entities(self) -> Iterator[NamedEntity]:
            return self._cas.select("de.tudarmstadt.ukp.dkpro.core.api.ner.type.NamedEntity")

    def build_dkpro_cas() -> Union[Cas, DKProAccessor]:
        cas = Cas(typesystem=load_dkpro_core_typesystem())
        dkpro = DKProAccessor(cas)
        return dkpro
I can write a decorator for __init__ and __getattr__ so that these are added automatically to extensions.
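Such a decorator might look roughly like the sketch below. The name cas_extension and the stand-in Cas class are assumptions made for illustration; the real implementation may differ.

```python
def cas_extension(cls):
    """Class decorator adding the wrapping __init__ and delegating __getattr__."""

    def __init__(self, cas):
        self._cas = cas

    def __getattr__(self, name):
        # Only called when normal lookup fails. Guard against recursion
        # in case _cas itself has not been set yet.
        if name == "_cas":
            raise AttributeError(name)
        return getattr(self._cas, name)

    cls.__init__ = __init__
    cls.__getattr__ = __getattr__
    return cls


class Cas:
    """Stand-in CAS (not the cassis class)."""

    def __init__(self):
        self.sofa_string = "Hello world"
        self._annotations = {"Token": ["Hello", "world"]}

    def select(self, type_name):
        return list(self._annotations.get(type_name, []))


@cas_extension
class DKProAccessor:
    def get_tokens(self):
        return self._cas.select("Token")


dkpro = DKProAccessor(Cas())
tokens = dkpro.get_tokens()     # defined on the extension
text = dkpro.sofa_string        # not on the extension, delegated to the CAS
```

An extension author then only writes the convenience methods; construction and delegation come from the decorator.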
from dkpro-cassis.
I do not know whether I want to keep the type hints for the extension methods, but I like this way of defining extensions.
from dkpro-cassis.
So the IDE dynamically evaluates the DKProAccessor to discover the fields?
from dkpro-cassis.
We tell the IDE that build_dkpro_cas can either return a cas or an accessor.
from dkpro-cassis.
How does the IDE know that e.g. Token has the field form? I don't see anything in your code that would do that.
from dkpro-cassis.
I generate Python code for the type descriptions from the XML. If you have a fixed type system, then you can do that and check the generated Python code into your source control. I will push the code for that later; this issue should maybe focus on the extension only.
from dkpro-cassis.
Generating classes from the type system description - so a "pycasgen" - an equivalent of the "jcasgen" we have in Java which generates Java classes from the type system. Why not? :)
I think such a "pycasgen" script could be part of cassis and projects like DKPro Core or cTAKES could pre-generate the classes and push them to pypi as separate packages. WDYT?
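To make the "pycasgen" idea concrete, here is a small sketch of such a generator. It is not the script mentioned above: generate_classes and the example type system are made up for illustration, and the output is deliberately minimal (one class per type with annotated feature names).

```python
import xml.etree.ElementTree as ET

# Namespace used by UIMA type system descriptor files.
NS = {"uima": "http://uima.apache.org/resourceSpecifier"}

EXAMPLE_XML = """<typeSystemDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <types>
    <typeDescription>
      <name>example.type.Token</name>
      <supertypeName>uima.tcas.Annotation</supertypeName>
      <features>
        <featureDescription>
          <name>form</name>
          <rangeTypeName>uima.cas.String</rangeTypeName>
        </featureDescription>
      </features>
    </typeDescription>
  </types>
</typeSystemDescription>"""


def generate_classes(typesystem_xml: str) -> str:
    """Return Python source with one class per type in the descriptor."""
    root = ET.fromstring(typesystem_xml)
    lines = []
    for type_el in root.iterfind(".//uima:typeDescription", NS):
        full_name = type_el.findtext("uima:name", namespaces=NS)
        class_name = full_name.rsplit(".", 1)[-1]
        lines.append(f"class {class_name}:")
        lines.append(f'    TYPE_NAME = "{full_name}"')
        features = type_el.iterfind("uima:features/uima:featureDescription", NS)
        for feat_el in features:
            feat = feat_el.findtext("uima:name", namespaces=NS)
            range_type = feat_el.findtext("uima:rangeTypeName", namespaces=NS)
            # Annotation only, so IDEs can see the feature name.
            lines.append(f"    {feat}: object  # range: {range_type}")
        lines.append("")
    return "\n".join(lines)


source = generate_classes(EXAMPLE_XML)
print(source)
```

The generated source could be written to a module and checked into source control, or published as a separate package per type system version. Supertypes, name clashes between packages and mapping range types to Python types are left out of this sketch.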
from dkpro-cassis.
We can do that. My question right now is where to put the extensions. I would like to have them in cassis itself, as they are related to CAS/XMI stuff. Also, I sometimes need them for my own code and don't want to install pydkpro just for the extensions and types.
from dkpro-cassis.
If by extensions you mean e.g. the generated types - I think these should be released separately and with the same version numbers as the corresponding DKPro Core / cTAKES / etc versions. They do not follow the same release cycle as cassis.
from dkpro-cassis.
I mean the dkpro/ctakes accessor and util functions that were requested.
from dkpro-cassis.
@zesch @aggarwalpiush WDYT? Type-system-specific accessors and Python classes generated from type systems should probably be kept together and have a release cycle mirroring the release cycle of the type system they mirror. Have them as a separate project under DKPro already now (which I think would be nice since we could already make use of them in INCEpTION)? Have them with your pipelining code later?
from dkpro-cassis.
Not sure I really understand the implications. Whatever works best on your side.
from dkpro-cassis.
I would create a new repository and Python package, dkpro-typeshed, where we add the extension methods and generated types to get a nice API. This would then only depend on cassis. pydkpro can then use it to make its API nicer. We use a separate package in order to track the DKPro version and its respective new/different types.
from dkpro-cassis.
Sounds good
from dkpro-cassis.
We have various DKPro projects and they all have different release cycles. I think the type system is generated for a particular version of a particular project. Thus having a single repo where all generated types are located doesn't seem sensible to me. We would always have to release all types at the same time and it would be impossible for users to choose a version combination they would care for. I think having a type companion repo for each DKPro project would make sense, e.g. dkpro-core-python-api and dkpro-keyphrases-python-api etc.
from dkpro-cassis.
This sounds like a lot of work and a maintenance nightmare. Right now it also works without it (it is type-unsafe in the same way that the raw Java CAS interface has no type safety or type information). So I would just add the accessor, which returns the right FeatureStructures but gives no IDE support, i.e. changing
    def get_tokens(self) -> Iterator[Token]:
        return self._cas.select("de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token")

to

    def get_tokens(self) -> Iterator[FeatureStructure]:
        return self._cas.select("de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token")

as a first step.
from dkpro-cassis.
"This sounds like a lot of work and maintenance nightmare"
What's a maintenance nightmare?
from dkpro-cassis.
Having a repo for each would mean to set up many repositories and pypi packages. I would rather not do that right now.
from dkpro-cassis.
We only need to set up one for DKPro Core. I even thought about putting the generated Python classes directly into the "dkpro-core" repository along with all the Java stuff. But considering that the Python stuff is still "young", we might care to refine/release it more often than the Java stuff, so it might have a faster release cycle (e.g. "2.0.0, then 2.0.0.1 because we fix a bug in the code generator, then 2.0.0.2 because we fix another bug, etc.").
from dkpro-cassis.
I have added a repo here and you should all have proper access to it: https://github.com/dkpro/dkpro-core-python-api
We can still rename it / move around things later if we decide to change anything. For now, we'll only create types for DKPro Core anyway.
from dkpro-cassis.
I will come back to this after the ACL deadline.
from dkpro-cassis.