dkpro / dkpro-cassis
Home Page: https://pypi.org/project/dkpro-cassis/
License: Apache License 2.0
UIMA CAS processing library written in Python
Use some library to add semantic versioning numbers and easy version bumping.
I'm trying to read one of my XMI files that I've always been able to read successfully in the past using Java UimaFit tools. I run the following code:
ts_file = open('TypeSystem.xml', 'rb')
type_system = load_typesystem(ts_file)
xmi_file = open('/Users/Dima/Loyola/Data/Thyme/Xmi/ID169_clinic_496.xmi', 'rb')
cas = load_cas_from_xmi(xmi_file, typesystem=type_system)
and I get this error:
Traceback (most recent call last):
File "./cas.py", line 11, in <module>
cas = load_cas_from_xmi(xmi_file, typesystem=type_system)
File "/usr/local/lib/python3.6/site-packages/cassis/xmi.py", line 38, in load_cas_from_xmi
return deserializer.deserialize(source, typesystem=typesystem)
File "/usr/local/lib/python3.6/site-packages/cassis/xmi.py", line 71, in deserialize
sofa = self._parse_sofa(elem)
File "/usr/local/lib/python3.6/site-packages/cassis/xmi.py", line 188, in _parse_sofa
return Sofa(**attributes)
TypeError: __init__() got an unexpected keyword argument 'sofaURI'
Any thoughts what's causing it? Thank you very much in advance.
It looks like cassis is omitting existing XML namespace attributes. It is also rewriting the xmi:id attribute of pre-existing XML nodes. Is this the result of removing some form of redundant information, or is it a bug?
Take a look at the attached documents
Right now, all annotations are stored in a sorted list ordered by (begin, end). This enables us to use binary search when looking for covered annotations. There is one list for each annotation type. Some annotations do not have begin or end; these are currently hacked to be sortable by using (sys.maxsize, sys.maxsize) as a key. Only annotations that have (begin, end) should be put into sorted lists; the others should use an ordinary list. These should also be selectable with cas.select.
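The proposal above can be sketched roughly as follows. This is a minimal illustration using only the standard library; `FS` and `Index` are hypothetical stand-ins, not the cassis classes, and the sketch assumes `end >= begin >= 0` for every span.

```python
import bisect
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class FS:
    """Hypothetical stand-in for a feature structure (not the cassis class)."""
    name: str
    begin: Optional[int] = None
    end: Optional[int] = None


class Index:
    """Spans are kept sorted by (begin, end); everything else goes to a plain list."""

    def __init__(self) -> None:
        self._keys: List[Tuple[int, int]] = []  # sort keys, parallel to _spans
        self._spans: List[FS] = []              # annotations in key order
        self._others: List[FS] = []             # feature structures without offsets

    def add(self, fs: FS) -> None:
        if fs.begin is None or fs.end is None:
            self._others.append(fs)  # no sentinel key such as sys.maxsize needed
            return
        key = (fs.begin, fs.end)
        pos = bisect.bisect_right(self._keys, key)
        self._keys.insert(pos, key)
        self._spans.insert(pos, fs)

    def select_covered(self, begin: int, end: int) -> List[FS]:
        # Binary search for the first span that starts at or after `begin`.
        start = bisect.bisect_left(self._keys, (begin, 0))
        covered = []
        for fs in self._spans[start:]:
            if fs.begin > end:
                break  # sorted by begin, so nothing later can be covered
            if fs.end <= end:
                covered.append(fs)
        return covered

    def select_all(self) -> List[FS]:
        # Both kinds of feature structure remain selectable.
        return self._spans + self._others
```

Keeping a parallel list of keys is only one way to do the binary search; a sorted container with a key function would serve the same purpose.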
I run into a KeyError while loading a type system (see attachment) exported by IBM Watson Explorer:
Traceback (most recent call last):
File "C:/Users/x/PycharmProjects/dkpro-cassis/tests/load_wex_ts.py", line 12, in <module>
typeSystem = load_typesystem(ts_descriptor_file)
File "C:\Users\x\PycharmProjects\dkpro-cassis\cassis\typesystem.py", line 538, in load_typesystem
return deserializer.deserialize(source)
File "C:\Users\x\PycharmProjects\dkpro-cassis\cassis\typesystem.py", line 632, in deserialize
t = types[type_name]
KeyError: 'OpinionPhrase'
Process finished with exit code 1
This error occurs while processing a type named OpinionPhrase, which does not have a namespace.
It does not occur when I insert this code as lines 572 and 573:
if "." not in supertypeName:
    supertypeName = "uima.noNamespace." + supertypeName
However, this only means I get a second error later on. I don't know if it is related, but I describe it here anyway.
Traceback (most recent call last):
File "C:/Users/x/PycharmProjects/dkpro-cassis/tests/load_wex_ts.py", line 12, in <module>
typeSystem = load_typesystem(ts_descriptor_file)
File "C:\Users\x\PycharmProjects\dkpro-cassis\cassis\typesystem.py", line 538, in load_typesystem
return deserializer.deserialize(source)
File "C:\Users\x\PycharmProjects\dkpro-cassis\cassis\typesystem.py", line 641, in deserialize
t, name=f.name, rangeTypeName=f.rangeTypeName, elementType=f.elementType, description=f.description
File "C:\Users\x\PycharmProjects\dkpro-cassis\cassis\typesystem.py", line 480, in add_feature
type_.add_feature(feature)
File "C:\Users\x\PycharmProjects\dkpro-cassis\cassis\typesystem.py", line 242, in add_feature
self.__attrs_post_init__()
File "C:\Users\x\PycharmProjects\dkpro-cassis\cassis\typesystem.py", line 175, in __attrs_post_init__
self._constructor = attr.make_class(name, fields, bases=(FeatureStructure,), slots=True, eq=False, order=False)
File "C:\Users\x\Anaconda3\lib\site-packages\attr\_make.py", line 2129, in make_class
return _attrs(these=cls_dict, **attributes_arguments)(type_)
File "C:\Users\x\Anaconda3\lib\site-packages\attr\_make.py", line 1002, in wrap
builder.add_init()
File "C:\Users\x\Anaconda3\lib\site-packages\attr\_make.py", line 689, in add_init
self._is_exc,
File "C:\Users\x\Anaconda3\lib\site-packages\attr\_make.py", line 1351, in _make_init
bytecode = compile(script, unique_filename, "exec")
File "<attrs generated init cassis.typesystem.com_ibm_es_tt_DocumentMetaData-19>", line 1
SyntaxError: duplicate argument 'self' in function definition
Process finished with exit code 1
The script it is trying to compile looks like this:
def __init__(self, xmiID=attr_dict['xmiID'].default, crawlspaceId=attr_dict['crawlspaceId'].default, crawlerId=attr_dict['crawlerId'].default, dataSource=attr_dict['dataSource'].default, dataSourceName=attr_dict['dataSourceName'].default, docType=attr_dict['docType'].default, encoding=attr_dict['encoding'].default, date=attr_dict['date'].default, url=attr_dict['url'].default, httpcode=attr_dict['httpcode'].default, documentId=attr_dict['documentId'].default, title=attr_dict['title'].default, deleteDocument=attr_dict['deleteDocument'].default, volatileURL=attr_dict['volatileURL'].default, associations=attr_dict['associations'].default, tags=attr_dict['tags'].default, type=attr_dict['type'].default, self=attr_dict['self'].default):
    self.xmiID = xmiID
    self.crawlspaceId = crawlspaceId
    self.crawlerId = crawlerId
    self.dataSource = dataSource
    self.dataSourceName = dataSourceName
    self.docType = docType
    self.encoding = encoding
    self.date = date
    self.url = url
    self.httpcode = httpcode
    self.documentId = documentId
    self.title = title
    self.deleteDocument = deleteDocument
    self.volatileURL = volatileURL
    self.associations = associations
    self.tags = tags
    self.type = type
    self.self = self
I'm working with conda 4.7.12 and python 3.7.4 on Windows 10.
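The generated `__init__` above fails because a feature named `self` collides with the implicit first parameter. One possible workaround, sketched here as an assumption rather than as the fix cassis adopted, is to mangle such feature names before they become constructor parameters. `safe_attribute_name` and `RESERVED` are hypothetical names.

```python
import keyword

# Names that would collide with the implicit first parameter of __init__.
RESERVED = {"self"}


def safe_attribute_name(feature_name: str) -> str:
    """Return a name that is safe to use as an __init__ parameter.

    Hypothetical helper: appends an underscore to Python keywords and to `self`.
    """
    if keyword.iskeyword(feature_name) or feature_name in RESERVED:
        return feature_name + "_"
    return feature_name
```

A real implementation would also need to map the mangled name back to the original feature name on serialization.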
Hi,
We are trying to extract the tokens' text strings and POS tags from CAS objects. Also, different type systems return POS tags in different formats. @zesch Please correct me here if I am wrong.
It would be great to have some helper functions (some are shown in the examples below) that could cover these requests.
For example, for a given cas object:
cas.get_token_strings() or cas.select(TOKEN).as_text()
cas.select(TOKEN).get_pos_tags(format='ptb')
We hope to see these helper functions as part of this API.
Thanks!!
When I started this project, I thought everything in a CAS is an annotation. That is not the case; the basis should be the feature structure. In order to fix the naming, it is necessary to rename classes and arguments.
Is your feature request related to a problem? Please describe.
Right now, in order to get the covered text, one has to call cas.get_covered_text(annotation). That is not nice and different from Java.
Describe the solution you'd like
Use annotation.get_covered_text()
Additional context
If the annotation is a feature structure, i.e. does not cover text, an exception should be thrown.
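A minimal sketch of the requested behavior, assuming each annotation holds a reference to its sofa. `MiniSofa` and `MiniFeatureStructure` are hypothetical stand-ins, not the cassis classes, and the exception type is an arbitrary choice.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MiniSofa:
    """Hypothetical stand-in holding the document text."""
    sofaString: str


@dataclass
class MiniFeatureStructure:
    """Hypothetical stand-in; offsets are optional for plain feature structures."""
    sofa: Optional[MiniSofa] = None
    begin: Optional[int] = None
    end: Optional[int] = None

    def get_covered_text(self) -> str:
        # A feature structure without offsets does not cover any text.
        if self.sofa is None or self.begin is None or self.end is None:
            raise ValueError("Feature structure does not cover any text")
        return self.sofa.sofaString[self.begin:self.end]
```

For example, `MiniFeatureStructure(MiniSofa("Joe waited."), 0, 3).get_covered_text()` returns the covered substring, while calling it on a structure without offsets raises.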
Adhere to the Style Guide for Python Code
It would be nice to have version 0.1.0 released.
Is your feature request related to a problem? Please describe.
It would be nice to add more CAS select methods like CasUtil does.
Describe the solution you'd like
Describe the bug
We have an annotation type that has a feature structure that is referenced via a feature/attribute, but the feature structure is not being displayed in the loaded CAS object.
To Reproduce
I am able to query the annotation type just fine, but the feature structure does not show up when I do a load_cas_from_xmi.
The annotation type in question is textsem:MedicationMention (or ProcedureMention, etc.) and is linked to refsem:UmlsConcept via ontologyConceptArr.
I've tried:
print([x for x in cas.select('org.apache.ctakes.typesystem.type.refsem.UmlsConcept')])
print([x for x in cas.select_all()])
and
view = cas.get_view('_InitialView')
print([x for x in view.select_all()])
all to no avail.
Expected behavior
Expect to see refsem:UmlsConcept in the displayed CAS.
Please complete the following information:
Additional context
Attached are XMI and Typesystem example:
ctakes_out.zip
Deserializing XMI that contains myType, whose supertypeName is uima.cas.TOP, raises
AttributeError: myType object has no attribute 'sofa'
with the traceback ending in:
...
/cassis/cas.py", line 169, in add_annotation
annotation.sofa = self.get_sofa().xmiID
(Please note that processing xmi in java is not an issue)
I got the BioMedICUS annotations working with cassis.
Now, on to MetaMap. These have an odd way of representing arrays. For example, StringArray is represented as
<cas:StringArray xmi:id="1373">
<elements>CSP</elements>
<elements>LCH</elements>
<elements>LCH_NW</elements>
<elements>LNC</elements>
<elements>MSH</elements>
<elements>MTH</elements>
<elements>NCI</elements>
<elements>NCI_CDISC</elements>
<elements>NCI_FDA</elements>
<elements>NCI_NICHD</elements>
<elements>SNMI</elements>
<elements>SNOMEDCT_US</elements>
</cas:StringArray>
When processing the XMI, this throws the error that:
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
<ipython-input-1-94fafef6801b> in <module>()
20 # 528715 737-v1
21 with open(dir_test + '737-v1.txt.xmi', 'rb') as f:
---> 22 cas = load_cas_from_xmi(f, typesystem=typesystem)
23 #print(cas.sofas)
24 #print(dir(cas))
/anaconda3/lib/python3.7/site-packages/cassis/xmi.py in load_cas_from_xmi(source, typesystem)
36 return deserializer.deserialize(BytesIO(source.encode("utf-8")), typesystem=typesystem)
37 else:
---> 38 return deserializer.deserialize(source, typesystem=typesystem)
39
40
/anaconda3/lib/python3.7/site-packages/cassis/xmi.py in deserialize(self, source, typesystem)
71 views[proto_view.sofa] = proto_view
72 else:
---> 73 annotation = self._parse_annotation(typesystem, elem)
74 annotations[annotation.xmiID] = annotation
75
/anaconda3/lib/python3.7/site-packages/cassis/xmi.py in _parse_annotation(self, typesystem, elem)
120 typename = elem.tag[9:].replace("/", ".").replace("ecore}", "")
121
--> 122 AnnotationType = typesystem.get_type(typename)
123 attributes = dict(elem.attrib)
124
/anaconda3/lib/python3.7/site-packages/cassis/typesystem.py in get_type(self, typename)
336 return self._types[typename]
337 else:
--> 338 raise Exception("Type with name [{0}] not found!".format(typename))
339
340 def get_types(self) -> Iterator[Type]:
Exception: Type with name [] not found!
Any suggestions on how to deal with this? Again, the NLP annotator is not in our control, so the data are what they are, for good or bad.
Thanks!
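The traceback is consistent with the nested `<elements>` children being visited as if they were feature structures: `elem.tag[9:]` applied to the un-namespaced tag `elements` yields an empty type name. The following is a minimal sketch, using the standard library's ElementTree rather than cassis internals, of reading such multi-valued children from the enclosing array element only. The sample document and the function name are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened sample in the style of the MetaMap output shown above.
XMI = """<xmi:XMI xmlns:xmi="http://www.omg.org/XMI" xmlns:cas="http:///uima/cas.ecore">
  <cas:StringArray xmi:id="1373">
    <elements>CSP</elements>
    <elements>LCH</elements>
  </cas:StringArray>
</xmi:XMI>"""

XMI_ID = "{http://www.omg.org/XMI}id"
STRING_ARRAY = "{http:///uima/cas.ecore}StringArray"


def read_string_arrays(xml_text):
    """Map xmi:id to the list of strings found in child <elements> nodes."""
    root = ET.fromstring(xml_text)
    arrays = {}
    # Iterate over direct children of the root only, so that nested
    # <elements> nodes are never mistaken for feature structures.
    for elem in root:
        if elem.tag == STRING_ARRAY:
            values = [child.text or "" for child in elem.findall("elements")]
            arrays[int(elem.get(XMI_ID))] = values
    return arrays
```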
Removing uima.tcas.DocumentAnnotation from the _types property in the TypeSystem class in typesystem.py breaks load_cas_from_xmi.
This code was removed in a previous commit:
# DocumentAnnotation
t = self.create_type(name='uima.tcas.DocumentAnnotation', supertypeName='uima.tcas.Annotation')
self.add_feature(t, name='language', rangeTypeName='uima.cas.String')
Spoke too soon. Both cTAKES and Clamp have circular annotations (or linked, if you want to get pedantic).
Getting the following error:
---------------------------------------------------------------------------
CircularDependencyError Traceback (most recent call last)
<ipython-input-4-ff102eb39ae5> in <module>()
7 dir_test = '/Users/gms/development/nlp/nlpie/data/medinfo/ctakes_out/'
8 with open(dir_test + 'TypeSystem.xml', 'rb') as f:
----> 9 typesystem = load_typesystem(f)
10 #print(dir(typesystem))
11 #print(typesystem.get_types())
/anaconda3/lib/python3.7/site-packages/cassis/typesystem.py in load_typesystem(source)
425 return deserializer.deserialize(BytesIO(source.encode("utf-8")))
426 else:
--> 427 return deserializer.deserialize(source)
428
429
/anaconda3/lib/python3.7/site-packages/cassis/typesystem.py in deserialize(self, source)
482
483 ts = TypeSystem()
--> 484 for type_name in toposort_flatten(dependencies, sort=False):
485 # No need to create predefined types
486 if type_name in PREDEFINED_TYPES:
/anaconda3/lib/python3.7/site-packages/toposort.py in toposort_flatten(data, sort)
88
89 result = []
---> 90 for d in toposort(data):
91 result.extend((sorted if sort else list)(d))
92 return result
/anaconda3/lib/python3.7/site-packages/toposort.py in toposort(data)
79 if item not in ordered}
80 if len(data) != 0:
---> 81 raise CircularDependencyError(data)
82
83
CircularDependencyError: Circular dependencies exist among these items: {'org.apache.ctakes.typesystem.type.textsem.SemanticArgument':{'org.apache.ctakes.typesystem.type.textsem.SemanticRoleRelation'}, 'org.apache.ctakes.typesystem.type.textsem.SemanticRoleRelation':{'org.apache.ctakes.typesystem.type.textsem.SemanticArgument'}}
These are definitely legit, and can be viewed in the UIMA CVD.
Any ideas on how to implement this? I will do the work. Again, I could just remove these annotations, especially since we are not interested in them, but when dealing with 50 million x 2 files, that is a lot of processing time.
Worst case, we can modify the AE config file to not print these as output.
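The cycle reported above runs through feature range types, not through inheritance. One possible approach, sketched here under that assumption and not necessarily what cassis does, is to order types by supertype only and attach features in a second pass. The type descriptions are simplified and hypothetical; `graphlib` requires Python 3.9 or newer.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical, simplified descriptions: a supertype plus feature range types.
TYPES = {
    "SemanticArgument": {
        "supertype": "Annotation",
        "features": {"relation": "SemanticRoleRelation"},
    },
    "SemanticRoleRelation": {
        "supertype": "Relation",
        "features": {"argument": "SemanticArgument"},
    },
    "Relation": {"supertype": "TOP", "features": {}},
}
PREDEFINED = {"TOP", "Annotation"}


def creation_order(types):
    """Order types so that every supertype comes before its subtypes."""
    # Only inheritance edges go into the graph; feature ranges are ignored here.
    graph = {name: {d["supertype"]} - PREDEFINED for name, d in types.items()}
    return list(TopologicalSorter(graph).static_order())


def build(types):
    """Pass 1 creates all types, pass 2 attaches features that may refer to any type."""
    created = {name: {} for name in creation_order(types)}
    for name, description in types.items():
        for feature, range_type in description["features"].items():
            if range_type not in created and range_type not in PREDEFINED:
                raise KeyError("Unknown range type [{0}]".format(range_type))
            created[name][feature] = range_type
    return created
```

Because mutually referencing features never enter the dependency graph, the two types from the error message no longer block each other.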
It would be nice to be able to initialize a CAS with a certain type system, e.g.
from somewhere import DKProCoreTypeSystem
from cassis import Cas
cas = Cas(DKProCoreTypeSystem())
In order to verify that the DKPro type system is supported by cassis, the type system XML and an example CAS should be added to the tests.
Right now, variable annotations are used for attribute typing, which is a feature of Python >= 3.6:
@attr.s(slots=True)
class Sofa:
    """Each CAS has one or more Subject of Analysis (SofA)"""
    sofaNum: int = attr.ib() #: The sofaNum
This is the difference between Python 3.5 and Python 3.6 support in cassis from what I saw. It is possible to annotate these with comments instead of annotation syntax. This should be done in order to support Python 3.5.
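For illustration, comment-style type hints (PEP 484 type comments) look like the following. The sketch deliberately uses a plain class instead of attrs to stay dependency-free, so it shows only the comment syntax, not how the cassis class would be written.

```python
class Sofa(object):
    """Each CAS has one or more Subject of Analysis (SofA)"""

    def __init__(self, sofaNum):
        # type: (int) -> None
        self.sofaNum = sofaNum  # type: int
```

Type checkers read these comments, while older interpreters simply ignore them.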
Right now, the CAS is not aware of the type system. That means, it is possible to add types to a CAS that are not in the type system. It should be the default to validate types to allow only adding types that are in the type system.
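A minimal sketch of such validation, with checking enabled by default and an opt-out flag. `MiniTypeSystem`, `MiniCas`, and the `validate` parameter are hypothetical names chosen for this example, not the cassis API.

```python
class TypeNotFoundError(Exception):
    """Raised when a type is used that the type system does not define."""


class MiniTypeSystem:
    """Hypothetical stand-in that only knows type names."""

    def __init__(self, type_names):
        self._types = set(type_names)

    def contains_type(self, name):
        return name in self._types


class MiniCas:
    """Hypothetical CAS that is aware of its type system."""

    def __init__(self, typesystem, validate=True):
        self._typesystem = typesystem
        self._validate = validate
        self.annotations = []

    def add(self, type_name, **features):
        # Reject types that are unknown to the type system unless validation is off.
        if self._validate and not self._typesystem.contains_type(type_name):
            raise TypeNotFoundError("Type [{0}] is not in the type system".format(type_name))
        self.annotations.append((type_name, features))
```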
In order to monitor test coverage, add Coverage integration.
Describe the bug
Loading a type system in which a child type redefines a parent type throws an error.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
No error
Types in a UIMA type system can inherit from other types. This can be specified via the supertypeName tag. This inheritance needs to be supported, so that a type inherits features from its supertype.
Some CAS handling frameworks redefine predefined types like DocumentAnnotation. In order to not throw the error that the type already exists, it would be nice to check whether they are exactly the same type and then ignore it. Also, it would be nice to redefine this type on export again to keep the type system as is.
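The "ignore identical redefinitions" idea could look roughly like this. Types are modeled as plain dictionaries and `register_type` is a hypothetical helper; the DocumentAnnotation definition mirrors the one quoted earlier on this page.

```python
def register_type(types, name, supertype, features):
    """Add a type; return False if an identical definition already exists.

    Hypothetical helper: `types` maps a type name to its definition.
    """
    new_definition = {"supertype": supertype, "features": dict(features)}
    existing = types.get(name)
    if existing is None:
        types[name] = new_definition
        return True
    if existing == new_definition:
        return False  # exact redefinition of a known type, safe to ignore
    raise ValueError("Type with name [{0}] already exists with a different definition".format(name))


# Predefined type, as in the snippet quoted earlier on this page.
types = {
    "uima.tcas.DocumentAnnotation": {
        "supertype": "uima.tcas.Annotation",
        "features": {"language": "uima.cas.String"},
    }
}
```

To reproduce the redefinition on export, a loader could additionally remember which predefined names were redefined in the input.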
Add two documentation links, one for master and one for the last release.
Right now, DocumentMetaData is treated as an ordinary annotation. It would be better to parse it specially right into fields that can be queried from the CAS. This has several reasons:
DocumentMetaData when used as an ordinary annotation.
Describe the bug
I exported a simple test case from WebAnno 3.6 (no custom tags, only the default ones from WebAnno), which I read using cassis. Trying to serialise it again results in an error.
To Reproduce
from cassis import *
path = "path/to/test_case/"
with open(path + '/TypeSystem.xml', 'rb') as f:
    typesystem = load_typesystem(f)
with open(path + '/test_file.xmi', 'rb') as f:
    cas = load_cas_from_xmi(f, typesystem=typesystem)
serialised = cas.to_xmi(path=None)
If possible, attach a minimal file that triggers the error.
webanno_test_case.zip
Expected behavior
Serialised file returned without errors.
Error message
File "<ipython-input-5-cf0db0819547>", line 12, in <module>
serialised= cas.to_xmi(path=None)
File "..../dkpro-cassis/cassis/cas.py", line 333, in to_xmi
serializer.serialize(sink, self, pretty_print=pretty_print)
File ".../dkpro-cassis/cassis/xmi.py", line 243, in serialize
for annotation in sorted(self._find_all_fs(cas), key=lambda a: a.xmiID):
File ".../dkpro-cassis/cassis/xmi.py", line 276, in _find_all_fs
for referenced_fs in lst:
TypeError: 'NoneType' object is not iterable
Please complete the following information:
Additional context
Similar situation to #64 but with a different test case.
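The error message shows the serializer iterating over a feature value that is None. A minimal sketch of a traversal that tolerates unset features follows; `MiniFS` and `find_all_fs` are hypothetical and do not reproduce the cassis implementation.

```python
class MiniFS:
    """Hypothetical feature structure with an id and arbitrary features."""

    def __init__(self, xmiID, **features):
        self.xmiID = xmiID
        self.features = features


def find_all_fs(roots):
    """Collect every feature structure reachable from `roots`."""
    seen = {}
    stack = list(roots)
    while stack:
        fs = stack.pop()
        if fs.xmiID in seen:
            continue
        seen[fs.xmiID] = fs
        for value in fs.features.values():
            if value is None:
                continue  # unset feature: nothing to follow
            if isinstance(value, MiniFS):
                stack.append(value)
            elif isinstance(value, list):
                # Lists may themselves contain unset entries.
                stack.extend(v for v in value if isinstance(v, MiniFS))
    return sorted(seen.values(), key=lambda fs: fs.xmiID)
```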
If I load the attached XMI via load_cas_from_xmi and then use to_xmi without changing anything in the CAS, INCEpTION cannot read the new XMI file.
To Reproduce
Steps to reproduce the behavior:
with open('TypeSystem.xml', 'rb') as f:
    typesystem = load_typesystem(f)
with open('Bundestag_08-7.txt.xmi', 'rb') as f:
    cas = load_cas_from_xmi(f, typesystem=typesystem)
cas.to_xmi('test_cas.xmi', pretty_print=True)
Then import test_cas.xmi into INCEpTION
Expected behavior
I would expect INCEpTION to import the resulting XMI file, just as it did with the original XMI file.
Error message
Error while uploading document test_cas.xmi: XCASParsingException: Error parsing XMI-CAS from source at line -1, column -1: xmi id 1459 is referenced but not defined.
Please complete the following information:
Is your feature request related to a problem? Please describe.
Some UIMA CAS handling frameworks define type systems with types that have no namespace, e.g.
<typeDescription>
<name>ArtifactID</name>
<description>A unique artifact identifier.</description>
<supertypeName>uima.cas.TOP</supertypeName>
<features>
<featureDescription>
<name>artifactID</name>
<description>A unique identification string for the artifact. This should be the file name for files,
or the unique identifier used in a database if the document source is a database
collection reader.</description>
<rangeTypeName>uima.cas.String</rangeTypeName>
</featureDescription>
</features>
</typeDescription>
Describe the solution you'd like
When importing these types, prefix their name with uima.noNamespace.
This is apparently how it was done in the past.
Describe alternatives you've considered
Not supporting this, but it does not cost much and supports old systems.
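The requested behavior amounts to a small name normalization step on import, along these lines. `qualify_type_name` is a hypothetical helper name; the prefix is the one given in the request.

```python
NO_NAMESPACE_PREFIX = "uima.noNamespace."


def qualify_type_name(name: str) -> str:
    """Prefix type names that have no namespace; leave qualified names unchanged."""
    if "." not in name:
        return NO_NAMESPACE_PREFIX + name
    return name
```

The same normalization would have to be applied to supertype names and feature range types so that lookups stay consistent.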
I would simplify the current package import structure (in the __init__.py):
cassis contains all things like Cas, TypeSystem, Sofa, Feature, Type
cassis.xmi contains loader/saver for XMI
cassis.typesystem contains loader/saver for type systems
When doing asserts in unit tests with pytest, the order should be assert actual == expected. Right now, the order used is assert expected == actual, which needs to be changed in order to make the error message correct.
Release 0.2.0 containing several improvements and bug fixes.
Is your feature request related to a problem? Please describe.
It is annoying to get the first covering annotation e.g. of a sentence.
Describe the solution you'd like
Add a function to select the first covering annotation.
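A minimal sketch of such a function, assuming "first" means first in (begin, end) order. `Span` and `select_first_covering` are hypothetical names, and the linear scan is chosen for clarity, not speed.

```python
from typing import Iterable, NamedTuple, Optional


class Span(NamedTuple):
    """Hypothetical annotation: a type name plus character offsets."""
    type: str
    begin: int
    end: int


def select_first_covering(annotations: Iterable[Span], type_name: str, target: Span) -> Optional[Span]:
    """Return the first annotation of `type_name` that fully covers `target`, or None."""
    candidates = [
        a for a in annotations
        if a.type == type_name and a.begin <= target.begin and target.end <= a.end
    ]
    return min(candidates, key=lambda a: (a.begin, a.end), default=None)
```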
It would be nice to add the test files that UIMA uses to the cassis test suite.
Tests can be found under e.g. https://github.com/apache/uima-uimaj/blob/a68b57a8eaca2ee5392903d8f2f86bca5df08054/uimaj-core/src/test/java/org/apache/uima/cas/impl/XmiCasDeserializerTest.java
Files can e.g. be found under
https://github.com/apache/uima-uimaj/tree/a68b57a8eaca2ee5392903d8f2f86bca5df08054/uimaj-core/src/test/resources/ExampleCas
Hello, cassis team,
I need to parse, in Python, an XMI file produced by WebAnno.
I'm totally new to UIMA and the XMI format, so I was lucky to discover cassis. Thank you, developers.
Unfortunately, the code snippet provided (https://github.com/dkpro/dkpro-cassis#selecting-annotations) doesn't work well for the attached file below.
When I load the file with the 'load_cas_from_xmi()' method, the 'Cas' class initializes 'Cas._sofas' with a dictionary key of 1.
However, the right dictionary key for the attached example is 12.
How can I make the 'Cas' class use the right SOFA key of 12?
Also, is retrieving the POS tags that are tagged on tokens done the same way as 'selecting annotations'?
If cassis is not the best option for parsing the attached file, please recommend other alternatives.
Thank you so much.
webanno629617483446633113export.zip
Is your feature request related to a problem? Please describe.
Advertise this somewhere, because one of the things people tend to find limiting about the Java implementation is that the type system cannot be changed after the CAS has been initialized; in particular, no types can be added and no features can be added to types.
Describe the solution you'd like
Write it in the readme.
Additional context
UIMA Java cannot do it and it is annoying.
I have created a new virtualenv and installed cassis using the following commands
>>>:~/ukp/test_new_cas$ virtualenv env --python=python3.6
Running virtualenv with interpreter /usr/bin/python3.6
Using base prefix '/usr'
New python executable in /home/weaponxiii/ukp/test_new_cas/env/bin/python3.6
Also creating executable in /home/weaponxiii/ukp/test_new_cas/env/bin/python
Installing setuptools, pip, wheel...done.
>>>:~/ukp/test_new_cas$ source env/bin/activate
(env) >>>:~/ukp/test_new_cas$ python -m pip install git+https://github.com/dkpro/dkpro-cassis
Collecting git+https://github.com/dkpro/dkpro-cassis
Cloning https://github.com/dkpro/dkpro-cassis to /tmp/pip-req-build-a3u2ig4q
Collecting lxml (from cassis==0.0.1)
Using cached https://files.pythonhosted.org/packages/03/a4/9eea8035fc7c7670e5eab97f34ff2ef0ddd78a491bf96df5accedb0e63f5/lxml-4.2.5-cp36-cp36m-manylinux1_x86_64.whl
Collecting attrs (from cassis==0.0.1)
Using cached https://files.pythonhosted.org/packages/3a/e1/5f9023cc983f1a628a8c2fd051ad19e76ff7b142a0faf329336f9a62a514/attrs-18.2.0-py2.py3-none-any.whl
Collecting sortedcontainers (from cassis==0.0.1)
Using cached https://files.pythonhosted.org/packages/be/e3/a065de5fdd5849450a8a16a52a96c8db5f498f245e7eda06cc6725d04b80/sortedcontainers-2.0.5-py2.py3-none-any.whl
Collecting toposort (from cassis==0.0.1)
Using cached https://files.pythonhosted.org/packages/e9/8a/321cd8ea5f4a22a06e3ba30ef31ec33bea11a3443eeb1d89807640ee6ed4/toposort-1.5-py2.py3-none-any.whl
Building wheels for collected packages: cassis
Running setup.py bdist_wheel for cassis ... done
Stored in directory: /tmp/pip-ephem-wheel-cache-j0ltlofg/wheels/c2/d1/6c/2b0e0b6c03d1d64f9782fb6e3b174abaaed41ff3c62b2cafa5
Successfully built cassis
Installing collected packages: lxml, attrs, sortedcontainers, toposort, cassis
Successfully installed attrs-18.2.0 cassis-0.0.1 lxml-4.2.5 sortedcontainers-2.0.5 toposort-1.5
and pip show also recognizes it
(env) >>>:~/ukp/test_new_cas$ python -m pip show cassis
Name: cassis
Version: 0.0.1
Summary: UIMA CAS processing library in Python
Home-page: https://github.com/dkpro/dkpro-cassis
Author: Jan-Christoph Klie
Author-email: [email protected]
License: Apache License 2.0
Location: /home/weaponxiii/ukp/test_new_cas/env/lib/python3.6/site-packages
Requires: attrs, lxml, toposort, sortedcontainers
Required-by:
But for some reason it can not be imported
(env) >>>:~/ukp/test_new_cas$ python
Python 3.6.6 (default, Sep 12 2018, 18:26:19)
[GCC 8.0.1 20180414 (experimental) [trunk revision 259383]] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import cassis
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'cassis'
>>> import imp
>>> imp.find_module('cassis')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/weaponxiii/ukp/test_new_cas/env/lib/python3.6/imp.py", line 297, in find_module
raise ImportError(_ERR_MSG.format(name), name=name)
ImportError: No module named 'cassis'
>>> imp.find_module('json')
(None, '/usr/lib/python3.6/json', ('', '', 5))
>>>
I have tested the same procedure on this project to install and use other packages (e.g. h5py) successfully, but so far no luck with cassis.
Describe the bug
xmiID is getting changed somewhere along the way.
To Reproduce
Add the following line to the end of _parse_annotations:
if attributes["xmiID"] == 52613:
    print("TEST:", attributes)
With included XMI and Typesystem files, run the following:
with open(dir_test + 'TypeSystem.xml', 'rb') as f:
    typesystem = load_typesystem(f)
fname = '737-v1.txt.xmi'
with open(dir_test + fname, 'rb') as f:
    cas = load_cas_from_xmi(f, typesystem=typesystem)
view = cas.get_view('Analysis')
print([x for x in view.select("biomedicus.v2.UmlsConcept")])
You will see:
TEST: {'sofa': 18, 'begin': 10, 'end': 14, 'sui': 'S0628538', 'cui': 'C2740799', 'tui': 'T121', 'source': 'MTHSPL', 'confidence': '1.0', 'xmiID': 52613}
[biomedicus_v2_UmlsConcept(xmiID=9826, sui='S0628538', cui='C2740799', tui='T121', source='MTHSPL', confidence='1.0', begin=10, end=14, sofa=6, type='biomedicus.v2.UmlsConcept')...]
The xmiIDs are not the same (they should be, since they are the first set of annotations for this type).
Expected behavior
Expect the output xmiID to not change.
I suspect this has something to do with the AnnotationBase class definition, but have not had a chance to examine this further.
Please complete the following information:
Is your feature request related to a problem? Please describe.
Right now, when loading a typesystem, all Python classes representing types in the type system are created, whether they will be used later or not. This is slow for large type systems.
Describe the solution you'd like
Make it possible to create a type class when creating an annotation of that type for the first time.
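One way to defer class creation is a small cache that builds a class the first time it is requested. This is a sketch under the assumption that a type description is just a list of feature names; `LazyTypeRegistry` is a hypothetical name.

```python
class LazyTypeRegistry:
    """Creates the Python class for a type on first use and caches it."""

    def __init__(self, descriptions):
        self._descriptions = descriptions  # type name -> list of feature names
        self._classes = {}

    def created_count(self):
        return len(self._classes)

    def get_class(self, type_name):
        cls = self._classes.get(type_name)
        if cls is None:
            fields = self._descriptions[type_name]
            # Build the class only now; loading the descriptions stays cheap.
            cls = type(type_name.replace(".", "_"), (object,), {"__slots__": tuple(fields)})
            self._classes[type_name] = cls
        return cls
```

Loading a large type system then only stores descriptions, and the cost of building a class is paid once per type that is actually used.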
Describe the bug
Creating cas without typesystem shares the default typesystem
To Reproduce
Steps to reproduce the behavior:
cas1 = Cas(); cas2 = Cas()
Expected behavior
Both CASes should have their own type system.
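The report does not say what causes the sharing. One well-known Python pitfall that produces exactly this symptom is a default argument that is evaluated once, at definition time. The sketch below illustrates that pitfall and the usual fix; all class names are hypothetical, and whether this is the actual cause in cassis is an assumption.

```python
class MiniTypeSystem:
    """Hypothetical stand-in for a mutable type system."""

    def __init__(self):
        self.types = set()


class SharedCas:
    # Pitfall: the default object is created once, when the function is defined,
    # so every instance created without an argument shares it.
    def __init__(self, typesystem=MiniTypeSystem()):
        self.typesystem = typesystem


class IsolatedCas:
    # Fix: use None as the default and create a fresh object per instance.
    def __init__(self, typesystem=None):
        self.typesystem = typesystem if typesystem is not None else MiniTypeSystem()
```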
In order to auto generate documentation on every commit, readthedocs can be used as a free service.
Right now, the namespaces are repeated for every element, which heavily bloats the resulting XMI string, e.g.
<cas:View xmlns:cas="http:///uima/cas.ecore" xmlns:chunk="http:///de/tudarmstadt/ukp/dkpro/core/api/syntax/type/chunk.ecore" xmlns:constituent="http:///de/tudarmstadt/ukp/dkpro/core/api/syntax/type/constituent.ecore" xmlns:dependency="http:///de/tudarmstadt/ukp/dkpro/core/api/syntax/type/dependency.ecore" xmlns:morph="http:///de/tudarmstadt/ukp/dkpro/core/api/lexmorph/type/morph.ecore" xmlns:pos="http:///de/tudarmstadt/ukp/dkpro/core/api/lexmorph/type/pos.ecore" xmlns:tcas="http:///uima/tcas.ecore" xmlns:tweet="http:///de/tudarmstadt/ukp/dkpro/core/api/lexmorph/type/pos/tweet.ecore" xmlns:type="http:///de/tudarmstadt/ukp/dkpro/core/api/coref/type.ecore" xmlns:type2="http:///de/tudarmstadt/ukp/dkpro/core/api/metadata/type.ecore" xmlns:type3="http:///de/tudarmstadt/ukp/dkpro/core/api/ner/type.ecore" xmlns:type4="http:///de/tudarmstadt/ukp/dkpro/core/api/segmentation/type.ecore" xmlns:type5="http:///de/tudarmstadt/ukp/dkpro/core/api/semantics/type.ecore" xmlns:type6="http:///de/tudarmstadt/ukp/dkpro/core/api/syntax/type.ecore" xmlns:type7="http:///de/tudarmstadt/ukp/dkpro/core/api/transform/type.ecore" xmlns:type8="http:///de/tudarmstadt/ukp/inception/api/kb/type.ecore" xmlns:type9="http:///de/tudarmstadt/ukp/inception/recommendation/api/type.ecore" xmlns:xmi="http://www.omg.org/XMI" sofa="12" members="1 42"/>
The goal is to have the declarations just at the top; the elements themselves should use prefixes and not define new namespaces.
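As a rough illustration of the target output, the standard library's ElementTree writes every namespace declaration once on the root element when the tree is serialized as a whole. The sketch uses ElementTree rather than the XML library cassis depends on, and the element and attribute values are made up for the example.

```python
import xml.etree.ElementTree as ET

NS = {
    "xmi": "http://www.omg.org/XMI",
    "cas": "http:///uima/cas.ecore",
    "tcas": "http:///uima/tcas.ecore",
}


def build_xmi() -> str:
    """Serialize a tiny XMI-like tree with namespaces declared on the root only."""
    for prefix, uri in NS.items():
        ET.register_namespace(prefix, uri)  # choose readable prefixes
    root = ET.Element("{%s}XMI" % NS["xmi"])
    ET.SubElement(
        root,
        "{%s}DocumentAnnotation" % NS["tcas"],
        {"{%s}id" % NS["xmi"]: "1", "sofa": "12"},
    )
    ET.SubElement(root, "{%s}View" % NS["cas"], {"sofa": "12", "members": "1"})
    # Serializing from the root lets the children reuse the root's declarations.
    return ET.tostring(root, encoding="unicode")
```

The bloat described above typically appears when elements are serialized one by one, because each fragment then has to carry its own declarations.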
Is your feature request related to a problem? Please describe.
Right now, the dkpro-xml type system cannot be parsed, as there is a circular dependency somewhere:
Circular dependencies exist among these items:
{
"org.dkpro.core.api.xml.type.XmlDocument": {"org.dkpro.core.api.xml.type.XmlElement"},
"org.dkpro.core.api.xml.type.XmlElement":{"org.dkpro.core.api.xml.type.XmlNode"},
"org.dkpro.core.api.xml.type.XmlNode":{"org.dkpro.core.api.xml.type.XmlElement"},
"org.dkpro.core.api.xml.type.XmlTextNode":{"org.dkpro.core.api.xml.type.XmlNode"}
}
Describe the solution you'd like
Build it so it works.
In order to let the tests run for every commit, Travis can be added as a free Continuous Integration service.
Describe the bug
First of all, thank you very much for creating this useful tool.
Unfortunately, I stumble when I try to load the cTAKES (https://ctakes.apache.org) type system:
I get the following error:
ValueError: Feature with name [value] already exists in [org.apache.ctakes.typesystem.type.refsem.LabReferenceRange]!
To Reproduce
Save https://svn.apache.org/repos/asf/ctakes/trunk/ctakes-type-system/src/main/resources/org/apache/ctakes/typesystem/types/ somewhere on your system and then do:
f = open('TypeSystem.xml', 'rb')
load_typesystem(f)
Expected behavior
I'm not totally sure why this error happens -- it seems like this file (TypeSystem.xml) is a legitimate type system that's been used by the cTAKES community for many years.
Please complete the following information:
I'm doing this on a Mac
Thank you very much in advance for looking into this.
Is your feature request related to a problem? Please describe.
Right now, when referencing a type in a CAS that is not in the type system, an error is thrown. It would be nice to specify that it should just be ignored.
It has to be determined whether this should be for types or also for features.
Describe the solution you'd like
Add a lenient flag like uimaj does.
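A minimal sketch of what a lenient mode could mean for types, leaving open the question of features raised above. `resolve_types` and its parameters are hypothetical names, not an existing cassis or uimaj API.

```python
import warnings


def resolve_types(type_names, known_types, lenient=False):
    """Return the names found in `known_types`.

    Unknown names raise KeyError in strict mode and are skipped with a
    warning in lenient mode.
    """
    resolved = []
    for name in type_names:
        if name in known_types:
            resolved.append(name)
        elif lenient:
            warnings.warn("Ignoring unknown type [{0}]".format(name))
        else:
            raise KeyError("Type with name [{0}] not found!".format(name))
    return resolved
```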
Describe the bug
Had to modify this chunk of code in CasXmiDeserializer (@ line 151 in xmi.py) to make it partially work:
# Resolve references
# NB: need to ensure `value` is of type str, otherwise error is thrown
if typesystem.is_collection(feature.rangeTypeName) and isinstance(value, str):
    # A collection of references is a list of integers separated
    # by single spaces, e.g. <foo:bar elements="1 2 3 42" />
    targets = []
    for ref in value.split():
        target_id = int(ref)
        target = feature_structures[target_id]
        targets.append(target)
    setattr(fs, feature_name, targets)
else:
    # NB: need to ensure `value` is of type int, otherwise error when casting as int
    if isinstance(value, int):
        target_id = int(value)
        target = feature_structures[target_id]
        setattr(fs, feature_name, target)
    # NB: `value` can be list of strings
    if not isinstance(value, str):
        #print(" ".join(value))
        pass
To Reproduce
Run current version of cassis on attached files with this script (attached test.zip to reproduce errors):
from cassis import *
dir_test = '<Path to files>'
with open(dir_test + 'TypeSystem.xml', 'rb') as f:
    typesystem = load_typesystem(f)
# attachment: test.zip (https://github.com/dkpro/dkpro-cassis/files/3655535/test.zip)
# add missing types
t = typesystem.create_type(name='org.apache.uima.examples.SourceDocumentInformation', supertypeName='uima.tcas.Annotation')
typesystem.add_feature(t, name='uri', rangeTypeName='uima.cas.String')
typesystem.add_feature(t, name="offsetInSource", rangeTypeName="uima.cas.Integer")
typesystem.add_feature(t, name="documentSize", rangeTypeName="uima.cas.Integer")
typesystem.add_feature(t, name="lastSegment", rangeTypeName="uima.cas.Integer")
fname = '737-v1.txt.xmi'
with open(dir_test + fname, 'rb') as f:
    cas = load_cas_from_xmi(f, typesystem=typesystem)
view = cas.get_view('_InitialView')
print([x for x in view.select("org.metamap.uima.ts.Candidate")])
Expected behavior
-> Expect arrays of all types to be loaded, including if it's a list.
-> Not sure how to deal with this chunk:
else:
    if isinstance(value, int):
        target_id = int(value)
        target = feature_structures[target_id]
        setattr(fs, feature_name, target)
Please complete the following information: