dkpro / dkpro-cassis

UIMA CAS processing library written in Python

Home Page: https://pypi.org/project/dkpro-cassis/

License: Apache License 2.0

Python 99.89% Makefile 0.11%

cas uima python nlp annotation

dkpro-cassis's Issues

Add version numbers

Use some library to add semantic versioning numbers and easy version bumping.

Error when reading XMI files with sofaURI

I'm trying to read one of my XMI files that I've always been able to read successfully in the past using Java UimaFit tools. I run the following code:

  from cassis import *

  with open('TypeSystem.xml', 'rb') as ts_file:
      type_system = load_typesystem(ts_file)

  with open('/Users/Dima/Loyola/Data/Thyme/Xmi/ID169_clinic_496.xmi', 'rb') as xmi_file:
      cas = load_cas_from_xmi(xmi_file, typesystem=type_system)

and I get this error:

Traceback (most recent call last):
  File "./cas.py", line 11, in <module>
    cas = load_cas_from_xmi(xmi_file, typesystem=type_system)
  File "/usr/local/lib/python3.6/site-packages/cassis/xmi.py", line 38, in load_cas_from_xmi
    return deserializer.deserialize(source, typesystem=typesystem)
  File "/usr/local/lib/python3.6/site-packages/cassis/xmi.py", line 71, in deserialize
    sofa = self._parse_sofa(elem)
  File "/usr/local/lib/python3.6/site-packages/cassis/xmi.py", line 188, in _parse_sofa
    return Sofa(**attributes)
TypeError: __init__() got an unexpected keyword argument 'sofaURI'

Any thoughts what's causing it? Thank you very much in advance.

Unexpected modification of existing document?

It looks like cassis omits existing XML namespace attributes and rewrites the xmi:id attribute of preexisting XML nodes. Does this result from removing some form of redundant information, or is it a bug?

Take a look at the attached documents:

prediction.zip

Fix sorting of annotations without begin and end

Right now, all annotations are stored in a sorted list ordered by (begin, end), with one list per annotation type. This enables binary search when looking for covered annotations. Some annotations have no begin or end; these are currently hacked to be sortable by using (sys.maxsize, sys.maxsize) as a key. Instead, only annotations that have (begin, end) should go into sorted lists, and the others should use an ordinary list. Both kinds should still be selectable with cas.select.
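
A minimal sketch of the proposed split, assuming feature structures are plain dicts and using the standard-library bisect module as a stand-in for whatever sorted container cassis actually uses. The class and method names here (AnnotationIndex, select_covered) are hypothetical and only illustrate the idea:

```python
import bisect


class AnnotationIndex:
    """Toy index: positional annotations stay sorted, the rest go in a plain list."""

    def __init__(self):
        self._keys = []      # sorted (begin, end) keys, parallel to _sorted
        self._sorted = []    # annotations that have both begin and end
        self._unsorted = []  # feature structures without offsets

    def add(self, fs):
        begin, end = fs.get("begin"), fs.get("end")
        if begin is None or end is None:
            # No sentinel key needed: these never enter the sorted list.
            self._unsorted.append(fs)
            return
        key = (begin, end)
        pos = bisect.bisect_right(self._keys, key)
        self._keys.insert(pos, key)
        self._sorted.insert(pos, fs)

    def select(self):
        """Return everything: sorted annotations first, then the rest."""
        return self._sorted + self._unsorted

    def select_covered(self, begin, end):
        """Binary-search to the first candidate, then scan until an annotation starts after end."""
        # (begin,) sorts before every (begin, x), so this finds the first key with key[0] >= begin.
        start = bisect.bisect_left(self._keys, (begin,))
        result = []
        for key, fs in zip(self._keys[start:], self._sorted[start:]):
            if key[0] > end:
                break
            if key[1] <= end:
                result.append(fs)
        return result
```

With this layout the (sys.maxsize, sys.maxsize) key is no longer needed, because feature structures without offsets are never compared.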

Add basic sofa/view support

Annotations should be added to a CAS/sofa, instead of specifying sofa as a parameter.
Allow getting sofas from a CAS
Documentation

Error if supertype has no namespace

I run into a KeyError while loading a type system (see attachment) exported by IBM Watson Explorer:

Traceback (most recent call last):
  File "C:/Users/x/PycharmProjects/dkpro-cassis/tests/load_wex_ts.py", line 12, in <module>
    typeSystem = load_typesystem(ts_descriptor_file)
  File "C:\Users\x\PycharmProjects\dkpro-cassis\cassis\typesystem.py", line 538, in load_typesystem
    return deserializer.deserialize(source)
  File "C:\Users\x\PycharmProjects\dkpro-cassis\cassis\typesystem.py", line 632, in deserialize
    t = types[type_name]
KeyError: 'OpinionPhrase'

Process finished with exit code 1

This error occurs while processing a type named OpinionPhrase, which does not have a namespace.
It does not occur when I insert this code as lines 572 and 573:

            if "." not in supertypeName:
                supertypeName = "uima.noNamespace." + supertypeName

However, this only means I get a second error later on. I don't know if it is related, but I describe it here anyway.

Traceback (most recent call last):
  File "C:/Users/x/PycharmProjects/dkpro-cassis/tests/load_wex_ts.py", line 12, in <module>
    typeSystem = load_typesystem(ts_descriptor_file)
  File "C:\Users\x\PycharmProjects\dkpro-cassis\cassis\typesystem.py", line 538, in load_typesystem
    return deserializer.deserialize(source)
  File "C:\Users\x\PycharmProjects\dkpro-cassis\cassis\typesystem.py", line 641, in deserialize
    t, name=f.name, rangeTypeName=f.rangeTypeName, elementType=f.elementType, description=f.description
  File "C:\Users\x\PycharmProjects\dkpro-cassis\cassis\typesystem.py", line 480, in add_feature
    type_.add_feature(feature)
  File "C:\Users\x\PycharmProjects\dkpro-cassis\cassis\typesystem.py", line 242, in add_feature
    self.__attrs_post_init__()
  File "C:\Users\x\PycharmProjects\dkpro-cassis\cassis\typesystem.py", line 175, in __attrs_post_init__
    self._constructor = attr.make_class(name, fields, bases=(FeatureStructure,), slots=True, eq=False, order=False)
  File "C:\Users\x\Anaconda3\lib\site-packages\attr\_make.py", line 2129, in make_class
    return _attrs(these=cls_dict, **attributes_arguments)(type_)
  File "C:\Users\x\Anaconda3\lib\site-packages\attr\_make.py", line 1002, in wrap
    builder.add_init()
  File "C:\Users\x\Anaconda3\lib\site-packages\attr\_make.py", line 689, in add_init
    self._is_exc,
  File "C:\Users\x\Anaconda3\lib\site-packages\attr\_make.py", line 1351, in _make_init
    bytecode = compile(script, unique_filename, "exec")
  File "<attrs generated init cassis.typesystem.com_ibm_es_tt_DocumentMetaData-19>", line 1
SyntaxError: duplicate argument 'self' in function definition

Process finished with exit code 1

The script it is trying to compile looks like this:

def __init__(self, xmiID=attr_dict['xmiID'].default, crawlspaceId=attr_dict['crawlspaceId'].default, crawlerId=attr_dict['crawlerId'].default, dataSource=attr_dict['dataSource'].default, dataSourceName=attr_dict['dataSourceName'].default, docType=attr_dict['docType'].default, encoding=attr_dict['encoding'].default, date=attr_dict['date'].default, url=attr_dict['url'].default, httpcode=attr_dict['httpcode'].default, documentId=attr_dict['documentId'].default, title=attr_dict['title'].default, deleteDocument=attr_dict['deleteDocument'].default, volatileURL=attr_dict['volatileURL'].default, associations=attr_dict['associations'].default, tags=attr_dict['tags'].default, type=attr_dict['type'].default, self=attr_dict['self'].default):
    self.xmiID = xmiID
    self.crawlspaceId = crawlspaceId
    self.crawlerId = crawlerId
    self.dataSource = dataSource
    self.dataSourceName = dataSourceName
    self.docType = docType
    self.encoding = encoding
    self.date = date
    self.url = url
    self.httpcode = httpcode
    self.documentId = documentId
    self.title = title
    self.deleteDocument = deleteDocument
    self.volatileURL = volatileURL
    self.associations = associations
    self.tags = tags
    self.type = type
    self.self = self

I'm working with conda 4.7.12 and python 3.7.4 on Windows 10.
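
The second error can be reproduced without cassis or attrs: a generated __init__ that takes every feature as a keyword argument cannot compile if one feature is itself called self. The helper below is a hypothetical check written for illustration, not part of either library:

```python
def init_signature_compiles(feature_names):
    """Return True if a generated __init__ taking these features compiles."""
    params = ", ".join("{}=None".format(name) for name in feature_names)
    script = "def __init__(self, {}):\n    pass\n".format(params)
    try:
        compile(script, "<generated init>", "exec")
    except SyntaxError:
        # Raised for a feature named 'self' (duplicate argument) or for a Python keyword.
        return False
    return True
```

For the type system above, init_signature_compiles(["type", "self"]) returns False, which suggests the feature named self in com.ibm.es.tt.DocumentMetaData is what triggers the SyntaxError.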

exported_typesystem.zip

High level helper functions for extracting typesystem independent annotations

Hi,

We are trying to extract the tokens' text strings and POS tags from CAS objects. Different type systems also return POS tags in different formats. @zesch, please correct me if I am wrong.
It would be great to have some helper functions (some are shown in the examples below) that could cover these requests.

For example for the given cas object:

To return all the token texts:

 cas.get_token_strings() or cas.select(TOKEN).as_text()

To return all pos tags with ptb pos tag format:

cas.select(TOKEN).get_pos_tags(format='ptb')

We hope to see these helper functions as part of this API.

Thanks!!

Rename `Annotation` to `FeatureStructure`

When I started this project, I thought everything in a CAS was an annotation. That is not the case: the base concept is the feature structure. To fix the naming, class and argument names need to be renamed.

Add get_covered_text method to Annotations

Is your feature request related to a problem? Please describe.
Right now, to get the covered text, one has to call cas.get_covered_text(annotation). That is awkward and differs from how it works in Java.

Describe the solution you'd like
Use annotation.get_covered_text()

Additional context
If the annotation is a feature structure, i.e. does not cover text, an exception should be thrown.
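
A rough sketch of the requested behaviour, using stand-in classes rather than the real cassis ones. The attribute name sofa_string and the choice of ValueError are assumptions made for this example:

```python
class Sofa:
    """Stand-in for a subject of analysis holding the document text."""

    def __init__(self, sofa_string):
        self.sofa_string = sofa_string


class FeatureStructure:
    """Stand-in for a feature structure that may or may not carry offsets."""

    def __init__(self, sofa, begin=None, end=None):
        self.sofa = sofa
        self.begin = begin
        self.end = end

    def get_covered_text(self):
        # A feature structure without offsets covers no text, so fail loudly.
        if self.begin is None or self.end is None:
            raise ValueError("Feature structure has no begin/end and covers no text")
        return self.sofa.sofa_string[self.begin:self.end]
```

Usage would then be annotation.get_covered_text(), with the annotation holding a reference to its sofa instead of the caller passing the annotation to the CAS.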

Get PEP8 compliant

Adhere to the Style Guide for Python Code

Update README with package information

Add badges
Specify Python version in setup.py

Release 0.1.1 and publish it on pypi

It would be nice to have a version 0.1.0 released.

Add more cas select methods

Is your feature request related to a problem? Please describe.
It would be nice to add more cas select methods like CasUtil does.

Describe the solution you'd like

Support feature structures referencing other feature structures

Describe the bug

We have an annotation type that has a feature structure that is referenced via a feature/attribute, but the feature structure is not being displayed in the loaded CAS object.

To Reproduce

I am able to query the annotation type just fine, but the feature structure does not show up when I call load_cas_from_xmi.

Annotation type in question is: textsem:MedicationMention (or ProcedureMention, etc) and is linked to refsem:UmlsConcept via ontologyConceptArr.

I've tried:

print([x for x in cas.select('org.apache.ctakes.typesystem.type.refsem.UmlsConcept')])
print([x for x in cas.select_all()])

and

view = cas.get_view('_InitialView')
print([x for x in view.select_all()])

all to no avail.

Expected behavior

Expect to see refsem:UmlsConcept in the displayed CAS.

Please complete the following information:

Version: [0.2.0.dev0]
OS: OS X

Additional context

Attached are XMI and Typesystem example:
ctakes_out.zip

AttributeError while deserializing XMI: object has no attribute 'sofa'

Deserializing an XMI file that contains myType, whose supertypeName is uima.cas.TOP, raises the following error:

AttributeError: myType object has no attribute 'sofa'
...
/cassis/cas.py", line 169, in add_annotation
annotation.sofa = self.get_sofa().xmiID

(Note that processing the same XMI in Java is not an issue.)
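
The failure is consistent with a type derived directly from uima.cas.TOP having no sofa feature, so an unconditional assignment fails. Below is a small sketch of one possible guard, using stand-in classes with __slots__. The names Top, Annotation and attach_to_sofa are illustrative and not the cassis implementation:

```python
class Top:
    """Stand-in for a type derived from uima.cas.TOP: no sofa, begin, or end."""
    __slots__ = ("xmiID",)

    def __init__(self, xmiID):
        self.xmiID = xmiID


class Annotation(Top):
    """Stand-in for uima.tcas.Annotation, which does carry a sofa reference."""
    __slots__ = ("sofa", "begin", "end")

    def __init__(self, xmiID, begin, end):
        super().__init__(xmiID)
        self.sofa = None
        self.begin = begin
        self.end = end


def attach_to_sofa(fs, sofa_id):
    """Set fs.sofa only when the type declares that feature; return whether it was set."""
    declares_sofa = any("sofa" in getattr(cls, "__slots__", ()) for cls in type(fs).__mro__)
    if declares_sofa:
        fs.sofa = sofa_id
    return declares_sofa
```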

Implementing support for array features

I got the BioMedICUS annotations working with cassis.

Now, on to MetaMap. These have an odd way of representing arrays. For example, StringArray is represented as

        <cas:StringArray xmi:id="1373">
            <elements>CSP</elements>
            <elements>LCH</elements>
            <elements>LCH_NW</elements>
            <elements>LNC</elements>
            <elements>MSH</elements>
            <elements>MTH</elements>
            <elements>NCI</elements>
            <elements>NCI_CDISC</elements>
            <elements>NCI_FDA</elements>
            <elements>NCI_NICHD</elements>
            <elements>SNMI</elements>
            <elements>SNOMEDCT_US</elements>
        </cas:StringArray>

When processing the XMI, this throws the error that:

---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-1-94fafef6801b> in <module>()
     20 # 528715 737-v1
     21 with open(dir_test + '737-v1.txt.xmi', 'rb') as f:
---> 22     cas = load_cas_from_xmi(f, typesystem=typesystem)
     23     #print(cas.sofas)
     24     #print(dir(cas))

/anaconda3/lib/python3.7/site-packages/cassis/xmi.py in load_cas_from_xmi(source, typesystem)
     36         return deserializer.deserialize(BytesIO(source.encode("utf-8")), typesystem=typesystem)
     37     else:
---> 38         return deserializer.deserialize(source, typesystem=typesystem)
     39 
     40 

/anaconda3/lib/python3.7/site-packages/cassis/xmi.py in deserialize(self, source, typesystem)
     71                 views[proto_view.sofa] = proto_view
     72             else:
---> 73                 annotation = self._parse_annotation(typesystem, elem)
     74                 annotations[annotation.xmiID] = annotation
     75 

/anaconda3/lib/python3.7/site-packages/cassis/xmi.py in _parse_annotation(self, typesystem, elem)
    120         typename = elem.tag[9:].replace("/", ".").replace("ecore}", "")
    121 
--> 122         AnnotationType = typesystem.get_type(typename)
    123         attributes = dict(elem.attrib)
    124 

/anaconda3/lib/python3.7/site-packages/cassis/typesystem.py in get_type(self, typename)
    336             return self._types[typename]
    337         else:
--> 338             raise Exception("Type with name [{0}] not found!".format(typename))
    339 
    340     def get_types(self) -> Iterator[Type]:

Exception: Type with name [] not found!

Any suggestions on how to deal with this? Again, the NLP annotator is not in our control, so the data are what they are, for good or bad.
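
One observation that may help: the traceback shows the type name being derived with elem.tag[9:], and the un-namespaced child tag elements has only eight characters, so the slice is empty. That matches the "Type with name [] not found!" message. The sketch below reads such arrays with the standard-library ElementTree as a stand-in for the parser cassis uses. The function name and the shortened sample document are made up for the example:

```python
import xml.etree.ElementTree as ET

XMI = """<xmi:XMI xmlns:xmi="http://www.omg.org/XMI" xmlns:cas="http:///uima/cas.ecore">
  <cas:StringArray xmi:id="1373">
    <elements>CSP</elements>
    <elements>LCH</elements>
    <elements>MSH</elements>
  </cas:StringArray>
</xmi:XMI>"""

XMI_ID = "{http://www.omg.org/XMI}id"


def parse_string_arrays(xmi_text):
    """Map xmi:id -> list of strings for every cas:StringArray element."""
    root = ET.fromstring(xmi_text)
    arrays = {}
    for elem in root.iter("{http:///uima/cas.ecore}StringArray"):
        # The <elements> children are un-namespaced, so match the bare tag
        # and treat them as array members rather than as feature structures.
        arrays[int(elem.get(XMI_ID))] = [child.text for child in elem.findall("elements")]
    return arrays
```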

Thanks!

Broken annotation type

Removing uima.tcas.DocumentAnnotation from _types property in the TypeSystem class in typesystem.py breaks load_cas_from_xmi.

This code was removed in a previous commit:

# DocumentAnnotation
t = self.create_type(name='uima.tcas.DocumentAnnotation', supertypeName='uima.tcas.Annotation')
self.add_feature(t, name='language', rangeTypeName='uima.cas.String')

Readd DocumentAnnotation as predefined type #41
Support types with no namespace #43

Circular annotations

Spoke too soon. cTAKES and Clamp both have circular annotations (or linked, if you want to get pedantic).

Getting the following error:

---------------------------------------------------------------------------
CircularDependencyError                   Traceback (most recent call last)
<ipython-input-4-ff102eb39ae5> in <module>()
      7 dir_test = '/Users/gms/development/nlp/nlpie/data/medinfo/ctakes_out/'
      8 with open(dir_test + 'TypeSystem.xml', 'rb') as f:
----> 9     typesystem = load_typesystem(f)
     10     #print(dir(typesystem))
     11     #print(typesystem.get_types())

/anaconda3/lib/python3.7/site-packages/cassis/typesystem.py in load_typesystem(source)
    425         return deserializer.deserialize(BytesIO(source.encode("utf-8")))
    426     else:
--> 427         return deserializer.deserialize(source)
    428 
    429 

/anaconda3/lib/python3.7/site-packages/cassis/typesystem.py in deserialize(self, source)
    482 
    483         ts = TypeSystem()
--> 484         for type_name in toposort_flatten(dependencies, sort=False):
    485             # No need to create predefined types
    486             if type_name in PREDEFINED_TYPES:

/anaconda3/lib/python3.7/site-packages/toposort.py in toposort_flatten(data, sort)
     88 
     89     result = []
---> 90     for d in toposort(data):
     91         result.extend((sorted if sort else list)(d))
     92     return result

/anaconda3/lib/python3.7/site-packages/toposort.py in toposort(data)
     79                     if item not in ordered}
     80     if len(data) != 0:
---> 81         raise CircularDependencyError(data)
     82 
     83 

CircularDependencyError: Circular dependencies exist among these items: {'org.apache.ctakes.typesystem.type.textsem.SemanticArgument':{'org.apache.ctakes.typesystem.type.textsem.SemanticRoleRelation'}, 'org.apache.ctakes.typesystem.type.textsem.SemanticRoleRelation':{'org.apache.ctakes.typesystem.type.textsem.SemanticArgument'}}

These are definitely legit, and can be viewed in the UIMA CVD.

If you have any ideas on how to implement this, I will do the work. Again, I could just remove these annotations, especially since we are not interested in them, but when dealing with 50 million x 2 files, that is a lot of processing time.

Worst case, we can modify the AE config file to not print these as output.
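
One possible approach, sketched under assumptions: sort only along supertype edges and attach features in a second pass, so that feature ranges may refer to each other freely. The type descriptions below are simplified and hypothetical (the supertypes are invented for the example), and graphlib from the standard library (Python 3.9+) stands in for the toposort package:

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical, simplified descriptions: name -> (supertype, {feature: range type}).
DESCRIPTIONS = {
    "textsem.SemanticArgument": ("uima.tcas.Annotation", {"relation": "textsem.SemanticRoleRelation"}),
    "textsem.SemanticRoleRelation": ("uima.cas.TOP", {"argument": "textsem.SemanticArgument"}),
}


def build_types(descriptions):
    """Create types in supertype order, then attach features in a second pass."""
    # Only supertype edges go into the sort; feature ranges may form cycles freely.
    graph = {name: {supertype} for name, (supertype, _) in descriptions.items()}
    types = {}
    for name in TopologicalSorter(graph).static_order():
        if name in descriptions:  # skip predefined supertypes
            types[name] = {"supertype": descriptions[name][0], "features": {}}
    # Second pass: every type now exists, so any range type can be referenced.
    for name, (_, features) in descriptions.items():
        for feature, range_type in features.items():
            types[name]["features"][feature] = range_type
    return types
```

A genuine cycle among supertypes would still raise an error, which seems desirable since such a type system would be invalid.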

Add support for DKPro type system and annotations

It would be nice to be able to initialize a CAS with a certain type system, e.g.

from somewhere import DKProCoreTypeSystem
from cassis import Cas

cas = Cas(DKProCoreTypeSystem())

In order to verify that the DKPro type system is supported by cassis, the type system XML and an example CAS should be added to the tests.

Allow Python 3.5

Right now, variable annotations are used for attribute typing, which requires Python >= 3.6:

@attr.s(slots=True)
class Sofa:
    """Each CAS has one or more Subject of Analysis (SofA)"""

    sofaNum: int = attr.ib()  #: The sofaNum

From what I saw, this is the only thing separating cassis from Python 3.5 support. The attributes can be annotated with type comments instead of annotation syntax, which should be done in order to support Python 3.5.

Take type system into account when adding annotations to CAS

Right now, the CAS is not aware of the type system, so it is possible to add annotations whose types are not in the type system. By default, the CAS should validate types and only accept annotations whose types are in the type system.
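
A toy illustration of validate-by-default. The Cas class below is a stand-in, and the lenient flag and TypeNotFoundError name are assumptions for the sketch, not existing cassis API:

```python
class TypeNotFoundError(Exception):
    """Raised when an annotation's type is missing from the type system."""


class Cas:
    """Toy CAS that knows the names of the types in its type system."""

    def __init__(self, known_types, lenient=False):
        self._known_types = set(known_types)
        self._lenient = lenient
        self.annotations = []

    def add_annotation(self, type_name, **features):
        # Validation is on by default; lenient=True restores the old behaviour.
        if not self._lenient and type_name not in self._known_types:
            raise TypeNotFoundError("Type [{}] is not in the type system".format(type_name))
        self.annotations.append((type_name, features))
```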

Add Coverage

In order to monitor test coverage, add Coverage integration.

Loading a type system in which a child type redefines a parent type throws.

Describe the bug
Loading a type system in which a child type redefines a parent type throws.

To Reproduce
Steps to reproduce the behavior:

Loading a type system in which a child type redefines a parent type.
See the error

Expected behavior
No error

Support type inheritance

Types in a UIMA type system can inherit from other types. This is specified via the supertypeName tag. This inheritance needs to be supported, so that a type inherits features from its supertype.

Handle redefining of predefined types

Some CAS handling frameworks redefine predefined types like DocumentAnnotation. Instead of throwing an error that the type already exists, it would be nice to check whether the redefinition is exactly the same type and then ignore it. It would also be nice to redefine this type again on export, to keep the type system as is.
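
The check could look roughly like this, with types modelled as plain dicts. The DocumentAnnotation definition follows the one quoted in the "Broken annotation type" issue; register_type and its return value are hypothetical:

```python
PREDEFINED = {
    "uima.tcas.DocumentAnnotation": {
        "supertype": "uima.tcas.Annotation",
        "features": {"language": "uima.cas.String"},
    },
}


def register_type(types, name, supertype, features):
    """Add a type; silently accept an exact redefinition of a predefined type."""
    candidate = {"supertype": supertype, "features": dict(features)}
    if name in types:
        if name in PREDEFINED and types[name] == candidate:
            return False  # identical redefinition, nothing to do
        raise ValueError("Type with name [{}] already exists!".format(name))
    types[name] = candidate
    return True
```

Remembering which predefined types were redefined (for example in a set) would allow the exporter to write them out again.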

Add different docs for last release and master

Add two documentation links, one for master and one for the last release.

Add proper DocumentMetaData support

Right now, DocumentMetaData is treated as an ordinary annotation. It would be better to parse it specially, right into fields that can be queried from the CAS. This has several reasons:

It should only exist once. Assuming it is always the first annotation makes implementation easier.
It is normally serialized as the first annotation. Adding extra handling makes it easier, because the other annotations are sorted by type name which is contradicted by DocumentMetaData when used as an ordinary annotation.
It is a predefined type, but sometimes not defined in the type system. Some extra care is needed to make sure it is only defined once but also serialized.

XMI serialising throws if CAS contains reference to empty array

Describe the bug
I exported a simple test case from WebAnno 3.6 (no custom tags, only the default ones from WebAnno), which I read using cassis. Trying to serialise it again results in an error.

To Reproduce

from cassis import *

path = "path/to/test_case/"

with open(path + '/TypeSystem.xml', 'rb') as f:
    typesystem = load_typesystem(f)

with open(path + '/test_file.xmi', 'rb') as f:
    cas = load_cas_from_xmi(f, typesystem=typesystem)

serialised = cas.to_xmi(path=None)

If possible, attach a minimal file that triggers the error.
webanno_test_case.zip

Expected behavior
Serialised file returned without errors.

Error message

  File "<ipython-input-5-cf0db0819547>", line 12, in <module>
    serialised= cas.to_xmi(path=None)
  File "..../dkpro-cassis/cassis/cas.py", line 333, in to_xmi
    serializer.serialize(sink, self, pretty_print=pretty_print)
  File ".../dkpro-cassis/cassis/xmi.py", line 243, in serialize
    for annotation in sorted(self._find_all_fs(cas), key=lambda a: a.xmiID):
  File ".../dkpro-cassis/cassis/xmi.py", line 276, in _find_all_fs
    for referenced_fs in lst:
TypeError: 'NoneType' object is not iterable

Please complete the following information:

Version: 0.2.1
OS: Ubuntu 18 and MacOS

Additional context
Similar situation to #64 but with a different test case.
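
Judging only from the traceback, the traversal iterates over a referenced list that is None. Below is a sketch of a traversal that tolerates this, with feature structures modelled as dicts and the reference lookup passed in as a function. Both the function name and its signature are made up for the example:

```python
def find_all_fs(roots, get_references):
    """Collect every feature structure reachable from roots, skipping empty references."""
    seen = {}
    stack = list(roots)
    while stack:
        fs = stack.pop()
        if id(fs) in seen:
            continue
        seen[id(fs)] = fs
        # An empty array may be deserialized as None, so fall back to an empty list.
        for referenced in get_references(fs) or []:
            if referenced is not None:
                stack.append(referenced)
    return list(seen.values())
```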

Reading xmi then writing again results in broken xmi file.

If I load the attached XMI via load_cas_from_xmi and then use to_xmi without changing anything in the CAS, INCEpTION cannot read the new XMI file.

Bundestag_08-7.zip

To Reproduce
Steps to reproduce the behavior:

with open('TypeSystem.xml', 'rb') as f:
    typesystem = load_typesystem(f)

with open('Bundestag_08-7.txt.xmi', 'rb') as f:
    cas = load_cas_from_xmi(f, typesystem=typesystem)

cas.to_xmi('test_cas.xmi', pretty_print=True)

Then import test_cas.xmi into INCEpTION

Expected behavior
I would expect INCEpTION to import the resulting XMI file, just as it did with the original XMI file.

Error message
Error while uploading document test_cas.xmi: XCASParsingException: Error parsing XMI-CAS from source at line -1, column -1: xmi id 1459 is referenced but not defined.

Please complete the following information:

Version: 0.2.0rc2
OS: Ubuntu

Add extensions for use cases from DKPro Core and cTAKES to the CAS interface

It would be nice to be able to initialize a CAS with a certain type system, e.g.

from somewhere import DKProCoreTypeSystem
from cassis import Cas

cas = Cas(DKProCoreTypeSystem())

Support types with no namespace

Is your feature request related to a problem? Please describe.
Some UIMA CAS handling frameworks define type systems with types that have no namespace, e.g.

    <typeDescription>
        <name>ArtifactID</name>
        <description>A unique artifact identifier.</description>
        <supertypeName>uima.cas.TOP</supertypeName>
        <features>
            <featureDescription>
                <name>artifactID</name>
                <description>A unique identification string for the artifact. This should be the file name for files,
        or the unique identifier used in a database if the document source is a database
        collection reader.</description>
                <rangeTypeName>uima.cas.String</rangeTypeName>
            </featureDescription>
        </features>
    </typeDescription>

Describe the solution you'd like
When importing these types, prefix their name with uima.noNamespace. This is apparently how it was done in the past.
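
As a small sketch, the prefixing could be a single helper applied to both the type name and the supertypeName while loading. The function name qualify is hypothetical:

```python
NO_NAMESPACE_PREFIX = "uima.noNamespace."


def qualify(type_name):
    """Prefix type names that have no namespace; leave qualified names untouched."""
    if "." not in type_name:
        return NO_NAMESPACE_PREFIX + type_name
    return type_name
```

Applying it to supertype names as well keeps lookups consistent when a namespaced type inherits from a type without a namespace.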

Describe alternatives you've considered
Not supporting this, but it does not cost much and supports old systems.

Improve package import structure

I would simplify the current package import structure (in the __init__.py):

cassis contains all things like Cas, TypeSystem, Sofa, Feature, Type
cassis.xmi contains loader/saver for XMI
cassis.typesystem contains loader/saver for type systems

Swap order of assert arguments in unit tests

When doing asserts in unit tests with pytest, the order should be assert actual == expected. Right now, the order used is assert expected == actual which needs to be changed in order to make the error message correct.

Release 0.2.0

Release 0.2.0 containing several improvements and bug fixes.

Add function to select first covering annotation

Is your feature request related to a problem? Please describe.
It is annoying to get the first covering annotation e.g. of a sentence.

Describe the solution you'd like
Add function to select first covering annotation

Add test files from the Apache UIMA Java SDK

It would be nice to add the test files that UIMA uses to the cassis test suite.
Tests can be found under e.g. https://github.com/apache/uima-uimaj/blob/a68b57a8eaca2ee5392903d8f2f86bca5df08054/uimaj-core/src/test/java/org/apache/uima/cas/impl/XmiCasDeserializerTest.java

Files can e.g. be found under
https://github.com/apache/uima-uimaj/tree/a68b57a8eaca2ee5392903d8f2f86bca5df08054/uimaj-core/src/test/resources/ExampleCas

Parsing Error with WebAnno UIMA XMI format

Hello, cassis team,

I need to parse, in Python, an XMI file produced by WebAnno.
I'm totally new to UIMA and the XMI format, so I was lucky to discover cassis. Thank you, developers.

Unfortunately, the code snippet provided (https://github.com/dkpro/dkpro-cassis#selecting-annotations) doesn't work well for the attached file below.

When I load the file with the 'load_cas_from_xmi()' method, the 'Cas' class initializes 'Cas._sofas' with a dictionary key of 1.
However, the right dictionary key for the attached example is 12.

How can I make the 'Cas' class use the right SOFA key of 12?

Also, is retrieving the POS tags that are tagged on tokens done the same way as in 'selecting annotations'?
If cassis is not the best option for parsing the attached file, please recommend other alternatives.

Thank you so much.
webanno629617483446633113export.zip

Document that you can change the typesystem after creating it

Is your feature request related to a problem? Please describe.
Advertise this somewhere, because one of the things people tend to find limiting about the Java implementation is that the type system cannot be changed after the CAS has been initialized; in particular, no types can be added and no features can be added to types.

Describe the solution you'd like
Write it in the readme.

Additional context
UIMA Java cannot do it and it is annoying.

Refactor Cas constructor

Compute namespaces instead of storing them
Do not take a list of annotations, but add a bulk add function
Do not take a sofa and view list, but add sofa support (see #5)

Bug: cassis cannot be loaded

I have created a new virtualenv and installed cassis using the following commands:

>>>:~/ukp/test_new_cas$ virtualenv env --python=python3.6 
Running virtualenv with interpreter /usr/bin/python3.6
Using base prefix '/usr'
New python executable in /home/weaponxiii/ukp/test_new_cas/env/bin/python3.6
Also creating executable in /home/weaponxiii/ukp/test_new_cas/env/bin/python
Installing setuptools, pip, wheel...done.

>>>:~/ukp/test_new_cas$ source env/bin/activate

(env) >>>:~/ukp/test_new_cas$ python -m pip install git+https://github.com/dkpro/dkpro-cassis
Collecting git+https://github.com/dkpro/dkpro-cassis
  Cloning https://github.com/dkpro/dkpro-cassis to /tmp/pip-req-build-a3u2ig4q
Collecting lxml (from cassis==0.0.1)
  Using cached https://files.pythonhosted.org/packages/03/a4/9eea8035fc7c7670e5eab97f34ff2ef0ddd78a491bf96df5accedb0e63f5/lxml-4.2.5-cp36-cp36m-manylinux1_x86_64.whl
Collecting attrs (from cassis==0.0.1)
  Using cached https://files.pythonhosted.org/packages/3a/e1/5f9023cc983f1a628a8c2fd051ad19e76ff7b142a0faf329336f9a62a514/attrs-18.2.0-py2.py3-none-any.whl
Collecting sortedcontainers (from cassis==0.0.1)
  Using cached https://files.pythonhosted.org/packages/be/e3/a065de5fdd5849450a8a16a52a96c8db5f498f245e7eda06cc6725d04b80/sortedcontainers-2.0.5-py2.py3-none-any.whl
Collecting toposort (from cassis==0.0.1)
  Using cached https://files.pythonhosted.org/packages/e9/8a/321cd8ea5f4a22a06e3ba30ef31ec33bea11a3443eeb1d89807640ee6ed4/toposort-1.5-py2.py3-none-any.whl
Building wheels for collected packages: cassis
  Running setup.py bdist_wheel for cassis ... done
  Stored in directory: /tmp/pip-ephem-wheel-cache-j0ltlofg/wheels/c2/d1/6c/2b0e0b6c03d1d64f9782fb6e3b174abaaed41ff3c62b2cafa5
Successfully built cassis
Installing collected packages: lxml, attrs, sortedcontainers, toposort, cassis
Successfully installed attrs-18.2.0 cassis-0.0.1 lxml-4.2.5 sortedcontainers-2.0.5 toposort-1.5

and pip show also recognizes it:

(env) >>>:~/ukp/test_new_cas$ python -m pip show cassis
Name: cassis
Version: 0.0.1
Summary: UIMA CAS processing library in Python
Home-page: https://github.com/dkpro/dkpro-cassis
Author: Jan-Christoph Klie
Author-email: [email protected]
License: Apache License 2.0
Location: /home/weaponxiii/ukp/test_new_cas/env/lib/python3.6/site-packages
Requires: attrs, lxml, toposort, sortedcontainers
Required-by:

But for some reason it cannot be imported:

(env) >>>:~/ukp/test_new_cas$ python
Python 3.6.6 (default, Sep 12 2018, 18:26:19) 
[GCC 8.0.1 20180414 (experimental) [trunk revision 259383]] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import cassis
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'cassis'
>>> import imp
>>> imp.find_module('cassis')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/weaponxiii/ukp/test_new_cas/env/lib/python3.6/imp.py", line 297, in find_module
    raise ImportError(_ERR_MSG.format(name), name=name)
ImportError: No module named 'cassis'
>>> imp.find_module('json')
(None, '/usr/lib/python3.6/json', ('', '', 5))
>>>

I have followed the same procedure in this project to install and use other packages (e.g. h5py) successfully, but so far no luck with cassis.

Inconsistent xmiID

Describe the bug

xmiID is getting changed somewhere along the way.

To Reproduce

Add the following line to the end of _parse_annotations:

if attributes["xmiID"] == 52613:
    print("TEST:", attributes)

With included XMI and Typesystem files, run the following:

with open(dir_test + 'TypeSystem.xml', 'rb') as f:
    typesystem = load_typesystem(f)
    
fname = '737-v1.txt.xmi'
with open(dir_test + fname, 'rb') as f:
    cas = load_cas_from_xmi(f, typesystem=typesystem)

view = cas.get_view('Analysis')

print([x for x in view.select("biomedicus.v2.UmlsConcept")])

You will see:

TEST: {'sofa': 18, 'begin': 10, 'end': 14, 'sui': 'S0628538', 'cui': 'C2740799', 'tui': 'T121', 'source': 'MTHSPL', 'confidence': '1.0', 'xmiID': 52613}
[biomedicus_v2_UmlsConcept(xmiID=9826, sui='S0628538', cui='C2740799', tui='T121', source='MTHSPL', confidence='1.0', begin=10, end=14, sofa=6, type='biomedicus.v2.UmlsConcept')...]

The xmiIDs are not the same (they should be, since they are the first set of annotations for this type).

Expected behavior

Expect the output xmiID to not change.

I suspect this has something to do with the AnnotationBase class definition, but have not had a chance to examine this further.

Please complete the following information:

Version: 0.2.0.dev0
OS: MacOS

Archive.zip

Make creating type classes lazy

Is your feature request related to a problem? Please describe.
Right now, when loading a typesystem, all Python classes representing types in the type system are created, whether they will be used later or not. This is slow for large type systems.

Describe the solution you'd like
Make it possible to create a type class when creating an annotation of that type for the first time.
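
One way to picture the lazy variant: keep only the feature names at load time and build (then cache) the class the first time it is requested. Everything in this sketch is a stand-in, including the LazyTypeSystem name and the use of type() with __slots__ in place of the attrs-generated classes:

```python
class LazyTypeSystem:
    """Toy type system that builds the Python class for a type on first use."""

    def __init__(self, feature_names_by_type):
        self._features = dict(feature_names_by_type)
        self._classes = {}      # cache: type name -> generated class
        self.classes_built = 0  # only here to make the laziness observable

    def get_class(self, type_name):
        if type_name not in self._classes:
            fields = tuple(self._features[type_name])
            class_name = type_name.replace(".", "_")
            self._classes[type_name] = type(class_name, (object,), {"__slots__": fields})
            self.classes_built += 1
        return self._classes[type_name]

    def create(self, type_name, **values):
        """Instantiate a type, generating its class only if it does not exist yet."""
        instance = self.get_class(type_name)()
        for name, value in values.items():
            setattr(instance, name, value)
        return instance
```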

Creating cas without typesystem shares the default typesystem

Describe the bug
Creating cas without typesystem shares the default typesystem

To Reproduce
Steps to reproduce the behavior:

Create two CASes with cas1 = Cas(); cas2 = Cas()
Add a new type to both
An error is thrown due to a duplicate type

Expected behavior
Each CAS has its own type system
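
The issue does not say where the sharing comes from. One common cause of this symptom in Python is a default argument that is evaluated once at function definition time, which is what the sketch below assumes. All classes here are simplified stand-ins:

```python
class TypeSystem:
    def __init__(self):
        self.types = set()

    def create_type(self, name):
        if name in self.types:
            raise ValueError("Type with name [{}] already exists!".format(name))
        self.types.add(name)


class SharedCas:
    # The default is evaluated once, so every SharedCas() gets the same TypeSystem.
    def __init__(self, typesystem=TypeSystem()):
        self.typesystem = typesystem


class Cas:
    # A None sentinel gives each instance its own fresh TypeSystem.
    def __init__(self, typesystem=None):
        self.typesystem = typesystem if typesystem is not None else TypeSystem()
```

With the sentinel version, adding the same type name to two separately created CASes no longer collides.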

Add Read The Docs

In order to auto generate documentation on every commit, readthedocs can be used as a free service.

Add sphinx to the project
Write documentation
Add integration for readthedocs

Fix CAS XMI namespace serialization

Right now, the namespaces are repeated for every element, which heavily bloats the resulting XMI string:

e.g.

<cas:View xmlns:cas="http:///uima/cas.ecore" xmlns:chunk="http:///de/tudarmstadt/ukp/dkpro/core/api/syntax/type/chunk.ecore" xmlns:constituent="http:///de/tudarmstadt/ukp/dkpro/core/api/syntax/type/constituent.ecore" xmlns:dependency="http:///de/tudarmstadt/ukp/dkpro/core/api/syntax/type/dependency.ecore" xmlns:morph="http:///de/tudarmstadt/ukp/dkpro/core/api/lexmorph/type/morph.ecore" xmlns:pos="http:///de/tudarmstadt/ukp/dkpro/core/api/lexmorph/type/pos.ecore" xmlns:tcas="http:///uima/tcas.ecore" xmlns:tweet="http:///de/tudarmstadt/ukp/dkpro/core/api/lexmorph/type/pos/tweet.ecore" xmlns:type="http:///de/tudarmstadt/ukp/dkpro/core/api/coref/type.ecore" xmlns:type2="http:///de/tudarmstadt/ukp/dkpro/core/api/metadata/type.ecore" xmlns:type3="http:///de/tudarmstadt/ukp/dkpro/core/api/ner/type.ecore" xmlns:type4="http:///de/tudarmstadt/ukp/dkpro/core/api/segmentation/type.ecore" xmlns:type5="http:///de/tudarmstadt/ukp/dkpro/core/api/semantics/type.ecore" xmlns:type6="http:///de/tudarmstadt/ukp/dkpro/core/api/syntax/type.ecore" xmlns:type7="http:///de/tudarmstadt/ukp/dkpro/core/api/transform/type.ecore" xmlns:type8="http:///de/tudarmstadt/ukp/inception/api/kb/type.ecore" xmlns:type9="http:///de/tudarmstadt/ukp/inception/recommendation/api/type.ecore" xmlns:xmi="http://www.omg.org/XMI" sofa="12" members="1 42"/>

The goal is to have the declarations just at the top; the elements themselves should use prefixes and not define new namespaces.
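
For illustration, here is how the target output can be produced with the standard-library ElementTree, which writes the namespace declarations once on the root element. This is a stand-in for the serializer cassis uses, and serialize_views is a made-up function:

```python
import xml.etree.ElementTree as ET

NAMESPACES = {
    "xmi": "http://www.omg.org/XMI",
    "cas": "http:///uima/cas.ecore",
}


def serialize_views(views):
    """Serialize (sofa, members) pairs with namespace declarations on the root only."""
    # register_namespace is global to ElementTree; it fixes the prefixes used on output.
    for prefix, uri in NAMESPACES.items():
        ET.register_namespace(prefix, uri)
    root = ET.Element("{%s}XMI" % NAMESPACES["xmi"])
    for sofa, members in views:
        ET.SubElement(root, "{%s}View" % NAMESPACES["cas"], sofa=str(sofa), members=members)
    return ET.tostring(root, encoding="unicode")
```

Calling serialize_views([(12, "1 42")]) yields a single xmi:XMI root carrying the xmlns declarations, with cas:View children that only use the prefix.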

Allow circular dependencies between ranges and supertypes

Is your feature request related to a problem? Please describe.
Right now, the dkpro-xml type system cannot be parsed because there is a circular dependency between some of its types:

Circular dependencies exist among these items:
{
    "org.dkpro.core.api.xml.type.XmlDocument": {"org.dkpro.core.api.xml.type.XmlElement"},
    "org.dkpro.core.api.xml.type.XmlElement": {"org.dkpro.core.api.xml.type.XmlNode"},
    "org.dkpro.core.api.xml.type.XmlNode": {"org.dkpro.core.api.xml.type.XmlElement"},
    "org.dkpro.core.api.xml.type.XmlTextNode": {"org.dkpro.core.api.xml.type.XmlNode"}
}

Describe the solution you'd like
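Make type system loading tolerate cycles between feature ranges and supertypes, so that the dkpro-xml type system can be parsed.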
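One possible approach, sketched below under stated assumptions, is to order types by their supertype chain only and to check feature ranges in a second pass, once every type name is known. TypeDecl and resolution_order are hypothetical names, not part of cassis. Which of the dependencies listed above are supertypes and which are feature ranges is assumed here, and the feature names are invented for illustration.

```python
from dataclasses import dataclass, field

# Built-in type names assumed to exist before loading starts.
BUILTIN_TYPES = {"uima.tcas.Annotation", "uima.cas.FSArray"}

PREFIX = "org.dkpro.core.api.xml.type."


@dataclass
class TypeDecl:
    """Hypothetical parsed type description: names only, nothing resolved."""
    name: str
    supertype: str
    features: dict = field(default_factory=dict)  # feature name -> range type name


def resolution_order(decls):
    """Order types so that each supertype precedes its subtypes.

    Feature ranges are ignored while ordering, so a range may point at a
    type defined later, or back at a subtype, without causing an error.
    """
    by_name = {decl.name: decl for decl in decls}
    ordered, done, in_progress = [], set(), set()

    def visit(name):
        if name in done or name not in by_name:  # finished or built-in
            return
        if name in in_progress:
            raise ValueError("Cyclic supertype chain at [%s]" % name)
        in_progress.add(name)
        visit(by_name[name].supertype)
        in_progress.remove(name)
        done.add(name)
        ordered.append(name)

    for decl in decls:
        visit(decl.name)

    # Second pass: every range must name a known type, cycles are fine.
    known = BUILTIN_TYPES | set(by_name)
    for decl in decls:
        for feature_name, range_type in decl.features.items():
            if range_type not in known:
                raise ValueError("Unknown range [%s] for feature [%s]" % (range_type, feature_name))
    return ordered


# Assumed structure, declared in an order that a single-pass loader would reject.
DECLS = [
    TypeDecl(PREFIX + "XmlDocument", "uima.tcas.Annotation", {"root": PREFIX + "XmlElement"}),
    TypeDecl(PREFIX + "XmlElement", PREFIX + "XmlNode", {"children": "uima.cas.FSArray"}),
    TypeDecl(PREFIX + "XmlNode", "uima.tcas.Annotation", {"parent": PREFIX + "XmlElement"}),
    TypeDecl(PREFIX + "XmlTextNode", PREFIX + "XmlNode"),
]

order = resolution_order(DECLS)
print(order)
```

Only a cycle in the supertype chain itself is still rejected, since such a hierarchy cannot be built in any order.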

Add Travis CI integration

In order to let the tests run for every commit, Travis can be added as a free Continuous Integration service.

Add Pipenv
Add setup.py
Add travis integration

ValueError when reading a type system file that redefines a feature

Describe the bug

First of all, thank you very much for creating this useful tool.

Unfortunately, I stumble when I try to load the cTAKES (https://ctakes.apache.org) type system:

https://svn.apache.org/repos/asf/ctakes/trunk/ctakes-type-system/src/main/resources/org/apache/ctakes/typesystem/types/TypeSystem.xml

I get the following error:

ValueError: Feature with name [value] already exists in [org.apache.ctakes.typesystem.type.refsem.LabReferenceRange]!

To Reproduce

Save https://svn.apache.org/repos/asf/ctakes/trunk/ctakes-type-system/src/main/resources/org/apache/ctakes/typesystem/types/ somewhere on your system and then do:

from cassis import load_typesystem

with open('TypeSystem.xml', 'rb') as f:
    load_typesystem(f)

Expected behavior

I'm not totally sure why this error happens; it seems like this file (TypeSystem.xml) is a legitimate type system that has been used by the cTAKES community for many years.

Please complete the following information:

I'm doing this on a Mac

Thank you very much in advance for looking into this.
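One lenient policy a loader could adopt is to accept a repeated feature declaration when the name and range type match, and to reject it only when the range differs. The sketch below illustrates that policy with a plain dictionary. add_feature here is a hypothetical helper, not the cassis function of the same name, and the range type names are invented. Whether this matches what cassis or UIMA actually does is not established by this report.

```python
def add_feature(features, name, range_type):
    """Record a feature in a dict mapping feature name -> range type name.

    An identical redeclaration is a no-op; a conflicting one is an error.
    """
    existing = features.get(name)
    if existing is None:
        features[name] = range_type
    elif existing != range_type:
        raise ValueError(
            "Feature with name [%s] already exists with range [%s], got [%s]"
            % (name, existing, range_type)
        )
    # Same name and same range: nothing to do.


features = {}
add_feature(features, "value", "uima.cas.String")
add_feature(features, "value", "uima.cas.String")  # tolerated redeclaration
print(features)
```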

Allow leniency to CAS deserialization

Is your feature request related to a problem? Please describe.
Right now, when referencing a type in a CAS that is not in the type system, an error is thrown. It would be nice to specify that it should just be ignored.

It has to be determined whether this should be for types or also for features.

Describe the solution you'd like
Add a lenient flag like uimaj does.
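A minimal sketch of what such a flag could look like, assuming the deserializer sees each element as a (type name, attributes) pair. filter_known_types and this representation are hypothetical and not taken from cassis. The sketch covers unknown types only; the open question about features would need a similar check per attribute.

```python
import warnings


def filter_known_types(elements, known_types, lenient=False):
    """Keep elements whose type is in the type system.

    Strict mode raises on the first unknown type; lenient mode warns
    and skips the element instead.
    """
    kept = []
    for type_name, attributes in elements:
        if type_name in known_types:
            kept.append((type_name, attributes))
        elif lenient:
            warnings.warn("Skipping element of unknown type [%s]" % type_name)
        else:
            raise LookupError("Type [%s] is not in the type system" % type_name)
    return kept


known = {"uima.tcas.Annotation"}
elements = [
    ("uima.tcas.Annotation", {"begin": "0", "end": "5"}),
    ("custom.UnknownType", {"begin": "0", "end": "5"}),
]
kept = filter_known_types(elements, known, lenient=True)
print(kept)
```

Keeping strict behavior as the default means existing callers still see the error unless they opt in.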

Checking whether a feature is a collection is wrong for some collections

Describe the bug

I had to modify this chunk of code in CasXmiDeserializer (around line 151 in xmi.py) to make it partially work:

# Resolve references
# NB: need to ensure `value` is of type str, otherwise error is thrown
if typesystem.is_collection(feature.rangeTypeName) and isinstance(value, str):
    # A collection of references is a list of integers separated
    # by single spaces, e.g. <foo:bar elements="1 2 3 42" />
    targets = []

    for ref in value.split():
        target_id = int(ref)
        target = feature_structures[target_id]
        targets.append(target)
    setattr(fs, feature_name, targets)

else:
    # NB: need to ensure `value` is of type int, otherwise error when casting as int
    if isinstance(value, int):
        target_id = int(value)
        target = feature_structures[target_id]
        setattr(fs, feature_name, target)

# NB: `value` can be list of strings
if not isinstance(value, str):
    #print(" ".join(value))
    pass

To Reproduce
Run the current version of cassis on the attached files (test.zip, https://github.com/dkpro/dkpro-cassis/files/3655535/test.zip) with this script:

from cassis import *

# Placeholder: directory containing the unpacked test.zip, with a trailing path separator
dir_test = '<Path to files>'
with open(dir_test + 'TypeSystem.xml', 'rb') as f:
    typesystem = load_typesystem(f)

# add missing types
t = typesystem.create_type(name='org.apache.uima.examples.SourceDocumentInformation', supertypeName='uima.tcas.Annotation')
typesystem.add_feature(t, name='uri', rangeTypeName='uima.cas.String')
typesystem.add_feature(t, name="offsetInSource", rangeTypeName="uima.cas.Integer")
typesystem.add_feature(t, name="documentSize", rangeTypeName="uima.cas.Integer")
typesystem.add_feature(t, name="lastSegment", rangeTypeName="uima.cas.Integer")
    
fname = '737-v1.txt.xmi'
with open(dir_test + fname, 'rb') as f:
    cas = load_cas_from_xmi(f, typesystem=typesystem)

view = cas.get_view('_InitialView')
print([x for x in view.select("org.metamap.uima.ts.Candidate")])

Expected behavior

-> Arrays of all types should be loaded, including when the value is a list.
-> I am not sure how to deal with this chunk:

else:
    if isinstance(value, int):
        target_id = int(value)
        target = feature_structures[target_id]
        setattr(fs, feature_name, target)
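One way to handle the value shapes reported above (a space-separated string, a single int, or a list of strings) is to normalize them into a list of integer ids before looking anything up. parse_reference_ids is a hypothetical helper, not part of cassis, and the lookup against feature_structures is left to the caller.

```python
def parse_reference_ids(value):
    """Normalize a reference feature value into a list of integer ids."""
    if isinstance(value, int):
        # A single, already parsed reference.
        return [value]
    if isinstance(value, str):
        # Space-separated references, e.g. elements="1 2 3 42".
        return [int(ref) for ref in value.split()]
    # Otherwise assume an iterable of the shapes handled above.
    ids = []
    for item in value:
        ids.extend(parse_reference_ids(item))
    return ids


print(parse_reference_ids("1 2 3 42"))
print(parse_reference_ids(7))
print(parse_reference_ids(["1", "2 3"]))
```

With the ids normalized, the collection and single-reference branches differ only in whether the feature is set to the whole list of targets or to its single element.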

Please complete the following information:

Version: -e git+https://github.com/dkpro/dkpro-cassis.git@f0bd4bc167ff67301e786f8437f59e6e50b57bd8#egg=dkpro_cassis
OS: OS X
