dkpro / dkpro-cassis

UIMA CAS processing library written in Python

Home Page: https://pypi.org/project/dkpro-cassis/

License: Apache License 2.0

Python 99.89% Makefile 0.11%

cas uima python nlp annotation

dkpro-cassis's Introduction

dkpro-cassis

DKPro cassis (pronunciation: [ka.sis]) provides a pure-Python implementation of the Common Analysis System (CAS) as defined by the UIMA framework. The CAS is a data structure representing an object to be enriched with annotations (the so-called Subject of Analysis, SofA for short).

This library enables Python programs to create and manipulate annotated documents (CAS objects) and their associated type systems, and to load and save them in the CAS XMI XML representation or the CAS JSON representation. In particular, this can ease the integration of Python-based Natural Language Processing libraries (e.g. spacy or NLTK) and Machine Learning libraries (e.g. scikit-learn or Keras) into UIMA-based text analysis workflows.

An example of cassis in action is the spacy recommender for INCEpTION, which wraps the spacy NLP library as a web service that can be used together with the INCEpTION text annotation platform to automatically generate annotation suggestions.

Features

Currently supported features are:

Text SofAs
Deserializing/serializing UIMA CAS from/to XMI
Deserializing/serializing UIMA CAS from/to JSON
Deserializing/serializing type systems from/to XML
Selecting annotations, selecting covered annotations, adding annotations
Type inheritance
Multiple SofA support
Type system can be changed after loading
Primitive and reference features and arrays of primitives and references

Some features are still under development, e.g.

Proper type checking
XML/XMI schema validation

Installation

To install the package with pip, just run

pip install dkpro-cassis

Usage

Example CAS XMI and type system files can be found under tests\test_files.

Reading a CAS file

From XMI: A CAS can be deserialized from the UIMA CAS XMI (XML 1.0) format either by reading from a file or string using load_cas_from_xmi.

from cassis import *

with open('typesystem.xml', 'rb') as f:
    typesystem = load_typesystem(f)

with open('cas.xmi', 'rb') as f:
    cas = load_cas_from_xmi(f, typesystem=typesystem)

From JSON: The UIMA JSON CAS format is also supported and can be loaded using load_cas_from_json. Most UIMA JSON CAS files come with an embedded typesystem, so it is not necessary to specify one.

from cassis import *

with open('cas.json', 'rb') as f:
    cas = load_cas_from_json(f)

Writing a CAS file

To XMI: A CAS can be serialized to XMI using cas.to_xmi(), which either writes to a file or returns the XMI as a string.

from cassis import *

# Returned as a string
xmi = cas.to_xmi()

# Written to file
cas.to_xmi("my_cas.xmi")

To JSON: A CAS can also be written to JSON using cas.to_json().

from cassis import *

# Returned as a string
json_string = cas.to_json()

# Written to file
cas.to_json("my_cas.json")

Creating a CAS

A CAS (Common Analysis System) object typically represents a (text) document. When using cassis, you will most often read existing CAS files, modify them and then write them out again. But you can also create CAS objects from scratch, e.g. if you want to convert some data into a CAS object in order to create a pre-annotated text. If you do not have a pre-defined typesystem to work with, you will have to define one.

from cassis import *

typesystem = TypeSystem()

cas = Cas(
    sofa_string = "Joe waited for the train . The train was late .",
    document_language = "en",
    typesystem = typesystem)

print(cas.sofa_string)
print(cas.sofa_mime)
print(cas.document_language)

Adding annotations

Note: type names used below are examples only. The actual CAS files you will be dealing with will use other names! You can get a list of the types using cas.typesystem.get_types().

Given a type system with a type cassis.Token that has an id and a pos feature, annotations can be added as follows:

from cassis import *

with open('typesystem.xml', 'rb') as f:
    typesystem = load_typesystem(f)

with open('cas.xmi', 'rb') as f:
    cas = load_cas_from_xmi(f, typesystem=typesystem)

Token = typesystem.get_type('cassis.Token')

tokens = [
    Token(begin=0, end=3, id='0', pos='NNP'),
    Token(begin=4, end=10, id='1', pos='VBD'),
    Token(begin=11, end=14, id='2', pos='IN'),
    Token(begin=15, end=18, id='3', pos='DT'),
    Token(begin=19, end=24, id='4', pos='NN'),
    Token(begin=25, end=26, id='5', pos='.'),
]

for token in tokens:
    cas.add(token)

Selecting annotations

from cassis import *

with open('typesystem.xml', 'rb') as f:
    typesystem = load_typesystem(f)

with open('cas.xmi', 'rb') as f:
    cas = load_cas_from_xmi(f, typesystem=typesystem)

for sentence in cas.select('cassis.Sentence'):
    for token in cas.select_covered('cassis.Token', sentence):
        print(token.get_covered_text())

        # Annotation values can be accessed as properties
        print('Token: begin={0}, end={1}, id={2}, pos={3}'.format(token.begin, token.end, token.id, token.pos))

Getting and setting (nested) features

If you want to access a variable but only have its name as a string or have nested feature structures, e.g. a feature structure with feature a that has a feature b that has a feature c, some of which can be None, then you can use the following:

fs.get("var_name") # Or
fs["var_name"]

Or in the nested case,

fs.get("a.b.c")
fs["a.b.c"]

If a, b or c is None, then this returns None instead of throwing an error.

Another example would be a StringList containing ["foo", "bar", "baz"]:

assert lst.get("head") == "foo"
assert lst.get("tail.head") == "bar"
assert lst.get("tail.tail.head") == "baz"
assert lst.get("tail.tail.tail.head") is None
assert lst.get("tail.tail.tail.tail.head") is None

The same goes for setting:

# Functional
lst.set("head", "new_foo")
lst.set("tail.head", "new_bar")
lst.set("tail.tail.head", "new_baz")

assert lst.get("head") == "new_foo"
assert lst.get("tail.head") == "new_bar"
assert lst.get("tail.tail.head") == "new_baz"

# Bracket access
lst["head"] = "newer_foo"
lst["tail.head"] = "newer_bar"
lst["tail.tail.head"] = "newer_baz"

assert lst["head"] == "newer_foo"
assert lst["tail.head"] == "newer_bar"
assert lst["tail.tail.head"] == "newer_baz"
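The dotted-path behaviour can be pictured with a small standard-library sketch. This is not the cassis implementation: get_path and set_path are hypothetical helpers, and the list is modelled with SimpleNamespace objects instead of real feature structures.

```python
from types import SimpleNamespace


def get_path(obj, path):
    """Follow a dotted feature path; return None as soon as a step is None."""
    current = obj
    for name in path.split("."):
        if current is None:
            return None
        current = getattr(current, name, None)
    return current


def set_path(obj, path, value):
    """Set the last feature on a dotted path; the parents must already exist."""
    *parents, last = path.split(".")
    target = get_path(obj, ".".join(parents)) if parents else obj
    if target is None:
        raise AttributeError(f"Cannot set {path!r}: an intermediate value is None")
    setattr(target, last, value)


# A linked list shaped like a StringList: head/tail pairs ending in None
lst = SimpleNamespace(
    head="foo",
    tail=SimpleNamespace(head="bar", tail=SimpleNamespace(head="baz", tail=None)),
)

print(get_path(lst, "tail.head"))
print(get_path(lst, "tail.tail.tail.head"))
set_path(lst, "tail.head", "new_bar")
print(get_path(lst, "tail.head"))
```

Stopping at the first None is what makes a lookup past the end of the list return None rather than raise.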

Creating types and adding features

from cassis import *
from cassis.typesystem import TYPE_NAME_STRING, TYPE_NAME_INTEGER

typesystem = TypeSystem()

parent_type = typesystem.create_type(name='example.ParentType')
typesystem.create_feature(domainType=parent_type, name='parentFeature', rangeType=TYPE_NAME_STRING)

child_type = typesystem.create_type(name='example.ChildType', supertypeName=parent_type.name)
typesystem.create_feature(domainType=child_type, name='childFeature', rangeType=TYPE_NAME_INTEGER)

# childFeature has an integer range type, so it is given an integer value
annotation = child_type(parentFeature='parent', childFeature=42)

When new features are added, the changes are propagated. For example, adding a feature to a parent type makes it available to its child types. The type system therefore does not need to be frozen for consistency: unlike in UIMAj, it can be changed even after loading.
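One way to picture this propagation is a stand-in model in which a type resolves inherited features at access time. The sketch below is an illustration only, not how cassis is implemented; SketchType and all_features are hypothetical names.

```python
class SketchType:
    """Hypothetical stand-in for a type: own features plus a link to its parent."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.own_features = {}

    def add_feature(self, feature_name, range_type):
        self.own_features[feature_name] = range_type

    @property
    def all_features(self):
        # Inherited features are looked up on every access, so a feature added
        # to a parent later is visible in every child without any freezing step.
        inherited = self.parent.all_features if self.parent else {}
        return {**inherited, **self.own_features}


parent = SketchType("example.ParentType")
child = SketchType("example.ChildType", parent=parent)
child.add_feature("childFeature", "uima.cas.Integer")

# Added after the child was created, yet visible on the child
parent.add_feature("parentFeature", "uima.cas.String")
print(sorted(child.all_features))
```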

Sofa support

A Sofa represents some form of an unstructured artifact that is processed in a UIMA pipeline. It contains for instance the document text. Currently, new Sofas can be created. This is automatically done when creating a new view. Basic properties of the Sofa can be read and written:

from cassis import *

cas = Cas(
    sofa_string = "Joe waited for the train . The train was late .",
    document_language = "en")

print(cas.sofa_string)
print(cas.sofa_mime)
print(cas.document_language)

Array support

Array feature values are not simply Python arrays, but they are wrapped in a feature structure of a UIMA array type such as uima.cas.FSArray.

from cassis import *
from cassis.typesystem import TYPE_NAME_FS_ARRAY, TYPE_NAME_ANNOTATION

typesystem = TypeSystem()

ArrayHolder = typesystem.create_type(name='example.ArrayHolder')
typesystem.create_feature(domainType=ArrayHolder, name='array', rangeType=TYPE_NAME_FS_ARRAY)

cas = Cas(typesystem=typesystem)

Annotation = cas.typesystem.get_type(TYPE_NAME_ANNOTATION)
FSArray = cas.typesystem.get_type(TYPE_NAME_FS_ARRAY)

ann = Annotation(begin=0, end=1)
cas.add(ann)
holder = ArrayHolder(array=FSArray(elements=[ann, ann, ann]))
cas.add(holder)

Managing views

A view into a CAS contains a subset of feature structures and annotations. One view corresponds to exactly one Sofa. It can also be used to query and alter information about the Sofa, e.g. the document text. Annotations added to one view are not visible in another view. Views can be created and changed. A view has the same methods and attributes as a Cas.

from cassis import *

with open('typesystem.xml', 'rb') as f:
    typesystem = load_typesystem(f)
Token = typesystem.get_type('cassis.Token')

# This automatically creates the view `_InitialView`.
# The type system is passed in so that the CAS knows the Token type.
cas = Cas(typesystem=typesystem)
cas.sofa_string = "I like cheese ."

cas.add_all([
    Token(begin=0, end=1),
    Token(begin=2, end=6),
    Token(begin=7, end=13),
    Token(begin=14, end=15)
])

print([x.get_covered_text() for x in cas.select_all()])

# Create a new view and work on it.
view = cas.create_view('testView')
view.sofa_string = "I like blackcurrant ."

view.add_all([
    Token(begin=0, end=1),
    Token(begin=2, end=6),
    Token(begin=7, end=19),
    Token(begin=20, end=21)
])

print([x.get_covered_text() for x in view.select_all()])

Merging type systems

Sometimes, it is desirable to merge two type systems. With cassis, this can be achieved via the merge_typesystems function. The detailed rules of merging can be found here.

from cassis import *

with open('typesystem.xml', 'rb') as f:
    typesystem = load_typesystem(f)

ts = merge_typesystems([typesystem, load_dkpro_core_typesystem()])

Type checking

For simplicity, no type checking is performed when annotations are added. To check types, call the cas.typecheck() method. Currently, it only checks whether the elements of a uima.cas.FSArray adhere to the specified elementType.

DKPro Core Integration

A CAS using the DKPro Core Type System can be created via

from cassis import *

cas = Cas(typesystem=load_dkpro_core_typesystem())

for t in cas.typesystem.get_types():
    print(t)

Miscellaneous

If feature names clash with Python magic variables

If your type system defines a feature called self or type, then it will be made available as a member variable self_ or type_ on the respective type:

from cassis import *
from cassis.typesystem import *

typesystem = TypeSystem()

ExampleType = typesystem.create_type(name='example.Type')
typesystem.create_feature(domainType=ExampleType, name='self', rangeType=TYPE_NAME_STRING)
typesystem.create_feature(domainType=ExampleType, name='type', rangeType=TYPE_NAME_STRING)

annotation = ExampleType(self_="Test string1", type_="Test string2")

print(annotation.self_)
print(annotation.type_)

Leniency

If the type of a feature structure is not found in the typesystem, an exception is raised by default. If you want to ignore this kind of error, you can pass lenient=True to the Cas constructor or to load_cas_from_xmi.

Large XMI files

If you try to parse large XMI files and get an error message like XMLSyntaxError: internal error: Huge input lookup, then you can disable this security check by passing trusted=True to your calls to load_cas_from_xmi.

Citing & Authors

If you find this repository helpful, feel free to cite

@software{klie2020_cassis,
  author       = {Jan-Christoph Klie and
                  Richard Eckart de Castilho},
  title        = {DKPro Cassis - Reading and Writing UIMA CAS Files in Python},
  publisher    = {Zenodo},
  doi          = {10.5281/zenodo.3994108},
  url          = {https://github.com/dkpro/dkpro-cassis}
}

Development

The required dependencies are managed by pip. A virtual environment containing all needed packages for development and production can be created and activated by

virtualenv venv --python=python3
source venv/bin/activate
pip install -e ".[test, dev, doc]"

The tests can be run in the current environment by invoking

make test

or in a clean environment via

tox

Release

Make sure all issues for the milestone are completed, otherwise move them to the next
Check out the main branch
Bump the version in cassis/__version__.py to a stable one, e.g. __version__ = "0.6.0", commit and push, then wait until the build has completed. An example commit message would be No issue. Release 0.6.0
Create a tag for that version via e.g. git tag v0.6.0 and push the tags via git push --tags. Pushing a tag triggers the release to PyPI
Bump the version in cassis/__version__.py to the next development version, e.g. 0.7.0-dev, commit and push that. An example commit message would be No issue. Bump version after release
Once the build has completed and PyPI has accepted the new version, go to the GitHub release and write the changelog based on the issues in the respective milestone
Create a new milestone for the next version

dkpro-cassis's Issues

Refactor Cas constructor

Compute namespaces instead of storing them
Do not take a list of annotations, but add a bulk add function
Do not take a sofa and view list, but add sofa support (see #5)

Make creating type classes lazy

Is your feature request related to a problem? Please describe.
Right now, when loading a typesystem, all Python classes representing types in the type system are created, whether they will be used later or not. This is slow for large type systems.

Describe the solution you'd like
Make it possible to create a type class when creating an annotation of that type for the first time.
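A minimal sketch of the requested behaviour, assuming a registry that only records declarations and builds each Python class on first use. LazyTypeRegistry and its methods are hypothetical names, not part of cassis.

```python
class LazyTypeRegistry:
    """Hypothetical registry that builds a Python class only on first use."""

    def __init__(self, declared):
        # declared maps a type name to its list of feature names
        self._declared = dict(declared)
        self._classes = {}

    def get_class(self, type_name):
        if type_name not in self._classes:
            features = self._declared[type_name]  # KeyError for unknown types
            class_name = type_name.replace(".", "_")
            # type() builds the class on demand; __slots__ keeps instances small
            self._classes[type_name] = type(class_name, (), {"__slots__": tuple(features)})
        return self._classes[type_name]

    @property
    def created(self):
        return sorted(self._classes)


registry = LazyTypeRegistry({
    "cassis.Token": ["begin", "end", "pos"],
    "cassis.Sentence": ["begin", "end"],
})

print(registry.created)  # nothing has been built yet
Token = registry.get_class("cassis.Token")
token = Token()
token.begin, token.end, token.pos = 0, 3, "NNP"
print(registry.created)  # only the requested type exists
```

Loading a large type system then costs one dictionary entry per type, and class construction is paid only for the types a program actually instantiates.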

Handle redefining of predefined types

Some CAS handling frameworks redefine predefined types like DocumentAnnotation. In order to not throw the error that the type already exists, it would be nice to check whether they are exactly the same type and then ignore it. Also, it would be nice to also redefine this type on export again to keep the type system as is.
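The check proposed here could look roughly like the following sketch. It models type definitions as plain dictionaries; register_type and the PREDEFINED table are hypothetical and only mirror the DocumentAnnotation definition quoted elsewhere in these issues.

```python
import copy

PREDEFINED = {
    "uima.tcas.DocumentAnnotation": {
        "supertype": "uima.tcas.Annotation",
        "features": {"language": "uima.cas.String"},
    },
}


def register_type(types, name, supertype, features):
    """Add a type, tolerating an exact redefinition of an existing one."""
    definition = {"supertype": supertype, "features": dict(features)}
    existing = types.get(name)
    if existing is None:
        types[name] = definition
        return "created"
    if existing == definition:
        return "ignored"  # identical redefinition: nothing to do
    raise ValueError(f"Type [{name}] already exists with a different definition")


types = copy.deepcopy(PREDEFINED)

same = register_type(
    types, "uima.tcas.DocumentAnnotation", "uima.tcas.Annotation", {"language": "uima.cas.String"}
)
new = register_type(types, "example.Token", "uima.tcas.Annotation", {"pos": "uima.cas.String"})

try:
    register_type(
        types, "uima.tcas.DocumentAnnotation", "uima.tcas.Annotation", {"language": "uima.cas.Integer"}
    )
    conflict = False
except ValueError:
    conflict = True

print(same, new, conflict)
```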

Reading xmi then writing again results in broken xmi file.

If I load the attached xmi via load_cas_from_xmi and then use to_xmi without changing anything in the cas, INCEpTION cannot read the new xmi file.

Bundestag_08-7.zip

To Reproduce
Steps to reproduce the behavior:

with open('TypeSystem.xml', 'rb') as f:
    typesystem = load_typesystem(f)

with open('Bundestag_08-7.txt.xmi', 'rb') as f:
    cas = load_cas_from_xmi(f, typesystem=typesystem)

cas.to_xmi('test_cas.xmi', pretty_print=True)

Then import test_cas.xmi into INCEpTION

Expected behavior
I would expect INCEpTION to import the resulting xmi file, just as it did with the original xmi file.

Error message
Error while uploading document test_cas.xmi: XCASParsingException: Error parsing XMI-CAS from source at line -1, column -1: xmi id 1459 is referenced but not defined.

Please complete the following information:

Version: 0.2.0rc2
OS: Ubuntu

Error when reading XMI files with sofaURI

I'm trying to read one of my XMI files that I've always been able to read successfully in the past using Java UimaFit tools. I run the following code:

  ts_file = open('TypeSystem.xml', 'rb')
  type_system = load_typesystem(ts_file)

  xmi_file = open('/Users/Dima/Loyola/Data/Thyme/Xmi/ID169_clinic_496.xmi', 'rb')
  cas = load_cas_from_xmi(xmi_file, typesystem=type_system)

and I get this error:

Traceback (most recent call last):
  File "./cas.py", line 11, in <module>
    cas = load_cas_from_xmi(xmi_file, typesystem=type_system)
  File "/usr/local/lib/python3.6/site-packages/cassis/xmi.py", line 38, in load_cas_from_xmi
    return deserializer.deserialize(source, typesystem=typesystem)
  File "/usr/local/lib/python3.6/site-packages/cassis/xmi.py", line 71, in deserialize
    sofa = self._parse_sofa(elem)
  File "/usr/local/lib/python3.6/site-packages/cassis/xmi.py", line 188, in _parse_sofa
    return Sofa(**attributes)
TypeError: __init__() got an unexpected keyword argument 'sofaURI'

Any thoughts what's causing it? Thank you very much in advance.

Broken annotation type

Removing uima.tcas.DocumentAnnotation from _types property in the TypeSystem class in typesystem.py breaks load_cas_from_xmi.

This code was removed in a previous commit:

# DocumentAnnotation
t = self.create_type(name='uima.tcas.DocumentAnnotation', supertypeName='uima.tcas.Annotation')
self.add_feature(t, name='language', rangeTypeName='uima.cas.String')

Readd DocumentAnnotation as predefined type #41
Support types with no namespace #43

ValueError when reading a type system file that redefines a feature

Describe the bug

First of all, thank you very much for creating this useful tool.

Unfortunately, I stumble when I try to load the cTAKES (https://ctakes.apache.org) type system:

https://svn.apache.org/repos/asf/ctakes/trunk/ctakes-type-system/src/main/resources/org/apache/ctakes/typesystem/types/TypeSystem.xml

I get the following error:

ValueError: Feature with name [value] already exists in [org.apache.ctakes.typesystem.type.refsem.LabReferenceRange]!

To Reproduce

Save https://svn.apache.org/repos/asf/ctakes/trunk/ctakes-type-system/src/main/resources/org/apache/ctakes/typesystem/types/ somewhere on your system and then do:

f = open('TypeSystem.xml', 'rb')
load_typesystem(f)

Expected behavior

I'm not totally sure why this error happens -- it seems like this file (TypeSystem.xml) is a legitimate type system that's been used by the cTAKES community for many years.

Please complete the following information:

I'm doing this on a Mac

Thank you very much in advance for looking into this.

High level helper functions for extracting typesystem independent annotations

Hi,

We are trying to extract the tokens' text strings and POS tags from CAS objects. However, different type systems return POS tags in different formats. @zesch Please correct me here if I am wrong.
It would be great to have some helper functions (some are shown in the examples below) that could cover these requests.

For example for the given cas object:

To return all the token texts:

cas.get_token_strings() or cas.select(TOKEN).as_text()

To return all pos tags with ptb pos tag format:

cas.select(TOKEN).get_pos_tags(format='ptb')

We hope to see these helper functions as part of this API.

Thanks!!

Add version numbers

Use some library to add semantic versioning numbers and easy version bumping.

Get PEP8 compliant

Adhere to the Style Guide for Python Code

Fix sorting of annotations without begin and end

Right now, all annotations are stored in a sorted list ordered by (begin, end). This enables us to use binary search when looking for covered annotations. There is one list for each annotation type. Some annotations do not have begin or end; these are currently hacked to be sortable by using (sys.maxsize, sys.maxsize) as a key. Only annotations that have (begin, end) should be put into sorted lists; the others should go into an ordinary list. These should also be selectable with cas.select.
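A sketch of the proposed split, using only the standard library. SpanIndex is a hypothetical class, not the cassis data structure: items with offsets go into a list kept sorted with bisect, everything else goes into a plain list, and both are returned by select.

```python
import bisect
from collections import namedtuple
from types import SimpleNamespace

Span = namedtuple("Span", ["begin", "end", "label"])


class SpanIndex:
    """Hypothetical index: spans with offsets are sorted, the rest are not."""

    def __init__(self):
        self._keys = []      # (begin, end) keys, kept sorted
        self._sorted = []    # spans, in the same order as self._keys
        self._unsorted = []  # feature structures without begin/end

    def add(self, item):
        if getattr(item, "begin", None) is None or getattr(item, "end", None) is None:
            self._unsorted.append(item)
            return
        key = (item.begin, item.end)
        pos = bisect.bisect_right(self._keys, key)
        self._keys.insert(pos, key)
        self._sorted.insert(pos, item)

    def select(self):
        return self._sorted + self._unsorted

    def select_covered(self, begin, end):
        # (begin,) sorts before every (begin, x), so this binary search finds
        # the first span that starts at or after `begin`
        start = bisect.bisect_left(self._keys, (begin,))
        covered = []
        for item in self._sorted[start:]:
            if item.begin > end:
                break
            if item.end <= end:
                covered.append(item)
        return covered


index = SpanIndex()
for span in [Span(4, 10, "waited"), Span(0, 3, "Joe"), Span(27, 30, "The"), Span(11, 14, "for")]:
    index.add(span)
index.add(SimpleNamespace(label="metadata"))  # no begin/end

print([s.label for s in index.select_covered(0, 26)])
print([s.label for s in index.select()])
```

With this layout no sentinel key such as (sys.maxsize, sys.maxsize) is needed, because items without offsets never enter the sorted list.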

Error if supertype has no namespace

I run into a KeyError while loading a type system (see attachment) exported by IBM Watson Explorer:

Traceback (most recent call last):
  File "C:/Users/x/PycharmProjects/dkpro-cassis/tests/load_wex_ts.py", line 12, in <module>
    typeSystem = load_typesystem(ts_descriptor_file)
  File "C:\Users\x\PycharmProjects\dkpro-cassis\cassis\typesystem.py", line 538, in load_typesystem
    return deserializer.deserialize(source)
  File "C:\Users\x\PycharmProjects\dkpro-cassis\cassis\typesystem.py", line 632, in deserialize
    t = types[type_name]
KeyError: 'OpinionPhrase'

Process finished with exit code 1

This error occurs while processing a type named OpinionPhrase, which does not have namespace.
It does not occur when I insert this code as lines 572 and 573:

            if "." not in supertypeName:
                supertypeName = "uima.noNamespace." + supertypeName

However, this only means I get a second error later on. I don't know if it is related, but I describe it here anyway.

Traceback (most recent call last):
  File "C:/Users/x/PycharmProjects/dkpro-cassis/tests/load_wex_ts.py", line 12, in <module>
    typeSystem = load_typesystem(ts_descriptor_file)
  File "C:\Users\x\PycharmProjects\dkpro-cassis\cassis\typesystem.py", line 538, in load_typesystem
    return deserializer.deserialize(source)
  File "C:\Users\x\PycharmProjects\dkpro-cassis\cassis\typesystem.py", line 641, in deserialize
    t, name=f.name, rangeTypeName=f.rangeTypeName, elementType=f.elementType, description=f.description
  File "C:\Users\x\PycharmProjects\dkpro-cassis\cassis\typesystem.py", line 480, in add_feature
    type_.add_feature(feature)
  File "C:\Users\x\PycharmProjects\dkpro-cassis\cassis\typesystem.py", line 242, in add_feature
    self.__attrs_post_init__()
  File "C:\Users\x\PycharmProjects\dkpro-cassis\cassis\typesystem.py", line 175, in __attrs_post_init__
    self._constructor = attr.make_class(name, fields, bases=(FeatureStructure,), slots=True, eq=False, order=False)
  File "C:\Users\x\Anaconda3\lib\site-packages\attr\_make.py", line 2129, in make_class
    return _attrs(these=cls_dict, **attributes_arguments)(type_)
  File "C:\Users\x\Anaconda3\lib\site-packages\attr\_make.py", line 1002, in wrap
    builder.add_init()
  File "C:\Users\x\Anaconda3\lib\site-packages\attr\_make.py", line 689, in add_init
    self._is_exc,
  File "C:\Users\x\Anaconda3\lib\site-packages\attr\_make.py", line 1351, in _make_init
    bytecode = compile(script, unique_filename, "exec")
  File "<attrs generated init cassis.typesystem.com_ibm_es_tt_DocumentMetaData-19>", line 1
SyntaxError: duplicate argument 'self' in function definition

Process finished with exit code 1

The script it is trying to compile looks like this:

def __init__(self, xmiID=attr_dict['xmiID'].default, crawlspaceId=attr_dict['crawlspaceId'].default, crawlerId=attr_dict['crawlerId'].default, dataSource=attr_dict['dataSource'].default, dataSourceName=attr_dict['dataSourceName'].default, docType=attr_dict['docType'].default, encoding=attr_dict['encoding'].default, date=attr_dict['date'].default, url=attr_dict['url'].default, httpcode=attr_dict['httpcode'].default, documentId=attr_dict['documentId'].default, title=attr_dict['title'].default, deleteDocument=attr_dict['deleteDocument'].default, volatileURL=attr_dict['volatileURL'].default, associations=attr_dict['associations'].default, tags=attr_dict['tags'].default, type=attr_dict['type'].default, self=attr_dict['self'].default):
    self.xmiID = xmiID
    self.crawlspaceId = crawlspaceId
    self.crawlerId = crawlerId
    self.dataSource = dataSource
    self.dataSourceName = dataSourceName
    self.docType = docType
    self.encoding = encoding
    self.date = date
    self.url = url
    self.httpcode = httpcode
    self.documentId = documentId
    self.title = title
    self.deleteDocument = deleteDocument
    self.volatileURL = volatileURL
    self.associations = associations
    self.tags = tags
    self.type = type
    self.self = self

I'm working with conda 4.7.12 and python 3.7.4 on Windows 10.
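The second error can be reproduced without attrs or cassis. The sketch below generates a constructor the same way in spirit (make_init_source is a hypothetical helper, not the attrs code generator) and shows that a feature named self yields a duplicate parameter, while renaming clashing features with a trailing underscore compiles.

```python
import keyword

RESERVED = {"self", "type"}  # assumption: the names this sketch protects


def safe_name(feature_name):
    """Append an underscore to names that would clash inside generated code."""
    if feature_name in RESERVED or keyword.iskeyword(feature_name):
        return feature_name + "_"
    return feature_name


def make_init_source(feature_names):
    """Build the source text of an __init__ taking one keyword per feature."""
    params = ", ".join(f"{name}=None" for name in feature_names)
    body = "\n".join(f"    self.{name} = {name}" for name in feature_names)
    return f"def __init__(self, {params}):\n{body}\n"


features = ["url", "type", "self"]

try:
    compile(make_init_source(features), "<generated>", "exec")
    error = None
except SyntaxError as exc:
    error = exc.msg

print(error)

# With renamed features the same generator produces valid code
fixed = make_init_source([safe_name(name) for name in features])
compile(fixed, "<generated>", "exec")
print(fixed)
```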

exported_typesystem.zip

Release 0.2.0

Release 0.2.0 containing several improvements and bug fixes.

Add Travis CI integration

In order to let the tests run for every commit, Travis can be added as a free Continuous Integration service.

Add Pipenv
Add setup.py
Add travis integration

Add Read The Docs

In order to auto generate documentation on every commit, readthedocs can be used as a free service.

Add sphinx to the project
Write documentation
Add integration for readthedocs

AttributeError while deserializing XMI: object has no attribute 'sofa'

Deserializing an XMI file that contains a type myType whose supertypeName is uima.cas.TOP raises an AttributeError:

AttributeError: myType object has no attribute 'sofa'
...
/cassis/cas.py", line 169, in add_annotation
annotation.sofa = self.get_sofa().xmiID

(Please note that processing xmi in java is not an issue)

Add extensions for use cases from DKPro Core and cTAKES to the CAS interface

It would be nice to be able to initialize a CAS with a certain type system, e.g.

from somewhere import DKProCoreTypeSystem
from cassis import Cas

cas = Cas(DKProCoreTypeSystem())

Support type inheritance

Types in a UIMA type system can inherit from other types. This can be specified via the supertypeName tag. This inheritance needs to be supported, so that a type inherits the features of its supertype.

Add more cas select methods

Is your feature request related to a problem? Please describe.
It would be nice to add more cas select methods like CasUtil does.

Describe the solution you'd like

Update README with package information

Add badges
Specify Python version in setup.py

Add get_covered_text method to Annotations

Is your feature request related to a problem? Please describe.
Right now, in order to get the covered text, one has to call cas.get_covered_text(annotation). That is not nice and different from Java.

Describe the solution you'd like
Use annotation.get_covered_text()

Additional context
If the annotation is a feature structure, i.e. does not cover text, an exception should be thrown.

Add test files from the Apache UIMA Java SDK

It would be nice to add the test files that UIMA uses to the cassis test suite.
Tests can be found under e.g. https://github.com/apache/uima-uimaj/blob/a68b57a8eaca2ee5392903d8f2f86bca5df08054/uimaj-core/src/test/java/org/apache/uima/cas/impl/XmiCasDeserializerTest.java

Files can e.g. be found under
https://github.com/apache/uima-uimaj/tree/a68b57a8eaca2ee5392903d8f2f86bca5df08054/uimaj-core/src/test/resources/ExampleCas

Support types with no namespace

Is your feature request related to a problem? Please describe.
Some UIMA CAS handling frameworks define type systems with types that have no namespace, e.g.

    <typeDescription>
        <name>ArtifactID</name>
        <description>A unique artifact identifier.</description>
        <supertypeName>uima.cas.TOP</supertypeName>
        <features>
            <featureDescription>
                <name>artifactID</name>
                <description>A unique identification string for the artifact. This should be the file name for files,
        or the unique identifier used in a database if the document source is a database
        collection reader.</description>
                <rangeTypeName>uima.cas.String</rangeTypeName>
            </featureDescription>
        </features>
    </typeDescription>

Describe the solution you'd like
When importing these types, prefix their name with uima.noNamespace. This is apparently how it was done in the past.

Describe alternatives you've considered
Not supporting this, but it does not cost much and supports old systems.
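The proposed prefixing is small enough to sketch directly. The prefix string comes from the description above; the function names qualify and unqualify are hypothetical, and stripping the prefix again on export is an assumption about how a round trip could keep the original names.

```python
NO_NAMESPACE_PREFIX = "uima.noNamespace."


def qualify(type_name):
    """Prefix a type name that has no namespace; leave qualified names alone."""
    if "." in type_name:
        return type_name
    return NO_NAMESPACE_PREFIX + type_name


def unqualify(type_name):
    """Reverse of qualify(), e.g. for writing the type system back out."""
    if type_name.startswith(NO_NAMESPACE_PREFIX):
        return type_name[len(NO_NAMESPACE_PREFIX):]
    return type_name


print(qualify("ArtifactID"))
print(qualify("uima.cas.TOP"))
print(unqualify(qualify("ArtifactID")))
```

The same function has to be applied to supertype names and feature range types, otherwise lookups for a type such as OpinionPhrase would still fail.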

Add Coverage

In order to monitor test coverage, add Coverage integration.

Take type system into account when adding annotations to CAS

Right now, the CAS is not aware of the type system. That means, it is possible to add types to a CAS that are not in the type system. It should be the default to validate types to allow only adding types that are in the type system.

Checking whether a feature is a collection is wrong for some collections

Describe the bug

Had to modify this chunk of code in CasXmiDeserializer (@ line 151 in xmi.py) to make it partially work:

 # Resolve references
              # NB: need to ensure `value` is of type str, otherwise error is thrown 
              if typesystem.is_collection(feature.rangeTypeName) and isinstance(value, str):
                  # A collection of references is a list of integers separated
                  # by single spaces, e.g. <foo:bar elements="1 2 3 42" />
                  targets = []
                  
                  for ref in value.split():
                      target_id = int(ref)
                      target = feature_structures[target_id]
                      targets.append(target)
                  setattr(fs, feature_name, targets)
               
              else:
                  # NB: need to ensure `value` is of type int, otherwise error when casting as int
                  if isinstance(value, int):
                      target_id = int(value)
                      target = feature_structures[target_id]
                      setattr(fs, feature_name, target)

              # NB: `value` can be list of strings
              if not isinstance(value, str):
                  #print(" ".join(value))
                  pass
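The branching above can be condensed into one helper that accepts every shape of value mentioned in the comments. This is a sketch under assumptions, not a patch for xmi.py: resolve_references is a hypothetical function, and feature_structures is modelled as a plain dict from xmi id to object.

```python
def resolve_references(value, feature_structures, is_collection):
    """Turn a raw XMI attribute value into the referenced feature structure(s)."""
    if is_collection:
        # Accept both "1 2 3 42" and an already split list such as ["1", "2"]
        refs = value.split() if isinstance(value, str) else list(value)
        return [feature_structures[int(ref)] for ref in refs]
    # A single reference may arrive as "42" or as 42
    return feature_structures[int(value)]


feature_structures = {1: "fs-1", 2: "fs-2", 3: "fs-3", 42: "fs-42"}

print(resolve_references("1 2 3 42", feature_structures, is_collection=True))
print(resolve_references(["2", "3"], feature_structures, is_collection=True))
print(resolve_references("42", feature_structures, is_collection=False))
```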

To Reproduce
Run current version of cassis on attached files with this script (attached test.zip to reproduce errors):

test.zip

from cassis import *

dir_test = '<Path to files>'
with open(dir_test + 'TypeSystem.xml', 'rb') as f:
    typesystem = load_typesystem(f)

# add missing types
t = typesystem.create_type(name='org.apache.uima.examples.SourceDocumentInformation', supertypeName='uima.tcas.Annotation')
typesystem.add_feature(t, name='uri', rangeTypeName='uima.cas.String')
typesystem.add_feature(t, name="offsetInSource", rangeTypeName="uima.cas.Integer")
typesystem.add_feature(t, name="documentSize", rangeTypeName="uima.cas.Integer")
typesystem.add_feature(t, name="lastSegment", rangeTypeName="uima.cas.Integer")
    
fname = '737-v1.txt.xmi'
with open(dir_test + fname, 'rb') as f:
    cas = load_cas_from_xmi(f, typesystem=typesystem)

view = cas.get_view('_InitialView')
print([x for x in view.select("org.metamap.uima.ts.Candidate")])

Expected behavior

-> Expect arrays of all types to be loaded, including if it's a list.
-> Not sure how to deal with this chunk:

else:
  if isinstance(value, int):
      target_id = int(value)
      target = feature_structures[target_id]
      setattr(fs, feature_name, target)

Please complete the following information:

Version: -e git+https://github.com/dkpro/dkpro-cassis.git@f0bd4bc167ff67301e786f8437f59e6e50b57bd8#egg=dkpro_cassis
OS: OS X

Inconsistent xmiID

Describe the bug

xmiID is getting changed somewhere along the way.

To Reproduce

Add the following line to the end of _parse_annotations:

if attributes["xmiID"] == 52613:
    print("TEST:", attributes)

With included XMI and Typesystem files, run the following:

with open(dir_test + 'TypeSystem.xml', 'rb') as f:
    typesystem = load_typesystem(f)
    
fname = '737-v1.txt.xmi'
with open(dir_test + fname, 'rb') as f:
    cas = load_cas_from_xmi(f, typesystem=typesystem)

view = cas.get_view('Analysis')

print([x for x in view.select("biomedicus.v2.UmlsConcept")])

You will see:

TEST: {'sofa': 18, 'begin': 10, 'end': 14, 'sui': 'S0628538', 'cui': 'C2740799', 'tui': 'T121', 'source': 'MTHSPL', 'confidence': '1.0', 'xmiID': 52613}
[biomedicus_v2_UmlsConcept(xmiID=9826, sui='S0628538', cui='C2740799', tui='T121', source='MTHSPL', confidence='1.0', begin=10, end=14, sofa=6, type='biomedicus.v2.UmlsConcept')...]

The xmiIDs are not the same (they should be, since they are the first set of annotations for this type).

Expected behavior

Expect the output xmiID to not change.

I suspect this has something to do with the AnnotationBase class definition, but have not had a chance to examine this further.

Please complete the following information:

Version: 0.2.0.dev0
OS: MacOS

Archive.zip

Improve package import structure

I would simplify the current package import structure (in the __init__.py):

cassis contains all things like Cas, TypeSystem, Sofa, Feature, Type
cassis.xmi contains loader/saver for XMI
cassis.typesystem contains loader/saver for type systems

Add different docs for last release and master

Add two documentation links, one for master and one for the last release.

Circular annotations

Spoke too soon. Both cTAKES and Clamp have circular annotations (or linked, if you want to get pedantic).

Getting the following error:

---------------------------------------------------------------------------
CircularDependencyError                   Traceback (most recent call last)
<ipython-input-4-ff102eb39ae5> in <module>()
      7 dir_test = '/Users/gms/development/nlp/nlpie/data/medinfo/ctakes_out/'
      8 with open(dir_test + 'TypeSystem.xml', 'rb') as f:
----> 9     typesystem = load_typesystem(f)
     10     #print(dir(typesystem))
     11     #print(typesystem.get_types())

/anaconda3/lib/python3.7/site-packages/cassis/typesystem.py in load_typesystem(source)
    425         return deserializer.deserialize(BytesIO(source.encode("utf-8")))
    426     else:
--> 427         return deserializer.deserialize(source)
    428 
    429 

/anaconda3/lib/python3.7/site-packages/cassis/typesystem.py in deserialize(self, source)
    482 
    483         ts = TypeSystem()
--> 484         for type_name in toposort_flatten(dependencies, sort=False):
    485             # No need to create predefined types
    486             if type_name in PREDEFINED_TYPES:

/anaconda3/lib/python3.7/site-packages/toposort.py in toposort_flatten(data, sort)
     88 
     89     result = []
---> 90     for d in toposort(data):
     91         result.extend((sorted if sort else list)(d))
     92     return result

/anaconda3/lib/python3.7/site-packages/toposort.py in toposort(data)
     79                     if item not in ordered}
     80     if len(data) != 0:
---> 81         raise CircularDependencyError(data)
     82 
     83 

CircularDependencyError: Circular dependencies exist among these items: {'org.apache.ctakes.typesystem.type.textsem.SemanticArgument':{'org.apache.ctakes.typesystem.type.textsem.SemanticRoleRelation'}, 'org.apache.ctakes.typesystem.type.textsem.SemanticRoleRelation':{'org.apache.ctakes.typesystem.type.textsem.SemanticArgument'}}

These are definitely legit, and can be viewed in the UIMA CVD.

Any ideas on how to implement this? I am happy to do the work. Again, I could just remove these annotations, especially since we are not interested in them, but when dealing with 50 million x 2 files, that is a lot of processing time.

Worst case, we can modify the AE config file to not print these as output.
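
One possible way to tolerate types that reference each other is to create them in two passes: first register every type by name, then attach the features whose ranges point at other types. The sketch below is only an illustration under that assumption; `Type`, `build_types` and the feature names `relation` and `argument` are hypothetical and not part of cassis, and the type names are shortened.

```python
from typing import Dict, List, Tuple


class Type:
    """Hypothetical stand-in for a type definition; not the cassis class."""

    def __init__(self, name: str):
        self.name = name
        self.features = {}  # feature name -> range Type


def build_types(declarations: Dict[str, List[Tuple[str, str]]]) -> Dict[str, Type]:
    """Create types in two passes so feature ranges may reference each other."""
    # Pass 1: register every type by name, without features.
    types = {name: Type(name) for name in declarations}
    # Pass 2: every range type now exists, so cycles are harmless.
    for name, features in declarations.items():
        for feature_name, range_name in features:
            types[name].features[feature_name] = types[range_name]
    return types


# Feature names are made up for illustration.
declarations = {
    "textsem.SemanticArgument": [("relation", "textsem.SemanticRoleRelation")],
    "textsem.SemanticRoleRelation": [("argument", "textsem.SemanticArgument")],
}
types = build_types(declarations)
```

With this approach no ordering between the two types is needed, because a feature range only requires the referenced type to exist, not to be fully defined.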

Release 0.1.1 and publish it on pypi

It would be nice to have version 0.1.0 released.

Add function to select first covering annotation

Is your feature request related to a problem? Please describe.
It is annoying to get the first covering annotation e.g. of a sentence.

Describe the solution you'd like
Add function to select first covering annotation
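
A rough sketch of what such a helper could look like, written against a hypothetical `Span` class rather than the cassis API. It assumes "first" means the covering annotation with the smallest begin offset, with ties going to the longer span.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Span:
    """Hypothetical stand-in for an annotation with character offsets."""
    type: str
    begin: int
    end: int


def select_first_covering(annotations: Iterable[Span], type_name: str, target: Span) -> Optional[Span]:
    """Return the first annotation of `type_name` that fully covers `target`, or None."""
    candidates = (
        a for a in annotations
        if a.type == type_name and a.begin <= target.begin and target.end <= a.end
    )
    # "First" here means smallest begin offset, ties broken by the longer span.
    return min(candidates, key=lambda a: (a.begin, -a.end), default=None)


annotations = [
    Span("Sentence", 0, 20),
    Span("Sentence", 21, 40),
    Span("Token", 21, 26),
]
token = annotations[2]
sentence = select_first_covering(annotations, "Sentence", token)
```

Returning `None` when nothing covers the target keeps the caller from having to catch an exception for the common "no sentence found" case.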

Rename `Annotation` to `FeatureStructure`

When I started this project, I thought everything in a CAS was an annotation. That is not the case; the basis should be the feature structure. To fix the naming, class and argument names need to be renamed.

Document that you can change the typesystem after creating it

Is your feature request related to a problem? Please describe.
Advertise this somewhere, because one of the things people tend to find limiting about the Java implementation is that the type system cannot be changed after the CAS has been initialized; in particular, no types can be added and no features can be added to types.

Describe the solution you'd like
Write it in the readme.

Additional context
UIMA Java cannot do it and it is annoying.

Creating cas without typesystem shares the default typesystem

Describe the bug
Creating cas without typesystem shares the default typesystem

To Reproduce
Steps to reproduce the behavior:

Create two CASes with cas1 = Cas(); cas2 = Cas()
Add the same new type to both
An error is thrown due to a duplicate type

Expected behavior
Each CAS has its own type system
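
One common way this kind of sharing arises in Python is a default argument that is evaluated once at function definition time. Whether that is the actual cause here is an assumption; the sketch below uses hypothetical `TypeSystem`, `SharedCas` and `IsolatedCas` classes (not the cassis ones) to show the pitfall and the usual `None` sentinel fix.

```python
class TypeSystem:
    """Hypothetical minimal type system; not the cassis class."""

    def __init__(self):
        self.types = set()

    def create_type(self, name):
        if name in self.types:
            raise ValueError("Type with name [{0}] already exists!".format(name))
        self.types.add(name)


class SharedCas:
    # The default is evaluated once, when the function is defined,
    # so every instance created without an argument shares it.
    def __init__(self, typesystem=TypeSystem()):
        self.typesystem = typesystem


class IsolatedCas:
    # Use None as a sentinel and build a fresh type system per instance.
    def __init__(self, typesystem=None):
        self.typesystem = typesystem if typesystem is not None else TypeSystem()


shared_a, shared_b = SharedCas(), SharedCas()
isolated_a, isolated_b = IsolatedCas(), IsolatedCas()
```

With `IsolatedCas`, adding the same type name to both instances succeeds, whereas with `SharedCas` the second `create_type` call raises.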

Support feature structures referencing other feature structures

Describe the bug

We have an annotation type that has a feature structure that is referenced via a feature/attribute, but the feature structure is not being displayed in the loaded CAS object.

To Reproduce

I am able to query the annotation type just fine, but the feature structure does not show up when I do a load_cas_from_xmi

Annotation type in question is: textsem:MedicationMention (or ProcedureMention, etc) and is linked to refsem:UmlsConcept via ontologyConceptArr.

I've tried:

print([x for x in cas.select('org.apache.ctakes.typesystem.type.refsem.UmlsConcept')])
print([x for x in cas.select_all()])

and

view = cas.get_view('_InitialView')
print([x for x in view.select_all()])

all to no avail.

Expected behavior

Expect to see refsem:UmlsConcept in the displayed CAS.

Please complete the following information:

Version: [0.2.0.dev0]
OS: OS X

Additional context

Attached are XMI and Typesystem example:
ctakes_out.zip

Loading a type system in which a child type redefines a parent type throws.

Describe the bug
Loading a type system in which a child type redefines a parent type throws.

To Reproduce
Steps to reproduce the behavior:

Loading a type system in which a child type redefines a parent type.
See the error

Expected behavior
No error

Fix CAS XMI namespace serialization

Right now, the namespaces are repeated for every element, which heavily bloats the resulting XMI string:

e.g.

<cas:View xmlns:cas="http:///uima/cas.ecore" xmlns:chunk="http:///de/tudarmstadt/ukp/dkpro/core/api/syntax/type/chunk.ecore" xmlns:constituent="http:///de/tudarmstadt/ukp/dkpro/core/api/syntax/type/constituent.ecore" xmlns:dependency="http:///de/tudarmstadt/ukp/dkpro/core/api/syntax/type/dependency.ecore" xmlns:morph="http:///de/tudarmstadt/ukp/dkpro/core/api/lexmorph/type/morph.ecore" xmlns:pos="http:///de/tudarmstadt/ukp/dkpro/core/api/lexmorph/type/pos.ecore" xmlns:tcas="http:///uima/tcas.ecore" xmlns:tweet="http:///de/tudarmstadt/ukp/dkpro/core/api/lexmorph/type/pos/tweet.ecore" xmlns:type="http:///de/tudarmstadt/ukp/dkpro/core/api/coref/type.ecore" xmlns:type2="http:///de/tudarmstadt/ukp/dkpro/core/api/metadata/type.ecore" xmlns:type3="http:///de/tudarmstadt/ukp/dkpro/core/api/ner/type.ecore" xmlns:type4="http:///de/tudarmstadt/ukp/dkpro/core/api/segmentation/type.ecore" xmlns:type5="http:///de/tudarmstadt/ukp/dkpro/core/api/semantics/type.ecore" xmlns:type6="http:///de/tudarmstadt/ukp/dkpro/core/api/syntax/type.ecore" xmlns:type7="http:///de/tudarmstadt/ukp/dkpro/core/api/transform/type.ecore" xmlns:type8="http:///de/tudarmstadt/ukp/inception/api/kb/type.ecore" xmlns:type9="http:///de/tudarmstadt/ukp/inception/recommendation/api/type.ecore" xmlns:xmi="http://www.omg.org/XMI" sofa="12" members="1 42"/>

The goal is to have the declarations only at the top; the elements themselves should use prefixes and not define new namespaces.
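
As an illustration of the desired output, the following sketch uses the standard library's `xml.etree.ElementTree`, which emits all namespace declarations once on the root element. It only covers a subset of the namespaces above, and the `DocumentAnnotation` element with its offsets is made up for the example. cassis itself depends on lxml, where the equivalent would presumably be passing an `nsmap` to the root element.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI mapping; only a subset of the namespaces from the report.
NAMESPACES = {
    "xmi": "http://www.omg.org/XMI",
    "cas": "http:///uima/cas.ecore",
    "tcas": "http:///uima/tcas.ecore",
}
for prefix, uri in NAMESPACES.items():
    ET.register_namespace(prefix, uri)

# Elements are created with fully qualified names; prefixes are resolved on output.
root = ET.Element("{%s}XMI" % NAMESPACES["xmi"])
ET.SubElement(root, "{%s}View" % NAMESPACES["cas"], {"sofa": "12", "members": "1 42"})
ET.SubElement(root, "{%s}DocumentAnnotation" % NAMESPACES["tcas"], {"begin": "0", "end": "5"})

xml_string = ET.tostring(root, encoding="unicode")
```

The resulting string declares each `xmlns:` prefix exactly once, and the child elements appear as `cas:View` and `tcas:DocumentAnnotation` without their own declarations.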

Add support for DKPro type system and annotations

It would be nice to be able to initialize a CAS with a certain type system, e.g.

from somewhere import DKProCoreTypeSystem
from cassis import Cas

cas = Cas(DKProCoreTypeSystem())

In order to verify that the DKPro type system is supported by cassis, the type system XML and an example CAS should be added to the tests.

Bug: cassis cannot be loaded

I have created a new virtualenv and installed cassis using the following commands:

>>>:~/ukp/test_new_cas$ virtualenv env --python=python3.6 
Running virtualenv with interpreter /usr/bin/python3.6
Using base prefix '/usr'
New python executable in /home/weaponxiii/ukp/test_new_cas/env/bin/python3.6
Also creating executable in /home/weaponxiii/ukp/test_new_cas/env/bin/python
Installing setuptools, pip, wheel...done.

>>>:~/ukp/test_new_cas$ source env/bin/activate

(env) >>>:~/ukp/test_new_cas$ python -m pip install git+https://github.com/dkpro/dkpro-cassis
Collecting git+https://github.com/dkpro/dkpro-cassis
  Cloning https://github.com/dkpro/dkpro-cassis to /tmp/pip-req-build-a3u2ig4q
Collecting lxml (from cassis==0.0.1)
  Using cached https://files.pythonhosted.org/packages/03/a4/9eea8035fc7c7670e5eab97f34ff2ef0ddd78a491bf96df5accedb0e63f5/lxml-4.2.5-cp36-cp36m-manylinux1_x86_64.whl
Collecting attrs (from cassis==0.0.1)
  Using cached https://files.pythonhosted.org/packages/3a/e1/5f9023cc983f1a628a8c2fd051ad19e76ff7b142a0faf329336f9a62a514/attrs-18.2.0-py2.py3-none-any.whl
Collecting sortedcontainers (from cassis==0.0.1)
  Using cached https://files.pythonhosted.org/packages/be/e3/a065de5fdd5849450a8a16a52a96c8db5f498f245e7eda06cc6725d04b80/sortedcontainers-2.0.5-py2.py3-none-any.whl
Collecting toposort (from cassis==0.0.1)
  Using cached https://files.pythonhosted.org/packages/e9/8a/321cd8ea5f4a22a06e3ba30ef31ec33bea11a3443eeb1d89807640ee6ed4/toposort-1.5-py2.py3-none-any.whl
Building wheels for collected packages: cassis
  Running setup.py bdist_wheel for cassis ... done
  Stored in directory: /tmp/pip-ephem-wheel-cache-j0ltlofg/wheels/c2/d1/6c/2b0e0b6c03d1d64f9782fb6e3b174abaaed41ff3c62b2cafa5
Successfully built cassis
Installing collected packages: lxml, attrs, sortedcontainers, toposort, cassis
Successfully installed attrs-18.2.0 cassis-0.0.1 lxml-4.2.5 sortedcontainers-2.0.5 toposort-1.5

and pip show also recognizes it:

(env) >>>:~/ukp/test_new_cas$ python -m pip show cassis
Name: cassis
Version: 0.0.1
Summary: UIMA CAS processing library in Python
Home-page: https://github.com/dkpro/dkpro-cassis
Author: Jan-Christoph Klie
Author-email: [email protected]
License: Apache License 2.0
Location: /home/weaponxiii/ukp/test_new_cas/env/lib/python3.6/site-packages
Requires: attrs, lxml, toposort, sortedcontainers
Required-by:

But for some reason it cannot be imported:

(env) >>>:~/ukp/test_new_cas$ python
Python 3.6.6 (default, Sep 12 2018, 18:26:19) 
[GCC 8.0.1 20180414 (experimental) [trunk revision 259383]] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import cassis
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'cassis'
>>> import imp
>>> imp.find_module('cassis')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/weaponxiii/ukp/test_new_cas/env/lib/python3.6/imp.py", line 297, in find_module
    raise ImportError(_ERR_MSG.format(name), name=name)
ImportError: No module named 'cassis'
>>> imp.find_module('json')
(None, '/usr/lib/python3.6/json', ('', '', 5))
>>>

I have tested the same procedure in this project to install and use other packages (e.g. h5py) successfully, but so far no luck with cassis.

XMI serialising throws if CAS contains reference to empty array

Describe the bug
I exported a simple test case from WebAnno 3.6 (no custom tags, only the default ones from WebAnno), which I read using cassis. Trying to serialise it again results in an error.

To Reproduce

from cassis import *

path = "path/to/test_case/"

with open(path + '/TypeSystem.xml', 'rb') as f:
    typesystem = load_typesystem(f)

with open(path + '/test_file.xmi', 'rb') as f:
    cas = load_cas_from_xmi(f, typesystem=typesystem)

serialised = cas.to_xmi(path=None)

If possible, attach a minimal file that triggers the error.
webanno_test_case.zip

Expected behavior
Serialised file returned without errors.

Error message

  File "<ipython-input-5-cf0db0819547>", line 12, in <module>
    serialised= cas.to_xmi(path=None)
  File "..../dkpro-cassis/cassis/cas.py", line 333, in to_xmi
    serializer.serialize(sink, self, pretty_print=pretty_print)
  File ".../dkpro-cassis/cassis/xmi.py", line 243, in serialize
    for annotation in sorted(self._find_all_fs(cas), key=lambda a: a.xmiID):
  File ".../dkpro-cassis/cassis/xmi.py", line 276, in _find_all_fs
    for referenced_fs in lst:
TypeError: 'NoneType' object is not iterable

Please complete the following information:

Version: 0.2.1
OS: Ubuntu 18 and MacOS

Additional context
Similar situation to #64 but with a different test case.
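
The traceback suggests that an array-valued feature which was never set holds `None` and is then iterated directly. A minimal sketch of a guard, using a hypothetical helper `iter_referenced` that is not part of cassis:

```python
def iter_referenced(feature_value):
    """Yield the feature structures referenced by an array-valued feature.

    A feature that was never set is represented here as None and is
    treated like an empty array instead of being iterated directly.
    """
    for referenced in feature_value or ():
        yield referenced


empty = list(iter_referenced(None))
filled = list(iter_referenced(["fs1", "fs2"]))
```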

Add proper DocumentMetaData support

Right now, DocumentMetaData is treated as an ordinary annotation. It would be better to parse it specially, right into fields that can be queried from the CAS. There are several reasons for this:

It should only exist once. Assuming it is always the first annotation makes implementation easier.
It is normally serialized as the first annotation. Adding extra handling makes it easier, because the other annotations are sorted by type name which is contradicted by DocumentMetaData when used as an ordinary annotation.
It is a predefined type, but sometimes not defined in the type system. Some extra care is needed to make sure it is only defined once but also serialized.

Allow circular dependencies between ranges and supertypes

Is your feature request related to a problem? Please describe.
Right now, the dkpro-xml type system cannot be parsed, as there is a circular dependency somewhere:

Circular dependencies exist among these items: 
{
    "org.dkpro.core.api.xml.type.XmlDocument": {"org.dkpro.core.api.xml.type.XmlElement"},
    "org.dkpro.core.api.xml.type.XmlElement": {"org.dkpro.core.api.xml.type.XmlNode"},
    "org.dkpro.core.api.xml.type.XmlNode": {"org.dkpro.core.api.xml.type.XmlElement"},
    "org.dkpro.core.api.xml.type.XmlTextNode": {"org.dkpro.core.api.xml.type.XmlNode"}
}

Describe the solution you'd like
Build it so it works.
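
One way to approach this, sketched below with the standard library's `graphlib` (Python 3.9+), is to order types by supertype edges only and resolve feature ranges afterwards. The split of the reported dependencies into supertypes and ranges is an assumption made for illustration, and the type names are shortened.

```python
from graphlib import CycleError, TopologicalSorter

# Assumed split of the reported dependencies into inheritance and feature ranges.
supertypes = {
    "XmlDocument": "Annotation",
    "XmlNode": "Annotation",
    "XmlElement": "XmlNode",
    "XmlTextNode": "XmlNode",
}
ranges = {
    "XmlDocument": {"XmlElement"},
    "XmlNode": {"XmlElement"},  # e.g. a reference to the parent element
}

# Sorting on both kinds of edges fails, because XmlElement and XmlNode depend on each other.
combined = {name: {parent} | ranges.get(name, set()) for name, parent in supertypes.items()}
try:
    list(TopologicalSorter(combined).static_order())
    has_cycle = False
except CycleError:
    has_cycle = True

# Sorting on supertype edges alone succeeds; ranges can be resolved afterwards.
order = list(TopologicalSorter({n: {p} for n, p in supertypes.items()}).static_order())
```

Inheritance cannot be circular, so an ordering over supertype edges always exists for a valid type system; range references only need the referenced type to be registered by the time features are attached.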

Add basic sofa/view support

Annotations should be added to a CAS/sofa, instead of specifying sofa as a parameter.
Allow getting sofas from a CAS
Documentation

Parsing Error with WebAnno UIMA XMI format

Hello, cassis team,

I need to parse, in Python, an XMI file produced by WebAnno.
I'm totally new to UIMA and the XMI format, so I was lucky to discover cassis. Thank you, developers.

Unfortunately, the code snippet provided (https://github.com/dkpro/dkpro-cassis#selecting-annotations) doesn't work well for the attached file below.

When I load the file with the 'load_cas_from_xmi()' method, the 'Cas' class initializes 'Cas._sofas' with a dictionary key of 1.
However, the right dictionary key for the attached example is 12.

How can I make the 'Cas' class use the right SofA key of 12?

Also, can the POS tags that are annotated on tokens be retrieved in the same way as described under 'selecting annotations'?
If cassis is not the best option for parsing the attached file, please recommend alternatives.

Thank you so much.
webanno629617483446633113export.zip

Allow Python 3.5

Right now, variable annotations are used, which are a feature of Python >= 3.6:

@attr.s(slots=True)
class Sofa:
    """Each CAS has one or more Subject of Analysis (SofA)"""

    sofaNum: int = attr.ib()  #: The sofaNum

From what I saw, this is the only thing that separates Python 3.5 from Python 3.6 support in cassis. It is possible to annotate these with type comments instead of annotation syntax. This should be done in order to support Python 3.5.
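
For illustration, comment-style type hints as described in PEP 484 look like the sketch below. It uses a plain class to stay self-contained; with attrs, the same `# type:` comment would presumably follow the `attr.ib()` call.

```python
class Sofa:
    """Each CAS has one or more Subject of Analysis (SofA)"""

    def __init__(self, sofaNum):
        # type: (int) -> None
        self.sofaNum = sofaNum  # type: int


sofa = Sofa(sofaNum=1)
```

Type comments are ordinary comments at runtime, so this form parses on Python 3.5 while still being understood by type checkers that support them.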

Swap order of assert arguments in unit tests

When doing asserts in unit tests with pytest, the order should be assert actual == expected. Right now, the order used is assert expected == actual which needs to be changed in order to make the error message correct.

Unexpected modification of existing document?

It looks like cassis is omitting existing XML namespace attributes. Also, it's rewriting xmi:id attribute of preexisting XML Nodes. Is this resulting from getting rid of some form of redundant information? Or is this a bug?

Take a look at the attached documents

prediction.zip

Implementing support for array features

I got the BioMedICUS annotations working with cassis.

Now, on to MetaMap. These have an odd way of representing arrays. For example, StringArray is represented as

<cas:StringArray xmi:id="1373">
    <elements>CSP</elements>
    <elements>LCH</elements>
    <elements>LCH_NW</elements>
    <elements>LNC</elements>
    <elements>MSH</elements>
    <elements>MTH</elements>
    <elements>NCI</elements>
    <elements>NCI_CDISC</elements>
    <elements>NCI_FDA</elements>
    <elements>NCI_NICHD</elements>
    <elements>SNMI</elements>
    <elements>SNOMEDCT_US</elements>
</cas:StringArray>

When processing the XMI, this throws the error that:

---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-1-94fafef6801b> in <module>()
     20 # 528715 737-v1
     21 with open(dir_test + '737-v1.txt.xmi', 'rb') as f:
---> 22     cas = load_cas_from_xmi(f, typesystem=typesystem)
     23     #print(cas.sofas)
     24     #print(dir(cas))

/anaconda3/lib/python3.7/site-packages/cassis/xmi.py in load_cas_from_xmi(source, typesystem)
     36         return deserializer.deserialize(BytesIO(source.encode("utf-8")), typesystem=typesystem)
     37     else:
---> 38         return deserializer.deserialize(source, typesystem=typesystem)
     39 
     40 

/anaconda3/lib/python3.7/site-packages/cassis/xmi.py in deserialize(self, source, typesystem)
     71                 views[proto_view.sofa] = proto_view
     72             else:
---> 73                 annotation = self._parse_annotation(typesystem, elem)
     74                 annotations[annotation.xmiID] = annotation
     75 

/anaconda3/lib/python3.7/site-packages/cassis/xmi.py in _parse_annotation(self, typesystem, elem)
    120         typename = elem.tag[9:].replace("/", ".").replace("ecore}", "")
    121 
--> 122         AnnotationType = typesystem.get_type(typename)
    123         attributes = dict(elem.attrib)
    124 

/anaconda3/lib/python3.7/site-packages/cassis/typesystem.py in get_type(self, typename)
    336             return self._types[typename]
    337         else:
--> 338             raise Exception("Type with name [{0}] not found!".format(typename))
    339 
    340     def get_types(self) -> Iterator[Type]:

Exception: Type with name [] not found!

Any suggestions on how to deal with this? Again, the NLP annotator is not in our control, so the data are what they are, for good or bad.

Thanks!
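
The empty type name in the error suggests that the nested, namespace-less `<elements>` children are being treated as feature structures of their own. A possible approach, sketched here with the standard library rather than the cassis internals, is to visit only the direct children of the XMI root and collect the `<elements>` text of each array. The XMI snippet is shortened from the example above.

```python
import xml.etree.ElementTree as ET

XMI = """
<xmi:XMI xmlns:xmi="http://www.omg.org/XMI" xmlns:cas="http:///uima/cas.ecore">
    <cas:StringArray xmi:id="1373">
        <elements>CSP</elements>
        <elements>LCH</elements>
        <elements>MSH</elements>
    </cas:StringArray>
</xmi:XMI>
"""

XMI_ID = "{http://www.omg.org/XMI}id"
STRING_ARRAY = "{http:///uima/cas.ecore}StringArray"

root = ET.fromstring(XMI)
arrays = {}
# Iterate over direct children only, so nested <elements> are never
# mistaken for feature structures of their own.
for elem in root:
    if elem.tag == STRING_ARRAY:
        arrays[int(elem.get(XMI_ID))] = [child.text for child in elem.findall("elements")]
```

Keying the arrays by their `xmi:id` makes it possible to resolve references to them from other feature structures in a later step.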

Allow leniency to CAS deserialization

Is your feature request related to a problem? Please describe.
Right now, when referencing a type in a CAS that is not in the type system, an error is thrown. It would be nice to specify that it should just be ignored.

It has to be determined whether this should be for types or also for features.

Describe the solution you'd like
Add a lenient flag like uimaj does.
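
A minimal sketch of what a lenient lookup could look like. The function `get_type`, its `lenient` parameter and `TypeNotFoundError` are hypothetical names chosen for this example and do not describe the cassis API.

```python
import warnings


class TypeNotFoundError(Exception):
    """Hypothetical error type used only in this sketch."""


def get_type(types, typename, lenient=False):
    """Look up `typename`; in lenient mode warn and return None instead of raising."""
    if typename in types:
        return types[typename]
    if lenient:
        warnings.warn("Type with name [{0}] not found, skipping".format(typename))
        return None
    raise TypeNotFoundError("Type with name [{0}] not found!".format(typename))


types = {"uima.tcas.Annotation": object()}
known = get_type(types, "uima.tcas.Annotation")
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    unknown = get_type(types, "custom.Missing", lenient=True)
```

A deserializer built on this would skip any element whose type lookup returns `None`; the same pattern could be applied to unknown features if leniency should extend to them.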
