julielab / jcore-base



Base modules of JCoRe

License: BSD 2-Clause "Simplified" License

TeX 2.79% Java 95.68% HTML 0.19% xBase 0.29% Scilab 0.01% Shell 0.45% Python 0.53% XSLT 0.06%


jcore-base's Issues

mst-parser: not working in cpe

... as the title says.
Either it has to do with some missing information in the CPE itself, or with the unconventional way of handling shared resources.
The pipeline crashes because it doesn't find the config file (that is only a deduction; nothing of the sort is actually reported).

streamlining ".gitignore"

Right now, the .gitignore file of the jcore-base parent folder only specifies "ignores" for its own level, and every subfolder has its own .gitignore that repeats similar "ignores" (like .classpath, .settings etc., i.e. IDE-specific files that shouldn't be committed).
However, it would be nice to have all these "global ignores" in the parent .gitignore and leave the project .gitignores for local "ignores" only. You can do this with the double-asterisk pattern, which the gitignore documentation describes as follows:

A slash followed by two consecutive asterisks then a slash matches zero or more directories. For example, "a/**/b" matches "a/b", "a/x/b", "a/x/y/b" and so on.
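
To make the quoted rule concrete, here is a minimal Python sketch that translates only the "/**/" construct into a regular expression and checks the example paths. It is not git's implementation and ignores every other gitignore feature; the function name is made up for illustration.

```python
import re


def double_star_to_regex(pattern):
    """Translate a pattern made of literal names and '/**/' into a regex.

    Illustration only: handles just the '/**/' construct from the quoted
    rule, not wildcards, negation, or any other gitignore syntax.
    """
    parts = pattern.split("/**/")
    # '/**/' stands for one slash, optionally followed by any number of
    # intermediate directories that each end in a slash.
    body = "/(?:.*/)?".join(re.escape(part) for part in parts)
    return re.compile("^" + body + "$")


matcher = double_star_to_regex("a/**/b")
for path in ["a/b", "a/x/b", "a/x/y/b", "a/xb", "ab"]:
    print(path, bool(matcher.match(path)))
```

In the parent .gitignore the same idea would let a single entry cover the IDE files of every subproject instead of repeating it per folder.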

future work: fix jcore-coordination-ae

postponed until we have some more wo_manpower

Move type system jcore-affect-type.xml to extensions

jcore-txt-consumer: add pos feature & README

Right now the recently added consumer just outputs sentences (one per line) and tokens (whitespace-separated).
It also needs an option to add POS information to each token if available (and desired), so that the POS tag is appended to each token with a designated delimiter (e.g. _ or |).
Furthermore, it needs a README :)
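
A rough sketch of the requested output format, written in Python for brevity rather than as the consumer's actual Java code. The function name, the `add_pos` switch and the tuple representation of tokens are assumptions made for this illustration.

```python
def format_sentence(tokens, add_pos=False, delimiter="_"):
    """Render one sentence as a whitespace-separated line of tokens.

    `tokens` is a sequence of (text, pos_tag) pairs; pos_tag may be None
    when no POS annotation is available for that token (assumed layout).
    """
    rendered = []
    for text, pos_tag in tokens:
        if add_pos and pos_tag is not None:
            # Append the tag with the configured delimiter, e.g. cell_NN
            rendered.append(text + delimiter + pos_tag)
        else:
            rendered.append(text)
    return " ".join(rendered)


sentence = [("The", "DT"), ("cell", "NN"), ("divides", None)]
print(format_sentence(sentence))
print(format_sentence(sentence, add_pos=True, delimiter="|"))
```

Tokens without a tag are written unchanged here; whether that or a placeholder tag is preferable is a design decision left open.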
@egrygorova: Please contact me when you're done with your former task so we can tackle this one.

jcore-txt-consumer: Change to logback

It's currently set to log4j.

file-reader: fix tests

The tests don't run because resource files are missing. I set the tests to ignore for the moment, which obviously shouldn't stay that way.

JSBD: Postprocessing

Right now, the postprocessing done by JSBD is hardcoded, as is the abbreviation lexicon.
The postprocessing steps were probably fitted to "biomedical English abstracts". This results in some issues with "medical German discharge summaries etc."

At least the abbreviation lexicon should be "outsourced".
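
One way the "outsourcing" could look, sketched in Python rather than in JSBD's Java: the abbreviations live in a plain text file (one entry per line) that is loaded at initialization. The file format, the comment syntax and both function names are assumptions, not JSBD's current behavior.

```python
from pathlib import Path


def load_abbreviations(lexicon_path):
    """Read one abbreviation per line; skip blank lines and '#' comments."""
    abbreviations = set()
    for line in Path(lexicon_path).read_text(encoding="utf-8").splitlines():
        entry = line.strip()
        if entry and not entry.startswith("#"):
            abbreviations.add(entry)
    return abbreviations


def ends_with_abbreviation(text_before_boundary, abbreviations):
    """True if the last whitespace-separated token is a known abbreviation.

    A postprocessing step could use this to undo a sentence split that
    was placed directly after an abbreviation.
    """
    tokens = text_before_boundary.split()
    return bool(tokens) and tokens[-1] in abbreviations
```

With this layout a German clinical lexicon and an English biomedical one become two files selected by a configuration parameter, instead of two code variants.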

gene-species-assigner: error

I get an error in Eclipse for this project (but there is no further indication of what the error might be). Deploying jcore-base-SNAPSHOT fails as well, at the point where the assigner is prepared.
Is it just me?

JEmAS too little information

Please write the README for JEmAS.

Add missing license, add list of third-party requirements and licenses in case they differ

As of Mar 24, 2017, our licensing policy is:

Software: permissive licenses (MIT or BSD) if possible
Resources: CC-BY if possible.

"
There is an ongoing internal debate in the group on whether source code (e.g., the JCoRe code we distribute via GitHub) must comply with the licenses of the employed libraries if those are not included in the distribution and are instead only referenced as a Maven dependency (which would mean that using the API in the source makes the source a derivative), or whether the source code constitutes a work of its own and only building the project merges source and libraries (thus constituting a derivative).
"

Anyway, I would recommend setting the BSD 2-Clause license now and adding a prominent note at the top of the README that third-party libraries might have different licenses -- better: make a list of all third-party dependencies and their respective licenses.

jnet tests not working

due to the special version of the uea-stemmer packaged in jcore-dependencies, old models are not compatible anymore (can't be loaded due to differences in the uea objects)

OpenNLP parser

opennlp parser needs some performance testing

BioNLP09 consumer cannot be created

When using the BioNLP09 consumer from a CPE descriptor in the following way

<casProcessor deployment="integrated" name="BioNLP09Writer"> -->
            <descriptor>
                <import name="de.julielab.jcore.consumer.bionlp09event.desc.jcore-bionlp09event-consumer" />
            </descriptor>
            <deploymentParameters />
            <configurationParameterSettings>
                <nameValuePair>
                    <name>Directory</name>
                    <value>
                        <string>st09-bionlp09-format</string>
                    </value>
                </nameValuePair>
            </configurationParameterSettings>
            <errorHandling>
                <errorRateThreshold action="terminate" value="10/1000" />
                <maxConsecutiveRestarts action="terminate"
                    value="30" />
                <timeout max="100000" default="-1" />
            </errorHandling>
            <checkpoint batch="1" time="1000ms" />
        </casProcessor>

on initialization of the CPE via the runCPE.sh script I get the message

<stack trace>
Caused by: java.lang.Exception: The component BioNLP09Writer cannot be created. (Thread Name: main)

PMC reader check section offset

When indexing for semedico, apparently every document prints a warning like this:

11:30:21.091 [[Procesing Pipeline#2 Thread]::] WARN d.j.j.c.e.AbstractFieldsGenerator - Section annotation in document 4321653 occured with begin=8006 and end=9270 (document text length: 9270). Ignoring

Check if the offsets of the sections are correct.
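
For orientation, a small Python sketch of the bounds check, using the numbers from the warning above. UIMA annotation end offsets are exclusive, so an end equal to the text length is legal. Whether the check in AbstractFieldsGenerator actually rejects that case is only a hypothesis here and has not been verified against the code; both function names are made up.

```python
def section_in_bounds(begin, end, text_length):
    """Bounds check for an annotation with an exclusive end offset."""
    return 0 <= begin <= end <= text_length


def section_in_bounds_strict(begin, end, text_length):
    """Hypothetical overly strict variant that rejects end == text_length."""
    return 0 <= begin <= end < text_length


# Values taken from the logged warning for document 4321653
print(section_in_bounds(8006, 9270, 9270))         # True
print(section_in_bounds_strict(8006, 9270, 9270))  # False
```

If the warning fires for sections that end exactly at the end of the document text, the offsets themselves may be fine and the comparison may be off by one.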

FileReader test failing

The test fails due to the existence of .gitignore in the input directory. Has been fixed in the 2.1.x-bugfixes branch which would have to be merged.

jcore mantra xml types unclear

There is no information about it.
Please write the README for this type system.

PMCReader: Set correct affiliation for authors

Currently only the reference ID is set.

Descriptor Versions wrong

Basically, the versions given in all UIMA descriptors are wrong (still at 2.0.0 or even 2.0.0-SNAPSHOT). There should be some sort of script that automatically updates those versions in all component and type descriptors.
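
A minimal sketch of such a script in Python. It assumes that each descriptor carries its version in a plain `<version>…</version>` element and that all such elements in a file should receive the same value; the function names and the `*.xml` file selection are assumptions that would need checking against the real descriptors.

```python
import re
from pathlib import Path

# Matches the content of a <version> element without touching the tags
VERSION_RE = re.compile(r"(<version>)[^<]*(</version>)")


def set_descriptor_version(xml_text, new_version):
    """Return xml_text with every <version> element set to new_version."""
    return VERSION_RE.sub(
        lambda m: m.group(1) + new_version + m.group(2), xml_text
    )


def update_descriptors(root_dir, new_version):
    """Rewrite all *.xml files below root_dir; return the changed paths."""
    changed = []
    for path in sorted(Path(root_dir).rglob("*.xml")):
        original = path.read_text(encoding="utf-8")
        updated = set_descriptor_version(original, new_version)
        if updated != original:
            path.write_text(updated, encoding="utf-8")
            changed.append(path)
    return changed
```

A call such as `update_descriptors("jcore-base", "2.1.0")` would also rewrite `<version>` elements in pom.xml files, so a real script should restrict the search to descriptor directories.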

BioNLP09Consumer: Create output dirs

Apparently the consumer does not create the output directory, or its parent directories, itself. This is awkward when using the default configuration. It should just work out of the box: output directories that do not exist should be created automatically.
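
The desired behavior, sketched in Python; in the Java consumer the counterpart would be a call such as `java.nio.file.Files.createDirectories` during initialization. The function name is made up for this illustration.

```python
import os


def ensure_output_dir(directory):
    """Create the directory and any missing parents; no error if present."""
    os.makedirs(directory, exist_ok=True)
    return directory
```

Calling this once before the first document is written makes the default configuration usable on a fresh checkout.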

Add the jules-ign-reader to JCoRe

Just convert it to the JCoRe conventions.

jcore-txt-consumer: add a descriptor

Create a descriptor

jcore-txt-consumer: allow plain CAS text output

Sometimes it is important to have access to the exact CAS string as it is stored within UIMA. For instance, when outputting the text with annotations in an offset-based manner, the original document string must not be changed.

JCoRe File Reader: adding features

The file reader could use a small feature that allows it to also browse through subdirectories.
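
A sketch of the file collection step with an optional recursive mode, in Python rather than the reader's Java. The parameter names are assumptions; skipping hidden files such as .gitignore is included because stray dotfiles in an input directory are easy to pick up by accident.

```python
from pathlib import Path


def list_input_files(input_dir, recurse=False, skip_hidden=True):
    """Collect regular files from input_dir, optionally descending into subdirs."""
    root = Path(input_dir)
    candidates = root.rglob("*") if recurse else root.glob("*")
    files = []
    for path in candidates:
        if not path.is_file():
            continue
        # Skip dotfiles and anything below a hidden directory
        relative_parts = path.relative_to(root).parts
        if skip_hidden and any(part.startswith(".") for part in relative_parts):
            continue
        files.append(path)
    return sorted(files)
```

Keeping recursion off by default preserves the current behavior for existing pipelines.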

JSBD: Set document structure borders

JSBD should accept a list of annotation types at whose boundaries sentences must always end (titles, sections etc.). This would fix a range of issues in full texts.
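
The idea can be sketched as a postprocessing step over character offsets, shown here in Python. It assumes that sentences are given as (begin, end) pairs with exclusive end and that the begin and end offsets of the configured structure annotations have already been collected into `borders`; none of these names come from JSBD.

```python
def split_at_borders(sentences, borders):
    """Split every sentence span that crosses a structure border.

    sentences: list of (begin, end) offsets, end exclusive.
    borders:   iterable of offsets at which a sentence must end,
               e.g. the begin and end offsets of title annotations.
    """
    cut_points = sorted(set(borders))
    result = []
    for begin, end in sentences:
        start = begin
        for cut in cut_points:
            # Only cut strictly inside the remaining span
            if start < cut < end:
                result.append((start, cut))
                start = cut
        result.append((start, end))
    return result


# A title ending at offset 20 that was merged with the first body sentence
print(split_at_borders([(0, 50), (51, 80)], [0, 20]))
```

Borders that coincide with an existing sentence boundary leave the sentences unchanged.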

can't find Sentence definition in jcore-types

LingPipeGazetteerAnnotator unclear

Too complicated to write the README properly.
Please review the README.

muc7-reader

added a script that converts sgml to xml files expected by the reader; however the script is just rudimentary right now, but works for all files of the following structure:

<DOC>
<DOCID> nyt960214.0765 </DOCID>
<STORYID cat=a pri=r> A4505 </STORYID>
<SLUG fv=ttx-z> BC-<COREF ID="1">PANTEX</COREF>-<COREF ID="3">FLIGHTS</COREF>-TEX </SLUG>
<DATE> <COREF ID="104">02-14</COREF> </DATE>
<NWORDS> 0535 </NWORDS>
<PREAMBLE>
[...]

The script takes the name of the file to convert as its argument:
python muc7_SGML2XML.py training.tr.keys.980410
and produces a file with the same name but an additional ".xml" ending.
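
One step such a conversion has to perform is quoting attribute values, since the SGML above uses unquoted values (cat=a pri=r) that XML does not allow. The following Python sketch shows only that step. It is not the code of muc7_SGML2XML.py, and the function name is made up.

```python
import re

# A tag is anything between '<' and '>' without nested angle brackets
TAG_RE = re.compile(r"<[^<>]+>")
# An attribute whose value does not start with a double quote
UNQUOTED_ATTR_RE = re.compile(r'(\w+)=([^\s">]+)')


def quote_attributes(sgml_line):
    """Wrap unquoted attribute values inside tags in double quotes."""
    def fix_tag(match):
        return UNQUOTED_ATTR_RE.sub(r'\1="\2"', match.group(0))
    return TAG_RE.sub(fix_tag, sgml_line)


print(quote_attributes("<STORYID cat=a pri=r> A4505 </STORYID>"))
```

Values that are already quoted, such as ID="1" in the COREF tags, are left untouched, and text outside of tags is not modified.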

But:
it seems the reader doesn't annotate coreferences in the CAS? Need to investigate!

jcore-txt-consumer: add document mode to readme

The current readme is some copy of the JSBD readme.

Redundant annotation type lemma?

In the type system (jcore-morpho-syntax-types.xml), "lemma" is a feature of the annotation type token. The value of the lemma feature has to be of type lemma which in turn has only one feature of type string.

Why not delete this intermediate lemma type and make the lemma feature of the token type a string right away?

jcore utilities no information

Please write the README for jCore utilities.

Diverging folder structure

The folder structure in jcore-base/jcore-opennlp-token-ae/ does not conform to most of the other components.

Also, the readme file needs editing ;-)

mstparser needs new models

mstparser now depends on trove4j and not jules-trove
therefore we need a new test-model
this will also affect the mstparser projects in jcore-projects
we need a new genia model there as well

modify pom file before next release/deploy; add 'missing' projects => branch: issue5

make sure that

javadoc/source packaging is moved to jcore-parent
~~JULIE Coordination Resolution is added~~
OpenNLP Constituency Parser is added

Gene Species Assigner: Syso to Log message

The log message "No organisms" is actually quite helpful during development. However, it is currently just a System.out.println. Please make that a logging message.

de.julielab.jcore.types.AbstractPart in jcore-type-priorities.xml

I am looking for the type de.julielab.jcore.types.AbstractPart mentioned in jcore-type-priorities.xml, but I cannot find it in the type systems. Did I miss it?

JCoRe Readme's!

@JULIELab/core Just a sidenote (but an important one): if you add a component to the repository or even if you "only" add features to a component, please create/update the readme's accordingly.

xmi consumer: tests

changed the class from Consumer to generic AE
need to fix some tests

xml mapper and xml reader README's

Please review those README's, thank you.

jsbd/jtbd: mallet

They both now use the "official mallet" artifact from Maven Central and not our "homebrewed" mallet version. The new performance for the FraMed corpus is documented. Need to check for e.g. Genia as well?!

jcore-stanford-lemmatizer...ae?

All (?) AEs have "ae" in their name, but the Stanford lemmatizer does not. Is there a particular reason for that? I think it was jcore-stanford-lemmatizer-ae in the past and has since been renamed. AE or not AE, that is the question!

Add an XMI reader

We have the writer in JCoRe, but we cannot easily read the XMIs. There is an XMI reader in Jules...

JEmAs -> jcore-projects

Just for the sake of completion and consistency (as discussed the other day): we should outsource the lexicon (plus descriptor) to a jcore-projects project; similar to e.g. jcore-jsbd has two referencing projects medical and biomedical.

modify bionlpst reader/consumer (?)

so that they're able to read all st13 information as well

missing tests

jcore iexml-{reader,consumer}