
jcore-base's People

Contributors

chlor, dependabot[bot], egrygorova, fmatthies, hellrich, khituras, phsieg, pikatech, svenbuechel


jcore-base's Issues

mst-parser: not working in cpe

As the title says.
Either it has to do with missing information in the CPE itself, or with the unconventional way of handling shared resources.
The pipeline crashes because it doesn't find the config file (that is only a deduction; nothing of the sort is reported).

streamlining ".gitignore"

Right now, the .gitignore file of the jcore-base parent folder only specifies "ignores" for its own level, and every subfolder has its own .gitignore that specifies similar "ignores" (like .classpath, .settings etc., i.e. IDE-specific files that shouldn't be committed).
However, it would be nice to have all these "global ignores" in the parent .gitignore and leave the project .gitignores for local "ignores" only. You can do this with the double-asterisk pattern; the gitignore documentation says:

A slash followed by two consecutive asterisks then a slash matches zero or more directories. For example, "a/**/b" matches "a/b", "a/x/b", "a/x/y/b" and so on.
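
To make the quoted rule concrete, here is a minimal Python sketch of the "/**/" matching semantics. The function name double_star_to_regex is hypothetical, and the sketch only handles literal path segments joined by "/**/", not the full gitignore syntax.

```python
import re


def double_star_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a pattern such as "a/**/b" into a compiled regex.

    Assumption: the pattern consists of literal segments separated by
    "/**/"; other gitignore wildcards (*, ?, [..]) are not handled.
    """
    # Escape each literal part so that "." or "+" are matched literally.
    literal_parts = [re.escape(part) for part in pattern.split("/**/")]
    # "/**/" stands for one slash plus zero or more "directory/" segments.
    return re.compile("/(?:[^/]+/)*".join(literal_parts))


matcher = double_star_to_regex("a/**/b")
for candidate in ["a/b", "a/x/b", "a/x/y/b", "a/xb"]:
    print(candidate, bool(matcher.fullmatch(candidate)))
```

The first three candidates are the ones named in the quoted rule; "a/xb" is included as a path that should not match.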

jcore-txt-consumer: add pos feature & README

Right now the recently added consumer just outputs sentences (one per line) and tokens (whitespace-separated).
It also needs a feature to add POS information to each token, if available (and desired), so that the POS tag is appended to each token with a designated delimiter (e.g. _ or |).
Furthermore, it needs a README :)
@egrygorova: Please contact me when you're done with your former task so we can tackle this one.
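
The requested output format could look roughly like the following Python sketch. It is only an illustration of the format, not the consumer's code; the names format_sentence, delimiter and with_pos are hypothetical.

```python
from typing import Iterable, Optional, Tuple

# A token is its covered text plus an optional POS tag (None if unavailable).
Token = Tuple[str, Optional[str]]


def format_sentence(tokens: Iterable[Token],
                    delimiter: str = "_",
                    with_pos: bool = True) -> str:
    """Render one sentence as a single line of whitespace-separated tokens.

    If with_pos is set and a token has a POS tag, the tag is appended
    to the token text with the given delimiter.
    """
    parts = []
    for text, pos in tokens:
        if with_pos and pos is not None:
            parts.append(f"{text}{delimiter}{pos}")
        else:
            parts.append(text)
    return " ".join(parts)


sentence = [("The", "DT"), ("cat", "NN"), ("sleeps", None)]
print(format_sentence(sentence))                 # The_DT cat_NN sleeps
print(format_sentence(sentence, delimiter="|"))  # The|DT cat|NN sleeps
print(format_sentence(sentence, with_pos=False)) # The cat sleeps
```

Tokens without a tag are written as plain text here; whether they should instead get a placeholder tag is a design decision for the consumer.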

file-reader: fix tests

The tests don't run because resource files are missing. I set the tests to be ignored for the moment, which obviously shouldn't stay that way.

JSBD: Postprocessing

Right now, the postprocessing done by JSBD is hardcoded, as is its abbreviation lexicon.
The postprocessing steps were probably fitted to "biomedical English abstracts". This results in some issues with "medical German discharge summaries etc."

At least the abbreviation lexicon should be "outsourced".
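
As a rough illustration of what "outsourcing" could mean, the Python sketch below reads abbreviations from an external resource, one entry per line. The one-per-line format, the "#" comment convention and the name load_abbreviations are assumptions, not JSBD's actual behavior.

```python
from typing import Iterable, Set


def load_abbreviations(lines: Iterable[str]) -> Set[str]:
    """Collect abbreviations from an iterable of lines.

    Assumed file format: one abbreviation per line; blank lines and
    lines starting with "#" are skipped.
    """
    abbreviations = set()
    for line in lines:
        entry = line.strip()
        if entry and not entry.startswith("#"):
            abbreviations.add(entry)
    return abbreviations


# In practice the lines would come from a file shipped per language or
# domain, e.g. open("abbreviations.txt", encoding="utf-8") (hypothetical name).
german_lines = ["# medical German\n", "Dr.\n", "\n", "z.B.\n"]
print(sorted(load_abbreviations(german_lines)))
```

With such a resource, a German medical model and an English biomedical model could each ship their own lexicon without changes to the code.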

gene-species-assigner: error

I get an error in Eclipse for this project (but there is no further indication of what this error might be). When I try to deploy jcore-base-SNAPSHOT, it also throws an error once it gets to preparing the assigner.
Is it just me?

Add missing license, add list of third-party requirements and licenses in case they differ

As of Mar 24, 2017, our licensing policy is:

  • Software: permissive licenses (MIT or BSD) if possible
  • Resources: CC-BY if possible.

"
There is an ongoing group-internal debate on whether source code (e.g., the JCoRe code we distribute via GitHub) must comply with the licenses of the employed libraries if those are not included in the distribution and are instead only referenced as a Maven dependency (that would mean that using the API in the source makes the source a derivative), or whether the source code constitutes a work of its own and only building the project merges source and libraries (thus constituting a derivative).
"

Anyway, I would recommend setting the BSD 2-clause license now and putting a prominent note at the top of the README that third-party libraries might have different licenses. Better still: make a list of all third-party dependencies and their respective licenses.

jnet tests not working

Due to the special version of the uea-stemmer packaged in jcore-dependencies, old models are no longer compatible (they can't be loaded because of differences in the uea objects).

BioNLP09 consumer cannot be created

When using the BioNLP09 consumer from a CPE descriptor in the following way

<casProcessor deployment="integrated" name="BioNLP09Writer">
    <descriptor>
        <import name="de.julielab.jcore.consumer.bionlp09event.desc.jcore-bionlp09event-consumer" />
    </descriptor>
    <deploymentParameters />
    <configurationParameterSettings>
        <nameValuePair>
            <name>Directory</name>
            <value>
                <string>st09-bionlp09-format</string>
            </value>
        </nameValuePair>
    </configurationParameterSettings>
    <errorHandling>
        <errorRateThreshold action="terminate" value="10/1000" />
        <maxConsecutiveRestarts action="terminate" value="30" />
        <timeout max="100000" default="-1" />
    </errorHandling>
    <checkpoint batch="1" time="1000ms" />
</casProcessor>

on initialization of the CPE via the runCPE.sh script I get the message

<stack trace>
Caused by: java.lang.Exception: The component BioNLP09Writer cannot be created. (Thread Name: main)

PMC reader check section offset

When indexing for semedico, apparently every document prints a warning like this:

11:30:21.091 [[Procesing Pipeline#2 Thread]::] WARN d.j.j.c.e.AbstractFieldsGenerator - Section annotation in document 4321653 occured with begin=8006 and end=9270 (document text length: 9270). Ignoring

Check if the offsets of the sections are correct.
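
One thing worth checking, sketched here in Python under the assumption that end offsets are exclusive (as in UIMA's getCoveredText(), which takes the substring from begin to end): an end offset equal to the document text length is then a legal value. The function name is_valid_span is hypothetical, and this is not the check used in AbstractFieldsGenerator.

```python
def is_valid_span(begin: int, end: int, text_length: int) -> bool:
    """Return True if [begin, end) lies within a text of the given length.

    Assumption: end is exclusive, so end == text_length is allowed.
    """
    return 0 <= begin <= end <= text_length


# Values taken from the warning above.
print(is_valid_span(8006, 9270, 9270))  # True
# One character past the end of the text would be invalid.
print(is_valid_span(8006, 9271, 9270))  # False
```

If the section in the warning is rejected only because its end equals the text length, the comparison in the check may be stricter than necessary; if the offsets are wrong for other reasons, the reader is the place to look.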

FileReader test failing

The test fails due to the existence of .gitignore in the input directory. This has been fixed in the 2.1.x-bugfixes branch, which still has to be merged.

Descriptor Versions wrong

Basically, the versions mentioned in all UIMA descriptors are wrong (still at 2.0.0 or even 2.0.0-SNAPSHOT). There should be some sort of script that automatically adapts those versions for all component and type descriptors.
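
Such a script could start from something like the Python sketch below. It assumes the version is stored in a <version> element of each descriptor and rewrites every such element it finds; the function name set_descriptor_version and the target version are only examples.

```python
import re

# Matches a <version> element with plain text content.
VERSION_RE = re.compile(r"(<version>)[^<]*(</version>)")


def set_descriptor_version(xml_text: str, new_version: str) -> str:
    """Replace the content of every <version> element in the descriptor text."""
    return VERSION_RE.sub(
        lambda m: f"{m.group(1)}{new_version}{m.group(2)}", xml_text
    )


descriptor = (
    "<analysisEngineMetaData>"
    "<name>example</name>"
    "<version>2.0.0-SNAPSHOT</version>"
    "</analysisEngineMetaData>"
)
print(set_descriptor_version(descriptor, "2.3.0"))
```

A real script would walk all descriptor files of the repository and write the result back; it would also need care if a descriptor contains <version> elements that must not be changed.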

BioNLP09Consumer: Create output dirs

The consumer apparently does not create the output directory (or its parent directories) itself. This is weird when using the default configuration. It should just work out of the box: the output directories should be created automatically if they do not exist.
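
The desired behavior, sketched in Python for illustration (in the Java consumer, java.nio.file.Files.createDirectories would be one way to get the same effect). The function name ensure_output_dir is hypothetical.

```python
from pathlib import Path


def ensure_output_dir(directory: str) -> Path:
    """Create the output directory and any missing parents.

    Does nothing if the directory already exists.
    """
    path = Path(directory)
    path.mkdir(parents=True, exist_ok=True)
    return path
```

Calling this once during initialization, before the first document is written, would make the default configuration work without manual setup.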

jcore-txt-consumer: allow plain CAS text output

Sometimes it is important to have access to the exact CAS string as it is stored within UIMA. For instance, when outputting the text with annotations in an offset-based manner, the original document string must not be changed.
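
A small Python sketch of why this matters: offset-based annotations only resolve correctly against the unchanged text. The example text and annotations are made up for illustration.

```python
# Original text with a double space and a newline, as it might be stored in the CAS.
text = "BRCA1  binds\nRAD51."
# (begin, end, type) with exclusive end offsets.
annotations = [(0, 5, "Gene"), (13, 18, "Gene")]


def covered(document_text: str, begin: int, end: int) -> str:
    """Return the text span an annotation refers to."""
    return document_text[begin:end]


# Whitespace normalization changes the text length and shifts later offsets.
normalized = " ".join(text.split())

for begin, end, label in annotations:
    print(label, repr(covered(text, begin, end)), repr(covered(normalized, begin, end)))
```

Against the original text both annotations resolve to gene names; against the normalized text the second one no longer does.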

JSBD: Set document structure borders

JSBD should accept a list of annotation types at whose boundaries sentences must always end (titles, sections etc.). This would fix a range of issues in full texts.
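
One possible way to realize this as a postprocessing step, sketched in Python: split every detected sentence at the offsets where a structure annotation begins or ends. This is only an illustration of the idea, not JSBD code; split_at_borders and the offset representation are assumptions.

```python
from typing import Iterable, List, Tuple

Span = Tuple[int, int]  # (begin, end), end exclusive


def split_at_borders(sentences: Iterable[Span], borders: Iterable[int]) -> List[Span]:
    """Split sentence spans so that no sentence crosses a border offset."""
    border_list = sorted(set(borders))
    result = []
    for begin, end in sentences:
        start = begin
        # Only borders strictly inside the sentence force a split.
        for cut in border_list:
            if begin < cut < end:
                result.append((start, cut))
                start = cut
        result.append((start, end))
    return result


# A title ending at offset 12 was merged with the first sentence of the body.
print(split_at_borders([(0, 40), (40, 60)], [12, 40]))
# [(0, 12), (12, 40), (40, 60)]
```

The border offsets would be collected from the begin and end positions of the configured annotation types.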

muc7-reader

I added a script that converts SGML files to the XML files expected by the reader. The script is still rudimentary, but it works for all files of the following structure:

<DOC>
<DOCID> nyt960214.0765 </DOCID>
<STORYID cat=a pri=r> A4505 </STORYID>
<SLUG fv=ttx-z> BC-<COREF ID="1">PANTEX</COREF>-<COREF ID="3">FLIGHTS</COREF>-TEX </SLUG>
<DATE> <COREF ID="104">02-14</COREF> </DATE>
<NWORDS> 0535 </NWORDS>
<PREAMBLE>
[...] 

The script takes the name of the file to convert as its argument:
python muc7_SGML2XML.py training.tr.keys.980410
It produces a file with the same name plus an additional ".xml" ending.
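
One step such a conversion needs, judging from the structure shown above, is quoting attribute values like cat=a, which are legal in SGML but not in XML. The Python sketch below shows just that step; it is not the code of muc7_SGML2XML.py, and the name quote_attributes is hypothetical.

```python
import re

# Any tag, e.g. <STORYID cat=a pri=r>.
TAG_RE = re.compile(r"<[^<>]+>")
# An attribute whose value is not quoted, e.g. " cat=a".
UNQUOTED_ATTR_RE = re.compile(r"""(\s[\w-]+)=([^\s"'>]+)(?=[\s>])""")


def quote_attributes(line: str) -> str:
    """Put double quotes around unquoted attribute values inside tags."""
    def fix_tag(match: "re.Match[str]") -> str:
        return UNQUOTED_ATTR_RE.sub(r'\1="\2"', match.group(0))
    return TAG_RE.sub(fix_tag, line)


print(quote_attributes("<STORYID cat=a pri=r> A4505 </STORYID>"))
print(quote_attributes('<DATE> <COREF ID="104">02-14</COREF> </DATE>'))
```

Already quoted attributes such as ID="104" are left untouched. A complete converter would also have to deal with other SGML features, such as unescaped special characters, which this sketch ignores.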

But:
it seems the reader doesn't annotate coreferences in the CAS? Need to investigate!

Redundant annotation type lemma?

In the type system (jcore-morpho-syntax-types.xml), "lemma" is a feature of the annotation type token. The value of the lemma feature has to be of type lemma which in turn has only one feature of type string.

Why not delete this intermediate lemma type and make the lemma feature of the token type a string right away?

Diverging folder structure

The folder structure in jcore-base/jcore-opennlp-token-ae/ does not conform to most of the other components.

Also, the readme file needs editing ;-)

mstparser needs new models

mstparser now depends on trove4j and no longer on jules-trove.
Therefore we need a new test model.
This will also affect the mstparser projects in jcore-projects;
we need a new Genia model there as well.

JCoRe Readme's!

@JULIELab/core Just a side note (but an important one): if you add a component to the repository, or even if you "only" add features to a component, please create or update the readmes accordingly.

jsbd/jtbd: mallet

Both now use the "official mallet" artifact from Maven Central and not our "homebrewed" mallet version. The new performance for the FraMed corpus is documented. Do we need to check e.g. Genia as well?

jcore-stanford-lemmatizer...ae?

All (?) AEs have "ae" in their name, but the Stanford lemmatizer does not. Is there a particular reason for that? I think it was jcore-stanford-lemmatizer-ae in the past and has since been changed. AE or not AE, that is the question!

Add an XMI reader

We have the writer in JCoRe, but we cannot easily read the XMIs. There is an XMI reader in Jules...

JEmAs -> jcore-projects

Just for the sake of completeness and consistency (as discussed the other day): we should outsource the lexicon (plus descriptor) to a jcore-projects project, similar to how jcore-jsbd has two referencing projects, medical and biomedical.

retrain jsbd, jtbd & jpos models

Due to changes in the names of some packages, we need to retrain the models (biomed & medical) and deploy them with new projects for 2.3.0:

  • jsbd FRAMED
  • jtbd FRAMED
  • jpos FRAMED
  • jsbd BioMed
  • jtbd BioMed

trove4j version conflict

jcore-opennlp-token-ae depends on trove4j 3.0.3.
jcore-jtbd-ae depends on trove4j 2.0.2 through jcore-mallet 2.0.9.

This causes java.lang.ClassNotFoundException: gnu.trove.TObjectIntHashMap when both tokenizers are used in one project.

descriptor "path" in readme

For easier reference, add the descriptor path of a component (e.g. de.julielab.jcore.ae.[...]) to the top line of each readme.
