europeananp-ner's Introduction

Named Entity Recognition Tool for Europeana Newspapers

This tool takes container documents (MPEG21-DIDL, METS), parses all references to ALTO files, and tries to find named entities on the pages (with most models: Location, Person, Organisation, Misc). The aim is to keep the physical location on the page available throughout the whole process, so that the results can be highlighted in a viewer.

Read more about it on the KBNLresearch blog.

Stanford NER is used for tagging. The goal during development was 'loose coupling', which enables us to quickly benefit from upstream development. Most of the development is done at the research department of the KB, the national library of the Netherlands. If you are looking for a project with deeper integration into the core of Stanford NER, take a peek at INL-NERT, the project from our colleagues at the INL, the Institute for Dutch Lexicology. Although the two are separate branches now, there is a desire to integrate them in the future.

This version is no longer maintained; for a maintained version, go here: https://github.com/EuropeanaNewspapers/ner-app

Input formats

The following input formats are implemented:

  • ALTO 1.0
  • HTML
  • METS
  • MPEG21 DIDL
  • Text

Output formats

The following output formats are implemented:

  • Log (default)
  • CSV
  • HTML
  • Database (db)
  • ALTO (plus the 2.1 and 3 variants)
  • BIO

Building

Building from source:

Install Maven and Java (1.7 or later). Clone the source from GitHub, and in the top-level directory run:

mvn package

This command generates a JAR and a WAR of the NER tool in the target/ directory. To deploy the WAR, copy it into the Tomcat webapps directory, or use the Tomcat manager to do it for you.

Or, to get going quickly (on *nix systems), run:

git clone https://github.com/KBNLresearch/europeananp-ner.git
cd europeananp-ner/
./go.sh

Usage command-line-interface

Invoking help:

java -jar NerAnnotator.jar --help

usage: java -jar NerAnnotator.jar [OPTIONS] [INPUTFILES..]
-c,--container <FORMAT>             Input type: mets (Default), didl,
                                    alto, text, html
-d,--output-directory <DIRECTORY>   output DIRECTORY for result files.
                                    Default ./output
-f,--export <FORMAT>                Output type: log (Default), csv,
                                    html, db, alto, alto2_1, alto3, bio.
                                    Multiple formats: "-f html -f csv"
-l,--language <ISO-CODE>            use two-letter ISO-CODE for language
                                    selection: en, de, nl ....
-m,--models <language=filename>     models for languages. Ex. -m
                                    de=/path/to/file/model_de.gz -m
                                    nl=/path/to/file/model_nl.gz
-n,--nthreads <THREADS>             maximum number of threads to be used
                                    for processing. Default 8

If there are no input files specified, a list of file names is read from stdin.
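The stdin fallback can be sketched as follows (a minimal illustration, not the tool's actual code; the function name resolve_inputs is hypothetical):

```python
def resolve_inputs(args, stdin_lines):
    """Mimic the CLI behaviour: use file names given as arguments if any,
    otherwise read one file name per line from standard input."""
    if args:
        return list(args)
    return [line.strip() for line in stdin_lines if line.strip()]

# No arguments given -> names come from stdin, one per line
print(resolve_inputs([], ["page1.xml\n", "page2.xml\n"]))  # → ['page1.xml', 'page2.xml']
```

From a shell this corresponds to piping a file list in, e.g. `find . -name '*_alto.xml' | java -jar NerAnnotator.jar -c alto`.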

Example invocation for classification of german_alto.xml:

java -Xmx800m -jar NerAnnotator.jar -c mets -f alto -l de -m de=./test-files/german.ser.gz -n 2 ./test-files/german_alto.xml

The given example takes the language model called 'german.ser.gz' and applies it to 'german_alto.xml', using 2 threads and container type METS.

Usage web-interface

To run the web interface standalone:

mvn jetty:run

This will try to bind to port 8080, using Jetty.

Once deployed to Tomcat, the following applies. The default configuration (as well as the test classifiers) resides in src/main/resources/config.ini; this file references the available classifiers.

See the provided sample for some default settings. Once opened in a browser, the landing page of the application shows the available options. After deployment, config.ini and the classifiers end up in WEB-INF/classes/.
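The exact layout of config.ini is not documented here; as a purely hypothetical sketch (the section and key names are illustrative only, consult the shipped sample for the real format), it maps languages to classifier files along these lines:

```ini
; hypothetical sketch -- consult the shipped config.ini for the real keys
[models]
nl = /path/to/eunews_dutch.crf.gz
de = /path/to/eunews_german.crf.gz
```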

Working with classifiers and binary model generation

To be able to compare your results with a baseline we provide some test files located in the test-files directory.

To run a back-to-front test try:

cd test-files; ./test_europeana_ner.sh

The output should look something like:

Generating new classification model. (de)
-rw-rw-r-- 1 aloha aloha 1.4M Sep 11 15:55 ./eunews_german.crf.gz

real	0m3.984s
user	0m5.452s
sys	0m0.235s
Applying generated model (de).

Results:
    Locations: 4
    Organizations: 0
    Persons: 1071

real	0m13.512s
user	0m17.771s
sys	0m0.336s

Generating new classification model. (nl)
-rw-rw-r-- 1 aloha aloha 1.7M Sep 11 15:56 ./eunews_dutch.crf.gz

real	0m8.816s
user	0m10.437s
sys	0m0.371s
Applying generated model (nl).

Results:
    Locations: 1
    Organizations: 8
    Persons: 0

real	0m5.048s
user	0m9.278s
sys	0m0.233s

To generate a binary classification model, use the following command:

cd test-files; java -Xmx5G -cp ../target/NerAnnotator-0.0.2-SNAPSHOT-jar-with-dependencies.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop austen_dutch.prop

This should result in a file called eunews_dutch.crf.gz with a file size of roughly 1 MB.

To verify the NER software, use the created classifier to process the provided example file.

cd test-files; java -jar ../target/NerAnnotator-0.0.2-SNAPSHOT-jar-with-dependencies.jar -c alto -d out -f alto -l nl -m nl=./eunews_dutch.crf.gz -n 8 ./dutch_alto.xml

This results in a directory called out containing ALTO files with inline annotations.

General remarks on binary classification model generation

The process of generating a binary classification model is a delicate one. The input .bio file needs to be as clean as possible to avoid garbage in, garbage out. Therefore, apply noise filters while creating .bio files.
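The kind of noise filtering meant here can be sketched as follows (an illustrative example, not part of the tool; the thresholds are arbitrary). A .bio file holds one token and label per line, with blank lines separating sentences:

```python
def filter_bio(lines, min_tokens=3, min_entities=1):
    """Keep only sentences with at least `min_tokens` tokens and at
    least `min_entities` non-O labels; yields the surviving lines."""
    sentence = []
    for line in list(lines) + [""]:           # sentinel flushes the last sentence
        if line.strip():
            sentence.append(line)
            continue
        tokens = [l.split() for l in sentence]
        entities = sum(1 for t in tokens if len(t) > 1 and t[1] != "O")
        if len(tokens) >= min_tokens and entities >= min_entities:
            yield from sentence
            yield ""                          # keep the sentence separator
        sentence = []

sample = ["Jane B-PER", "visited O", "Amsterdam B-LOC", "", "noise O", ""]
print(list(filter_bio(sample)))  # → ['Jane B-PER', 'visited O', 'Amsterdam B-LOC', '']
```

Here the one-token "sentence" is dropped as noise while the well-formed sentence survives.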

Gazetteers greatly improve the quality of your classification process, but a big model in memory may slow down processing. Overall, there is a strong correlation between model size and performance.

The Stanford NER package offers a lot of settings that influence the binary model generation process. These settings can be configured in austen.prop. For more information on the Stanford settings, see the Stanford NER FAQ.
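For orientation, a minimal training properties file in the style of the Stanford NER FAQ looks roughly like this (file names are placeholders; the feature flags shown follow the FAQ example and are not necessarily those used in austen_dutch.prop):

```properties
# placeholders -- adapt file names to your own data
trainFile = dutch_training.bio
serializeTo = eunews_dutch.crf.gz
map = word=0,answer=1

useClassFeature = true
useWord = true
useNGrams = true
noMidNGrams = true
maxNGramLeng = 6
usePrev = true
useNext = true
useSequences = true
usePrevSequences = true
maxLeft = 1
useTypeSeqs = true
useTypeSeqs2 = true
useTypeySequences = true
wordShape = chris2useLC
```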

Binary classification models generated with this tool are fully compatible with the upstream version of the Stanford NER.

europeananp-ner's People

Contributors

cneud, stitchplus

europeananp-ner's Issues

Bio as output format

Support the BIO output format directly, without an intermediate step.

It should support tweaking parameters from the command line (such as minimum sentence length and minimum entity count) to filter out noise.

Update master

When the current 0.0.2 branch has stabilised, we should replace the master branch with it. In the meantime the package name has changed, so to keep the history in good shape, it would be best to follow what is described here: http://stackoverflow.com/a/2763118/1919465.

Tag reference bug.

In some cases the ID Tag0 gets handed out multiple times; we have to figure out why this happens and fix it.

Replace jsoup

Really, jsoup, which is currently used for parsing the ALTO, should be replaced with a library for XML manipulation such as org.jdom2 or similar.

ALTO Export

The tool rewrites the ALTO tree as a new XML document after recognition. This can cause undesired differences (e.g. line breaks) from the original document. Ideally the tool should only update the original input ALTO document with the additional NER information.

Update readme

Readme should be updated with options and pointers to training data / classifiers.

Hyphenation error

Strings that are hyphenated, i.e. split into two separate ALTO String blocks, are not tagged correctly (the first one is tagged, the second one skipped).
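In ALTO, a hyphenated word is typically stored as two String elements carrying SUBS_TYPE="HypPart1"/"HypPart2", with the dehyphenated word in SUBS_CONTENT. A fix could merge the parts before tagging, along these lines (an illustrative sketch over plain dicts, not the tool's actual code):

```python
def merge_hyphenated(strings):
    """Merge consecutive HypPart1/HypPart2 String dicts into one token,
    preferring SUBS_CONTENT as the dehyphenated word (illustrative only)."""
    merged, i = [], 0
    while i < len(strings):
        s = strings[i]
        if (s.get("SUBS_TYPE") == "HypPart1" and i + 1 < len(strings)
                and strings[i + 1].get("SUBS_TYPE") == "HypPart2"):
            word = s.get("SUBS_CONTENT") or s["CONTENT"] + strings[i + 1]["CONTENT"]
            merged.append(word)
            i += 2
        else:
            merged.append(s["CONTENT"])
            i += 1
    return merged

parts = [{"CONTENT": "Amster", "SUBS_TYPE": "HypPart1", "SUBS_CONTENT": "Amsterdam"},
         {"CONTENT": "dam", "SUBS_TYPE": "HypPart2", "SUBS_CONTENT": "Amsterdam"},
         {"CONTENT": "ligt"}]
print(merge_hyphenated(parts))  # → ['Amsterdam', 'ligt']
```

The merged token can then be tagged once and the entity reference attached to both original String elements.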

Logging

Logging is currently mixed: some code uses System.out.printf and some uses logging.info. These should be merged into the logging.info style.

NER generation metadata to output

The metadata on generation of the entities should be reflected in the output XML.

The following should be added (as a comment line) to the output XML:

  • Version of NER tool
  • Checksum of Java binary NER tool
  • Filename (and path) of used classifier
  • Checksum of used classifier
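As a sketch, such a comment block in the output XML could look like this (all values are placeholders):

```xml
<!--
  NER tool version:     0.0.2-SNAPSHOT
  NER tool checksum:    sha1:<checksum of the JAR>
  Classifier:           /path/to/eunews_dutch.crf.gz
  Classifier checksum:  sha1:<checksum of the classifier>
-->
```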

Mavenize INL stanford as external dependency

The modified INL stanford-corenlp must be refactored into a different package name to avoid namespace conflicts with the vanilla stanford-corenlp, and the artifacts must be published via Maven.

Alto 3

Remove the legacy 2_1 output and implement ALTO 3.

ALTO 2.1

ALTO version 2.1 is supposed to introduce a valid mechanism for encoding named entities in ALTO. As soon as this is official, ALTO 2.1 should be implemented as an output format and supersede the current ALTO-with-Alternatives workaround. See http://www.loc.gov/standards/alto/.

Errors in plain text processing

Processing a plain text input file (using either html or bio as output format) fails with the following stacktrace:

java -Xmx800m -jar NerAnnotator-0.0.2-SNAPSHOT-jar-with-dependencies.jar -c text -f html -l fr -m fr=eunews.fr.crf.gz -n 1 test.txt
Container format: text
Loading language model for Französisch fr -> eunews.fr.crf.gz
Done loading classifier
test.txt
Processing TEXT-File test.txt
Trying to process ALTO file file:/c:/ENP/test/test.txt
java.lang.NullPointerException
    at nl.kbresearch.europeana_newspapers.NerAnnotator.output.HtmlResultHandler.addToken(HtmlResultHandler.java:78)
    at nl.kbresearch.europeana_newspapers.NerAnnotator.alto.TxtProcessor.handlePotentialTextFile(TxtProcessor.java:83)
    at nl.kbresearch.europeana_newspapers.NerAnnotator.container.TextProcessor.processFile(TextProcessor.java:37)
    at nl.kbresearch.europeana_newspapers.NerAnnotator.container.ContainerHandleThread.call(ContainerHandleThread.java:40)
    at nl.kbresearch.europeana_newspapers.NerAnnotator.container.ContainerHandleThread.call(ContainerHandleThread.java:15)
    at java.util.concurrent.FutureTask.run(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
Total processing time: 8718
0 container documents successfully processed, 1 with errors.
There were errors while processing.

Encoding issue

CONTENT="Bote &quot;.)"

Gets into the output as:

<NamedEntityTag ID="Tag4" LABEL="Bote") Wien Oktober Europa"

This needs fixing.
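The expected behaviour can be illustrated with standard XML escaping: the &quot; entity in the CONTENT attribute should survive the round trip into the output attribute instead of breaking it (illustrative Python using the standard library, not the tool's code):

```python
from xml.sax.saxutils import unescape, quoteattr

raw_attr = 'Bote &quot;.)'                      # as it appears in the ALTO source
text = unescape(raw_attr, {"&quot;": '"'})      # the parsed attribute value
print(text)                                     # → Bote ".)
print(quoteattr(text))                          # re-quoted safely for output XML
```

Writing attributes through a routine like quoteattr (or any real XML serializer) keeps the embedded quote from truncating the LABEL attribute.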
