refext (reference extraction)

General

The goal of refext is to extract reference strings from research papers in PDF format. It builds on CERMINE, which extracts text lines from PDFs together with some layout information. refext uses supervised conditional random fields (CRFs) to detect reference strings directly, without first identifying the reference section through heuristics.

Usage

There are two ways of using refext to extract reference strings from PDF publications: executing Main via the command line, or calling the ReferenceExtractor class directly from Java code.

Via Command line

The Main class provides a number of parameters. Use the "-h" parameter to get an overview:

Usage: <main class> [options]
  Options:
  * -crfModel, --crf-model-path
      File containing a CRF model (see SupervisedCrfTrainer)
    -bibtex, --extract-bibtex-references
      will extract bibtex references
      Default: false
    -refs, --extract-reference-strings
      will extract reference strings
      Default: false
    -h, --help
      print information about available parameters
      Default: false
    -pdf, --input-pdf-path
      File or directory containing PDFs
    -layout, --layout-path
      File or directory where files contain lines and layout information (see
      CermineLineLayoutExtractor)
  * -outputDir, --output-directory
      Directory to store the output
    -sizeLimit, --pdf-file-size-limit
      limit in byte for pdf files
      Default: 10000000
    -skipExist, --skip-existing-ouput-files
      will skip files for which there is already an output file
      Default: false

Entries marked with a * are required. In addition, either the -bibtex or the -refs option needs to be set. The input is either PDF files or layout files extracted with CermineLineLayoutExtractor.
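
The constraints above can be sketched as a small pre-flight check. This is a minimal sketch and not refext's own argument handling: the class OptionCheck and its method problems are hypothetical names, and the rule that one of -pdf or -layout must be present is an assumption read from the sentence about input files.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of refext: checks an argument list against
// the constraints described in the help text above.
public class OptionCheck {

    /** Returns one message per violated constraint; an empty list means the options look consistent. */
    public static List<String> problems(String... args) {
        List<String> given = Arrays.asList(args);
        List<String> found = new ArrayList<>();
        if (!has(given, "-crfModel", "--crf-model-path")) {
            found.add("missing required option -crfModel");
        }
        if (!has(given, "-outputDir", "--output-directory")) {
            found.add("missing required option -outputDir");
        }
        if (!has(given, "-bibtex", "--extract-bibtex-references")
                && !has(given, "-refs", "--extract-reference-strings")) {
            found.add("set at least one of -bibtex or -refs");
        }
        // Assumption: some input source (PDFs or layout files) has to be given.
        if (!has(given, "-pdf", "--input-pdf-path")
                && !has(given, "-layout", "--layout-path")) {
            found.add("set one of -pdf or -layout");
        }
        return found;
    }

    private static boolean has(List<String> given, String shortName, String longName) {
        return given.contains(shortName) || given.contains(longName);
    }

    public static void main(String[] args) {
        List<String> found = problems(args);
        if (found.isEmpty()) {
            System.out.println("options look consistent");
        } else {
            found.forEach(System.err::println);
        }
    }
}
```

The check only looks at option names, so it does not verify that paths exist or that a value follows each option.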

The two main ways to use this main class are:

Via maven exec

  1. In the project root, run mvn compile
  2. To execute the reference extraction, run mvn exec:java -Dexec.mainClass="de.exciteproject.refext.Main" -Dexec.args="[add parameters listed above here]"

By creating a jar file with maven

  1. In the project root, run mvn package
  2. Use the generated jar file with java -jar refext-[version]-jar-with-dependencies.jar [add parameters listed above here]

Via Java code

Another way of running a reference extraction with an existing CRF model is the ReferenceExtractor class. The refext library can be imported into a Maven project by adding the following entries to pom.xml (check the JitPack website for current releases):

<repositories>
  ...
  <repository>
    <id>jitpack.io</id>
    <url>https://jitpack.io</url>
  </repository>
</repositories>
<dependencies>
  ...
  <dependency>
    <groupId>com.github.exciteproject</groupId>
    <artifactId>refext</artifactId>
    <version>[current version]</version>
  </dependency>
</dependencies>

An example Java class:

import java.io.File;
import java.io.IOException;
import java.util.List;

import de.exciteproject.refext.ReferenceExtractor;
import pl.edu.icm.cermine.exception.AnalysisException;

public class Example {
    public static void main(String[] args) throws IOException, AnalysisException {
        // args[0]: trained CRF model file, args[1]: PDF to extract references from
        ReferenceExtractor referenceExtractor = new ReferenceExtractor(new File(args[0]));
        List<String> references = referenceExtractor.extractReferencesFromPdf(
                new File(args[1]));
        // Print each extracted reference string
        for (String reference : references) {
            System.out.println(reference);
        }
    }
}

Here, args[0] is a path to the trained CRF file and args[1] is a path to a PDF file from which the references should be extracted.
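
The example above handles a single PDF. To process a whole directory from Java code, the input files could be selected first, roughly in the spirit of the -pdf and -sizeLimit options of the command line. The following is a minimal sketch under assumptions: PdfSelection and selectPdfs are hypothetical names, files are recognised by their .pdf extension, and whether refext itself treats the size limit as inclusive is not stated in this document.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of refext: chooses the PDF files that a
// batch run would hand to ReferenceExtractor one by one.
public class PdfSelection {

    /**
     * Accepts a single file or a directory and returns the regular files that
     * end in ".pdf" and are at most sizeLimitBytes long, sorted by path.
     */
    public static List<File> selectPdfs(File input, long sizeLimitBytes) {
        File[] candidates = input.isDirectory() ? input.listFiles() : new File[] { input };
        List<File> selected = new ArrayList<>();
        if (candidates == null) {
            return selected; // directory could not be read
        }
        Arrays.sort(candidates); // listFiles() gives no ordering guarantee
        for (File candidate : candidates) {
            boolean isPdf = candidate.getName().toLowerCase().endsWith(".pdf");
            if (candidate.isFile() && isPdf && candidate.length() <= sizeLimitBytes) {
                selected.add(candidate);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // 10000000 bytes is the default of -sizeLimit listed in the help text.
        for (File pdf : selectPdfs(new File(args[0]), 10_000_000L)) {
            System.out.println(pdf.getPath());
        }
    }
}
```

Each selected file could then be passed to extractReferencesFromPdf as in the Example class.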

Training New Models

Training a new supervised CRF model consists of the following steps:

  1. Generate layout CSV files from given PDFs using CermineLineLayoutExtractor.
  2. Generate pre-annotated text from the layout files using TrainingDataAnnotator.
  3. Correct the pre-annotated text so that the lines belonging to a reference string are marked with the following XML tags:
    • <ref>: at the beginning of the first line of a reference string
    • </ref>: at the end of the last line of a reference string
    • <oth>: at the beginning of the first line of information that appears inside a reference string but does not belong to it, for example page numbers, headers, or footers when a reference string spans two pages
    • </oth>: at the end of the last line of other information inside a reference string
    • Important: Do not delete or add any lines. Otherwise, the next step will fail.
  4. Merge the layout files and the annotated files using LabelLayoutMerger.
    • The two files are merged based on their line numbers. Therefore, both documents need to have the same number of lines, and each line in one document needs to correspond to the same text as the line with the same number in the other document.
  5. Run the Main.java class for training.
    • -train and -test can point to the same directory. This way, the trained model will be evaluated on the same data as it was trained on.
    • -model is the file in which the trained model will be saved
    • -feat is a list of comma-separated features (see src/main/java/de/exciteproject/refext/train/pipe/FeaturePipeProvider.java)
    • A number of optional parameters configure the training of the CRF model
    • -h prints a list of all possible parameters, including those for configuring the training of the CRF model
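
Steps 3 and 4 can be illustrated with a small sketch that checks the tag placement and then pairs lines by line number. It is only an illustration under assumptions and not the code of TrainingDataAnnotator or LabelLayoutMerger: the class and method names are hypothetical, and the tab-separated output is an invented format.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch, not part of refext.
public class TrainingDataSketch {

    /** Returns a description of the first tag problem, or empty if the tags are consistent. */
    public static Optional<String> findTagProblem(List<String> annotatedLines) {
        boolean inRef = false;
        boolean inOth = false;
        for (int i = 0; i < annotatedLines.size(); i++) {
            String line = annotatedLines.get(i).trim();
            String where = "line " + (i + 1) + ": ";
            // Opening tags are expected at the beginning of a line.
            if (line.startsWith("<ref>")) {
                if (inRef) return Optional.of(where + "<ref> inside an open reference");
                inRef = true;
            } else if (line.startsWith("<oth>")) {
                if (!inRef || inOth) return Optional.of(where + "<oth> outside a reference or nested");
                inOth = true;
            }
            // Closing tags are expected at the end of a line.
            if (line.endsWith("</ref>")) {
                if (!inRef || inOth) return Optional.of(where + "</ref> without a matching <ref>");
                inRef = false;
            } else if (line.endsWith("</oth>")) {
                if (!inOth) return Optional.of(where + "</oth> without a matching <oth>");
                inOth = false;
            }
        }
        return inRef ? Optional.of("end of file: unclosed <ref> or <oth>") : Optional.empty();
    }

    /** Pairs line i of the annotated text with line i of the layout file. */
    public static List<String> merge(List<String> layoutLines, List<String> annotatedLines) {
        if (layoutLines.size() != annotatedLines.size()) {
            throw new IllegalArgumentException("line counts differ: "
                    + layoutLines.size() + " vs " + annotatedLines.size());
        }
        List<String> merged = new ArrayList<>();
        for (int i = 0; i < layoutLines.size(); i++) {
            // Assumed output format: annotated text, a tab, then the layout line.
            merged.add(annotatedLines.get(i) + "\t" + layoutLines.get(i));
        }
        return merged;
    }

    public static void main(String[] args) {
        // Made-up sample data; real layout lines come from CermineLineLayoutExtractor.
        List<String> annotated = Arrays.asList("<ref>Doe, J. (2010). A title.",
                "<oth>12</oth>", "Journal of Examples, 3(1).</ref>");
        List<String> layout = Arrays.asList("layout-1", "layout-2", "layout-3");
        System.out.println(findTagProblem(annotated).orElse("tags look consistent"));
        merge(layout, annotated).forEach(System.out::println);
    }
}
```

The line-count check in merge mirrors the warning in step 3: adding or deleting a line during annotation shifts every later line against the layout file.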
