nxml2fries

Converts PubMed NXML format into (almost) raw text to be used for NL analysis.

Requirements

This software assumes a Unix-like environment and a working installation of nxml2txt on your path. Besides that, you only need Python.

Installation Instructions for Mac

  1. Install MacPorts from: https://www.macports.org
  2. Install Python 2.7 and set it as the default:
       sudo port install python27
       sudo port select --set python python27
  3. Install the necessary dependencies for nxml2txt:
       sudo port install texlive-latex texlive-latex-recommended texlive-latex-extra py-lxml
  4. Install nxml2txt and add the binary to your $PATH:
       git clone https://github.com/spyysalo/nxml2txt.git
       cd nxml2txt
       chmod 755 nxml2txt nxml2txt.sh
       export PATH=<PATH WHERE nxml2txt IS INSTALLED>:$PATH
  5. Install this project:
       git clone https://github.com/sistanlp/nxml2fries.git
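
Before running the converter, it can help to confirm that the nxml2txt executable is reachable through $PATH. The following is a minimal sketch using only the Python standard library. The helper name `find_tool` is hypothetical and not part of this project.

```python
import os


def find_tool(name):
    """Return the first executable called `name` found on $PATH, or None."""
    for directory in os.environ.get("PATH", "").split(os.pathsep):
        if not directory:
            continue  # skip empty PATH entries
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


if __name__ == "__main__":
    location = find_tool("nxml2txt")
    if location is None:
        print("nxml2txt was not found on PATH; re-check the export step")
    else:
        print("nxml2txt found at %s" % location)
```

If the check fails in a new terminal, the `export` line was probably run only in an earlier shell session and needs to be added to your shell startup file.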

Usage

./nxml2fries [--no-citations] arg1.nxml [... argn.nxml]
  • --no-citations: If enabled, this option removes the reference citations from the text and fills the space they occupied with whitespace characters.

  • argn.nxml: The nxml file or list of files to operate over.
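
For batch runs, the command line above can be assembled from Python. This is only a sketch: the function names `build_command` and `convert` are hypothetical, the flag spelling is taken from the usage synopsis, and the script is assumed to be in the current directory.

```python
import subprocess


def build_command(nxml_files, no_citations=False, executable="./nxml2fries"):
    """Build the argument list for one nxml2fries call.

    The flag spelling follows the usage synopsis; adjust it if your copy
    of the script expects a different one.
    """
    nxml_files = list(nxml_files)
    if not nxml_files:
        raise ValueError("at least one .nxml file is required")
    command = [executable]
    if no_citations:
        command.append("--no-citations")
    command.extend(nxml_files)
    return command


def convert(nxml_files, no_citations=False):
    """Run the converter; raises CalledProcessError on a non-zero exit."""
    subprocess.check_call(build_command(nxml_files, no_citations))
```

Passing the arguments as a list, rather than as one shell string, avoids quoting problems with file names that contain spaces.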

Output format

Each output file is in tab-separated-values format. Its fields are:

  • Paragraph ID: A unique id in the document assigned to the current paragraph.
  • Section ID: The id of the section in the paper.*
  • Normalized section name: A normalized version of the section name, used to create equivalence classes of sections between papers. For example, Materials/Methods and Materials and methods would have the same normalized name materials-methods.*
  • Is title: Whether the text in the line is the title of a section/paper/figure. 1 for true and 0 for false.
  • Text: The text of the current paragraph/figure/reference.

* If this field has no information, its content will be N/A.
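
The field list above maps naturally onto a small record type. The sketch below is one way to read an output file. It assumes five tab-separated fields per line in the order listed, no header row, and UTF-8 encoding. None of these details is confirmed here, and the names `Row`, `parse_line` and `read_rows` are hypothetical.

```python
import io
from collections import namedtuple

Row = namedtuple(
    "Row", ["paragraph_id", "section_id", "section_norm", "is_title", "text"]
)


def parse_line(line):
    """Split one output line into its five fields."""
    # Split on the first four tabs only, so tabs inside the text survive.
    fields = line.rstrip("\r\n").split("\t", 4)
    if len(fields) != 5:
        raise ValueError("expected 5 tab-separated fields, got %d" % len(fields))
    paragraph_id, section_id, section_norm, is_title, text = fields
    return Row(
        paragraph_id,
        None if section_id == "N/A" else section_id,      # N/A means "no information"
        None if section_norm == "N/A" else section_norm,
        is_title == "1",                                  # 1 for true, 0 for false
        text,
    )


def read_rows(path):
    """Read every non-empty line of an output file into Row records."""
    with io.open(path, encoding="utf-8") as handle:
        return [parse_line(line) for line in handle if line.strip()]
```

The paragraph id is kept as a string because nothing here guarantees that it is always numeric.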

Examples

ID   sec_id   sec_norm   Is title   Text
52   s2f      N/A        1          Biochemical analyses

The title for section s2f.

ID   sec_id   sec_norm            Is title   Text
59   s2g      materials-methods   0          To measure the effect of Ras on PI3KC2beta ...

A paragraph of section-id s2g, with a normalized section.
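
To illustrate how two spellings of a section title can collapse into one normalized name such as materials-methods, here is a purely hypothetical normalizer. It is not the rule this tool implements. It only reproduces the Materials/Methods example given under Output format, and the name `normalize_section_name` and the stop-word list are assumptions.

```python
import re

# Assumption: connective words are dropped; the tool's real list is unknown.
STOPWORDS = {"and"}


def normalize_section_name(title):
    """Lower-case the title, keep alphanumeric words, join them with hyphens."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return "-".join(word for word in words if word not in STOPWORDS)


if __name__ == "__main__":
    print(normalize_section_name("Materials/Methods"))
    print(normalize_section_name("Materials and methods"))
```

Both calls yield the same string, which is what makes the normalized name usable as an equivalence class across papers.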

ID   sec_id       sec_norm     Is title   Text
96   references   references   0          1 Karnoub AE , Weinberg RA ( 2008 ) Ras oncogenes: split personalities . Nat Rev Mol Cell Biol 9 : 517 - 531 18568040

One of the references of a paper

ID   sec_id   sec_norm   Is title   Text
25   fig-4    fig-4      1          ITSN1 and Ras form a BiFC complex.

A figure's title

Contributors

enoriega, marcovzla, mihaisurdeanu
