Converts PubMed NXML format into (almost) raw text to be used for NL analysis.
This software assumes a Unix-like environment and a working installation of nxml2txt on your path. Besides that, you only need Python.
- Install MacPorts from: https://www.macports.org
- Install Python 2.7 and set it as the default:
sudo port install python27
sudo port select --set python python27
- Install the necessary dependencies for nxml2txt:
sudo port install texlive-latex texlive-latex-recommended texlive-latex-extra py-lxml
- Install nxml2txt and add the binary to your $PATH:
git clone https://github.com/spyysalo/nxml2txt.git
cd nxml2txt
chmod 755 nxml2txt nxml2txt.sh
export PATH=<PATH WHERE nxml2txt IS INSTALLED>:$PATH
- Install this project:
git clone https://github.com/sistanlp/nxml2fries.git
./nxml2fries [--no-citations] arg1.nxml [... argn.nxml]
- --no-citations: If enabled, this option removes the reference citations from the text and fills the space they occupied with white-space characters.
- argn.nxml: The NXML file or list of files to operate over.
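The idea behind replacing citations with white space, rather than deleting them, can be sketched as follows. This is only an illustration and not the project's actual code: the function name `blank_spans` and the way the citation span is located are assumptions made for the example. The point it shows is that the text keeps its length, so character offsets of everything after the citation stay the same.

```python
def blank_spans(text, spans):
    """Replace each (start, end) character range with spaces.

    Hypothetical helper: the output has the same length as the input,
    so offsets into the remaining text are unchanged.
    """
    chars = list(text)
    for start, end in spans:
        for i in range(start, end):
            chars[i] = " "
    return "".join(chars)


# Made-up sentence with a bracketed citation marker.
sentence = "Ras is an oncogene [1] in many tumors."
start = sentence.index("[1]")
blanked = blank_spans(sentence, [(start, start + len("[1]"))])

print(repr(blanked))
print(len(sentence) == len(blanked))
```

Keeping the length constant is useful when annotations produced on the cleaned text have to be mapped back onto the original document.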
Each output file is in tab-separated-values format. Its fields are:
- Paragraph ID: A unique ID within the document, assigned to the current paragraph.
- Section ID: The id of the section in the paper.*
- Normalized section name: A normalized version of the section to create equivalence classes of sections between papers. For example, Materials/Methods and Materials and methods would have the same normalized name materials-methods.*
- Is title: Whether the text in the line is the title of a section/paper/figure. 1 for true and 0 for false.
- Text: The text of the current paragraph/figure/reference.
* If this field has no information, its content will be N/A.
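A minimal sketch of reading such a file from Python is shown below. It assumes that each line holds exactly the five fields above, in that order, with no header row. The names `Row` and `read_rows` are hypothetical and not part of nxml2fries. Fields that contain N/A are mapped to `None`, and the title flag is converted to a boolean.

```python
import csv
import io
from collections import namedtuple

# Hypothetical record type; the field names are chosen for this example only.
Row = namedtuple(
    "Row", ["paragraph_id", "section_id", "section_norm", "is_title", "text"]
)


def read_rows(handle):
    """Parse tab-separated lines into Row records (assumes no header row)."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for fields in reader:
        if len(fields) != 5:
            continue  # skip blank or malformed lines
        pid, sec_id, sec_norm, is_title, text = fields
        rows.append(
            Row(
                pid,
                None if sec_id == "N/A" else sec_id,
                None if sec_norm == "N/A" else sec_norm,
                is_title == "1",
                text,
            )
        )
    return rows


# Two sample lines; a real file would be opened with open(path) instead.
sample = (
    "52\ts2f\tN/A\t1\tBiochemical analyses\n"
    "59\ts2g\tmaterials-methods\t0\tTo measure the effect of Ras on PI3KC2beta ...\n"
)
rows = read_rows(io.StringIO(sample))
for row in rows:
    print(row.paragraph_id, row.section_id, row.section_norm, row.is_title)
```

Quoting is disabled because the text field is free prose and may contain quote characters that should be read literally.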
ID | sec_id | sec_norm | Is title | Text |
---|---|---|---|---|
52 | s2f | N/A | 1 | Biochemical analyses |
The title for section s2f.
ID | sec_id | sec_norm | Is title | Text |
---|---|---|---|---|
59 | s2g | materials-methods | 0 | To measure the effect of Ras on PI3KC2beta ... |
A paragraph of section-id s2g, with a normalized section.
ID | sec_id | sec_norm | Is title | Text |
---|---|---|---|---|
96 | references | references | 0 | 1 Karnoub AE , Weinberg RA ( 2008 ) Ras oncogenes: split personalities . Nat Rev Mol Cell Biol 9 : 517 - 531 18568040 |
One of the references of a paper.
ID | sec_id | sec_norm | Is title | Text |
---|---|---|---|---|
25 | fig-4 | fig-4 | 1 | ITSN1 and Ras form a BiFC complex. |
A figure's title.