wing-nus / neural-parscit

This project is forked from opensourceware/neural-parscit.

Neuralized version of the Reference String Parser component of the ParsCit package.

Home Page: http://wing.comp.nus.edu.sg/parsCit

License: Other

Topics: citation-parsing, digital-libraries, scholarly-articles, scholarly-metadata, bilstm-crf-model, bilstm-crf, natural-language-processing, deep-learning

neural-parscit's Introduction

Neural ParsCit

This is the official repository of Neural ParsCit, under active development at the National University of Singapore (NUS), Singapore.

Neural ParsCit is a citation string parser that parses reference strings into their component tags, such as Author, Journal, Location, and Date. Neural ParsCit uses a Long Short-Term Memory (LSTM) network, a deep learning model chosen because it is designed for sequence labeling tasks such as ours. The inputs to the model are word embeddings, which are vector representations of words; we provide both word embeddings and character embeddings as input to the network.

Initial setup

To use the tagger, you need Python 2.7 (it works in Python 3 but is not fully supported), with NumPy, Theano and gensim installed. scikit-learn is needed for model evaluation if you are training a new model.

You can use environmental variables to set the following:

  • MODEL_PATH: Path to the model's parameters
  • WB_PATH: Path to the word embeddings
  • TIMEOUT: Timeout (in seconds) for gunicorn when starting the Flask app. Increase this if the Flask app fails to start because building the model takes too long. [Default: 60]
  • NUM_WORKERS: Number of workers which gunicorn spawns. [Default: 1]
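
For example, these can be exported in the shell before starting the app (the paths below are illustrative, not required locations):

export MODEL_PATH=models/neuralParsCit/   # path to the model's parameters
export WB_PATH=vectors_with_unk.kv        # path to the word embeddings
export TIMEOUT=120                        # give gunicorn longer to build the model
export NUM_WORKERS=1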

Using virtualenv in Linux systems

virtualenv -ppython2.7 .venv
source .venv/bin/activate
pip install -r requirements/<env>.txt

Where <env> is one of prod, dev or test.

Using Docker

  1. Build the image: docker build -t theano-gensim - < Dockerfile
  2. Run the repo mounted to the container: docker run -it -v $(pwd):/usr/src --name np theano-gensim:latest /bin/bash

Word Embeddings

The word embeddings do not come with this repository. You can obtain the word embeddings with <UNK> from the WING website. Please read the next section on the availability of <UNK> in word embeddings.

You will need to extract the content of the word embedding archive (vectors_with_unk.tar.gz) to the root directory of this repository by running tar xfz vectors_with_unk.tar.gz.

Embeddings Without <UNK>

If the word embeddings provided do not have <UNK>, your instance will not benefit from lazy loading of the word vectors and hence from the reduced memory requirements.

Without <UNK>, at most 7.5 GB of memory is required, as the entire set of word vectors must be instantiated in memory to create the new matrix. With <UNK>, the requirement is much lower: at most 4.5 GB.

Parse citation strings

Command Line

The fastest way to use the parser is to run the state-of-the-art pre-trained model as follows:

./run.py --model_path models/neuralParsCit/ --pre_emb <vectors.bin> --run shell
./run.py --model_path models/neuralParsCit/ --pre_emb <vectors.bin> --run file -i input_file -o output_file

The script can run interactively, or input can be passed in a file. In an interactive session, the strings are passed one at a time and the result is displayed on standard output. If the file option is chosen, the input is read from the file given by the -i option and the output is stored in the file given by -o. Using the file option, multiple citation strings can be parsed.
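
For example, assuming one citation string per line, an input file for -i could look like this:

Councill, I. G., Giles, C. L., Kan, M.-Y. (2008). ParsCit: An open-source CRF reference string parsing package. In Proceedings of LREC 2008, Marrakech, Morocco.
Prasad, A., Kaur, M., Kan, M.-Y. (2018). Neural ParsCit: a deep learning-based reference string parser. International Journal on Digital Libraries, 19, 323-337.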

The state-of-the-art trained model is provided in the models folder and is named neuralParsCit. The binary file for the word embeddings is provided in the Docker image of the current version of Neural ParsCit. The hyperparameter discarded is the number of embeddings not used in our model; retained words have a frequency of more than 0 in the ACM citation literature from 1994-2014.

Using a Web Server

Note: This service is not Python 3 compatible due to unicode handling.

The web server (a Flask app) provides a REST API.

To run the web server: docker run --rm -it -p 8000:8000 -e TIMEOUT=60 -v $(pwd):/usr/src --name np theano-gensim:latest /bin/bash

In the container, run: gunicorn -b 0.0.0.0:8000 -w $NUM_WORKERS --timeout $TIMEOUT run_app:app

The REST API documentation can be found at http://localhost:8000/docs
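
Once the server is up, the API can be exercised with curl. The route and payload below are assumptions for illustration only; consult http://localhost:8000/docs for the actual contract:

# Hypothetical endpoint and payload -- see /docs for the real routes
curl -X POST http://localhost:8000/parscit/parse \
     -H 'Content-Type: application/json' \
     -d '{"strings": ["Prasad, A., Kaur, M., Kan, M.-Y. (2018). Neural ParsCit. IJDL, 19, 323-337."]}'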

Train a model

To train your own model, you need to use the train.py script and provide the locations of the training, development and test sets:

./train.py --train train.txt --dev dev.txt --test test.txt

The training script will automatically give a name to the model and store it in ./models/. There are many parameters you can tune (CRF, dropout rate, embedding dimension, LSTM hidden layer size, etc.). To see all parameters, simply run:

./train.py --help

Input files for the training script must follow this format: each word of the citation string and its corresponding tag go on a separate line, and citation strings are separated by a blank line.
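
For example, each line holds a word followed by its tag, with a blank line between citations (the tag names shown are illustrative):

M. author
Fedoryszak author
Large title
Scale title
Citation title
Matching title

I. author
Councill author
ParsCit title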

Details about the training data and experiments can be found in the following article. The training data and the CRF baseline can be downloaded from https://github.com/knmnyn/ParsCit. Please consider citing the following publication(s) if you use Neural ParsCit:

@article{animesh2018neuralparscit,
  title={Neural ParsCit: A Deep Learning Based Reference String Parser},
  author={Prasad, Animesh and Kaur, Manpreet and Kan, Min-Yen},
  journal={International Journal on Digital Libraries},
  volume={19},
  pages={323--337},
  year={2018},
  publisher={Springer},
  url={https://link.springer.com/article/10.1007/s00799-018-0242-1}
}

neural-parscit's People

Contributors

animeshprasad, cmkumar87, donglixp, kylase, opensourceware


neural-parscit's Issues

Small model needed for integration tests

Currently, most of the integration tests (such as the API tests) are disabled during CI because the model takes more than 4 GB of memory to run and Travis is unable to fulfill the memory requirements.

Hence, when the PyTorch model is implemented, a small reference model needs to be prepared for integration tests.

Missing web server dependencies in Dockerfile

I have an issue when using the command below.
/var/www/html/Neural-ParsCit$ sudo docker run --rm -it -p 8000:8000 -e TIMEOUT=60 -v $(pwd):/usr/src --name np theano-gensim:latest /bin/bash
root@46ecc721bd85:/usr/src# gunicorn -b 0.0.0.0:8000 -w $NUM_WORKERS --timeout $TIMEOUT run_app:app
bash: gunicorn: command not found
root@46ecc721bd85:/usr/src# gunicorn -b 0.0.0.0:8080 -w $NUM_WORKERS --timeout $TIMEOUT run_app:app
bash: gunicorn: command not found

How to run SciWing on GPU?

This package is amazing, so thank you so much! I'm trying to parse a bunch of documents and I noticed that despite a GPU being available and torch recognizing it, the sciwing package won't run its models on the GPU. For example, if I run this code:

    neural_parscit = NeuralParscit()   
    device = torch.device('cuda:0') if torch.cuda.is_available() else torch.device('cpu')
    neural_parscit.to(device)

I get the following error, suggesting that the model won't put the data onto the GPU.

  File "/opt/anaconda/envs/sciwing/lib/python3.8/site-packages/sciwing/models/neural_parscit.py", line 164, in predict_for_text
    predictions = self._predict(line=text)
  File "/opt/anaconda/envs/sciwing/lib/python3.8/site-packages/sciwing/models/neural_parscit.py", line 111, in _predict
    predictions = self.infer.on_user_input(line=line)
  File "/opt/anaconda/envs/sciwing/lib/python3.8/site-packages/sciwing/infer/seq_label_inference/seq_label_inference.py", line 285, in on_user_input
    return self.infer_batch(lines=[line])
  File "/opt/anaconda/envs/sciwing/lib/python3.8/site-packages/sciwing/infer/seq_label_inference/seq_label_inference.py", line 298, in infer_batch
    model_output_dict = self.model_forward_on_lines(lines=lines_)
  File "/opt/anaconda/envs/sciwing/lib/python3.8/site-packages/sciwing/infer/seq_label_inference/seq_label_inference.py", line 140, in model_forward_on_lines
    model_output_dict = self.model(
  File "/opt/anaconda/envs/sciwing/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/anaconda/envs/sciwing/lib/python3.8/site-packages/sciwing/models/rnn_seq_crf_tagger.py", line 115, in forward
    encoding = self.rnn2seqencoder(lines=lines)
  File "/opt/anaconda/envs/sciwing/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/anaconda/envs/sciwing/lib/python3.8/site-packages/sciwing/modules/lstm2seqencoder.py", line 121, in forward
    embeddings = self.embedder(lines=lines)
  File "/opt/anaconda/envs/sciwing/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/anaconda/envs/sciwing/lib/python3.8/site-packages/sciwing/modules/embedders/concat_embedders.py", line 46, in forward
    embedding = embedder(lines)
  File "/opt/anaconda/envs/sciwing/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/anaconda/envs/sciwing/lib/python3.8/site-packages/sciwing/modules/embedders/trainable_word_embedder.py", line 75, in forward
    embedding = self.embedding(numericalized_tokens)
  File "/opt/anaconda/envs/sciwing/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/anaconda/envs/sciwing/lib/python3.8/site-packages/torch/nn/modules/sparse.py", line 156, in forward
    return F.embedding(
  File "/opt/anaconda/envs/sciwing/lib/python3.8/site-packages/torch/nn/functional.py", line 1916, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Input, output and indices must be on the current device

Is there a standard way of having SciWing run on a GPU? I couldn't find anything in the documentation or by poking around. Thanks!
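
For reference, the final RuntimeError is generic PyTorch behaviour rather than anything SciWING-specific: the embedding weights have been moved to the GPU while the numericalized token indices remain on the CPU. A minimal reproduction, independent of SciWING:

    import torch

    emb = torch.nn.Embedding(10, 4).to('cuda:0')  # embedding weights on the GPU
    idx = torch.tensor([1, 2, 3])                 # token indices still on the CPU
    # emb(idx) raises: "Input, output and indices must be on the current device"
    out = emb(idx.to('cuda:0'))                   # works once both are on the same device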

Neural-ParsCit

While running the command below,
./run.py --model_path models/neuralParsCit/ --pre_emb vectors_with_unk.kv --run file -i 05_Ref.txt -o output.text
I got the errors below; could you please suggest what I have done wrong?
from: too many arguments
import: unable to open X server ' @ error/import.c/ImportImageCommand/364. import: unable to open X server ' @ error/import.c/ImportImageCommand/364.
import: unable to open X server ' @ error/import.c/ImportImageCommand/364. import: unable to open X server ' @ error/import.c/ImportImageCommand/364.
import: unable to open X server ' @ error/import.c/ImportImageCommand/364. import: unable to open X server ' @ error/import.c/ImportImageCommand/364.
from: too many arguments
from: too many arguments
from: too many arguments
from: too many arguments
./run.py: line 12: $'\r': command not found
./run.py: line 13: $'\r': command not found
./run.py: line 14: syntax error near unexpected token `('
./run.py: line 14: `optparser = optparse.OptionParser()'
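
The $'\r' messages and the stray from/import errors indicate that run.py is being interpreted by the shell rather than by Python, which typically happens when the file has Windows (CRLF) line endings that break the shebang. A likely fix, assuming dos2unix is installed:

dos2unix run.py
./run.py --model_path models/neuralParsCit/ --pre_emb vectors_with_unk.kv --run file -i 05_Ref.txt -o output.text

Alternatively, invoking the script as python run.py ... sidesteps the shebang line entirely, since Python itself tolerates CRLF line endings.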

Improve usability and ease of installation

Problems

  1. The peak memory usage exceeds 16 GB when building the model, which results in a long build time.
  2. gensim.models.word2vec.Word2Vec has deprecated load_word2vec_format.
  3. Python dependencies file is missing.

Where is vectors.bin?

I have built the docker image and am running the container interactively, but I don't have the vectors.bin file necessary to use the state-of-the-art model that is saved in the repository. The README says that the vectors.bin file is in the docker image; does that mean that when I run the session interactively I should see vectors.bin, or do I need to pull some other docker image? The Dockerfile does not reference anything that I believe contains the vectors.bin file.

Problem while Training the Model

When I try to train the model I face the issue below; please help me sort it out.

root@02dae0d0158a:/usr/src# ./train.py --train train.txt --dev dev.txt --test test.txt
[INFO] 2019-03-27 06:05:13,617: Model location: ./models/lower=False,zeros=False,char_dim=25,char_lstm_dim=25,char_bidirect=True,word_dim=100,word_lstm_dim=100,word_bidirect=
True,pre_emb=,all_emb=False,cap_dim=0,crf=True,dropout=0.5,lr_method=sgd-lr_.005
Found 2 unique words (8 in total)
Found 8 unique characters
Found 2 unique named entity tags
[INFO] 2019-03-27 06:05:13,623: 8 / 0 / 0 sentences in train / dev / test.
[INFO] 2019-03-27 06:05:13,623: Saving the mappings to disk...
Traceback (most recent call last):
  File "./train.py", line 200, in <module>
    f_train, f_eval = model.build(**parameters)
  File "/usr/src/model.py", line 177, in build
    pretrained = self.load_word_embeddings(pre_emb)
  File "/usr/src/model.py", line 425, in load_word_embeddings
    raise IOError("{embeddings} cannot be found.".format(embeddings=embeddings))
IOError: cannot be found.
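
The missing path in the message ("{embeddings} cannot be found" formatted with an empty string) indicates that no --pre_emb argument was supplied. Presumably the embeddings location has to be passed explicitly, e.g.:

./train.py --pre_emb vectors_with_unk.kv --train train.txt --dev dev.txt --test test.txt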

Model saving and loading

Reference: PyTorch

Applications

  • Saving/loading of the model for #17, as we want to monitor how the model's performance on validation/test data progresses over the training epochs.
  • For inference with Flask (REST API)

index errors

I came across index errors for some files. I examined these files but could not identify any common feature. The error is below. Has anyone seen the same error?

IndexError: index 1279213 is out of bounds for size 1278786
Apply node that caused the error: AdvancedSubtensor1(word_layer__embeddings, word_ids)
Toposort index: 16
Inputs types: [TensorType(float64, matrix), TensorType(int32, vector)]
Inputs shapes: [(1278786, 500), (537,)]
Inputs strides: [(4000, 8), (4,)]
Inputs values: ['not shown', 'not shown']
Outputs clients: [[Join(TensorConstant{1}, AdvancedSubtensor1.0, AdvancedSubtensor.0, AdvancedSubtensor.0, AdvancedSubtensor1.0)]]

Backtrace when the node is created (use Theano flag traceback.limit=N to make it longer):
  File "parscit/run.py", line 42, in <module>
    f = model.build(training=False, **model.parameters)
  File "parscit/model.py", line 223, in build
    word_input = word_layer.link(word_ids)
  File "parscit/nn.py", line 101, in link
    self.output = self.embeddings[self.input]

training data

Great work! Could you please share the training and test corpora from your experiment? I would also like to reproduce it. Thanks!

Add the link to download the new trained word vectors

The binary form of the word vectors will no longer be usable in 1.0.2, as the word vectors are now loaded with KeyedVectors using mmap for performance and reduced memory consumption.

The new word vector download link will be added to README.md.
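
A sketch of the loading pattern described above, assuming the new vectors are saved in gensim's KeyedVectors format (the file name is illustrative):

from gensim.models import KeyedVectors

# mmap='r' memory-maps the vector matrix from disk instead of copying it into RAM
vectors = KeyedVectors.load('vectors_with_unk.kv', mmap='r')
print(vectors['parser'].shape)  # each vector is 500-dimensional in this model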

TypeError: 'bool' object has no attribute '__getitem__'

Version 1.0.5 currently fails with the following stack trace:

Traceback (most recent call last):
  File "run.py", line 71, in <module>
    data = prepare_dataset(test_sentences, word_to_id, char_to_id, lower, True)
  File "~/src/Neural-ParsCit/loader.py", line 151, in prepare_dataset
    tags = [tag_to_id[w[-1]] for w in s]
TypeError: 'bool' object has no attribute '__getitem__'

The cause of the problem is a mismatch between the parameter list:

loader.py:128: def prepare_dataset(sentences, word_to_id, char_to_id, tag_to_id, lower=False, zeros=False):

And the call site:

run.py:71: data = prepare_dataset(test_sentences, word_to_id, char_to_id, lower, True)

Version 1.0.4 does not have this issue.
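
A minimal fix, assuming the 1.0.4 behaviour is the intended one, is to thread tag_to_id through at the call site so that lower and zeros land in the right positions:

# run.py:71, corrected (sketch)
data = prepare_dataset(test_sentences, word_to_id, char_to_id, tag_to_id, lower, True)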

Unable to train using own data

Hi,

Thanks for all your great work on this tool. I am having an issue training Neural ParsCit using my own training data.

I have downloaded the word embeddings from http://wing.comp.nus.edu.sg/~wing.nus/resources/NParsCit/vectors_with_unk.tar.gz and they are located under the main folder.

I am able to parse citations using the command line as described:

./run.py --model_path models/neuralParsCit/ --pre_emb <vectors.bin> --run shell

From what I understand, the training data should be in the following form, with a blank line between citations:

M. author
Fedoryszak, author
Large title
Scale title
Citation title
Matching title
Using title
Apache title
Hadoop title
in booktitle
International booktitle
etc.

When I run

./train.py --train train.txt --dev dev.txt --test test.txt

I get the error

raise IOError("{embeddings} cannot be found.".format(embeddings=embeddings))

I understand this is due to not providing the location of the embeddings, but when I run
./train.py --pre_emb vectors_with_unk.kv --train train.txt --dev dev.txt --test test.txt

I get the error:

Traceback (most recent call last):
  File "train.py", line 201, in <module>
    f_train, f_eval = model.build(**parameters)
  File "/Neural-ParsCit/model.py", line 208, in build
    new_weights[i] = pretrained[word]
ValueError: could not broadcast input array from shape (500) into shape (100)

Any help greatly appreciated.
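
The shapes in the error match 500-dimensional pretrained vectors against the default word_dim=100 visible in the auto-generated model name, so the word embedding dimension presumably has to be raised to match, assuming train.py exposes a --word_dim option:

./train.py --pre_emb vectors_with_unk.kv --word_dim 500 --train train.txt --dev dev.txt --test test.txt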

Reimplement command line parsing

optparse is deprecated as of Python 2.7, so the current command line parser needs to be rewritten with argparse, which is supported in Python 3.x. A sketch follows the feature list below.

Features

  • Training of model
    • seed --seed
    • use CUDA --cuda
    • pretrained embeddings --pretrained
    • learning rate --lr
    • optimizer --optim
    • log level --log; default warn
  • Inference
    • Interactive -i
    • File -f
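
A minimal argparse sketch covering the flags listed above (subcommand names and defaults are assumptions until the rewrite lands):

import argparse

parser = argparse.ArgumentParser(description='Neural ParsCit')
subparsers = parser.add_subparsers(dest='command')

# Training of model
train = subparsers.add_parser('train', help='train a model')
train.add_argument('--seed', type=int, default=42, help='random seed (assumed default)')
train.add_argument('--cuda', action='store_true', help='use CUDA')
train.add_argument('--pretrained', help='path to pretrained embeddings')
train.add_argument('--lr', type=float, default=0.005, help='learning rate')
train.add_argument('--optim', default='sgd', help='optimizer')
train.add_argument('--log', default='warn', help='log level')

# Inference
infer = subparsers.add_parser('infer', help='run inference')
mode = infer.add_mutually_exclusive_group()
mode.add_argument('-i', '--interactive', action='store_true', help='interactive mode')
mode.add_argument('-f', '--file', help='input file')

args = parser.parse_args()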

Can we use neural parscit to parse .pdf documents?

I have a list of academic documents in .pdf format, and I want to extract citations from these documents. Can Neural ParsCit be used on .pdf documents, and if so, how?

Web server Integration issue

While using the command below I am getting an issue:
root@13d0d88a6545:/usr/src# gunicorn -b 0.0.0.0:8000 -w 1 --timeout 60 run_app:app
[2019-03-29 10:45:33 +0000] [34] [INFO] Starting gunicorn 19.9.0
[2019-03-29 10:45:33 +0000] [34] [INFO] Listening at: http://0.0.0.0:8000 (34)
[2019-03-29 10:45:33 +0000] [34] [INFO] Using worker: sync
[2019-03-29 10:45:33 +0000] [38] [INFO] Booting worker with pid: 38
[INFO] 2019-03-29 10:45:49,179: Loading model from /usr/src/models/neuralParscit and using word embeddings from /usr/src/vectors_with_unk.kv
[2019-03-29 10:45:49 +0000] [38] [ERROR] Exception in worker process
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/gunicorn/arbiter.py", line 583, in spawn_worker
    worker.init_process()
  File "/usr/local/lib/python2.7/site-packages/gunicorn/workers/base.py", line 129, in init_process
    self.load_wsgi()
  File "/usr/local/lib/python2.7/site-packages/gunicorn/workers/base.py", line 138, in load_wsgi
    self.wsgi = self.app.wsgi()
  File "/usr/local/lib/python2.7/site-packages/gunicorn/app/base.py", line 67, in wsgi
    self.callable = self.load()
  File "/usr/local/lib/python2.7/site-packages/gunicorn/app/wsgiapp.py", line 52, in load
    return self.load_wsgiapp()
  File "/usr/local/lib/python2.7/site-packages/gunicorn/app/wsgiapp.py", line 41, in load_wsgiapp
    return util.import_app(self.app_uri)
  File "/usr/local/lib/python2.7/site-packages/gunicorn/util.py", line 350, in import_app
    __import__(module)
  File "/usr/src/run_app.py", line 8, in <module>
    app = create_app(CONFIG)
  File "/usr/src/app/__init__.py", line 26, in create_app
    model, inference = get_model(model_path, word_emb_path)
  File "/usr/src/app/utils.py", line 7, in get_model
    g.model = Model(model_path=model_path)
  File "/usr/src/model.py", line 55, in __init__
    with open(self.parameters_path, 'rb') as f:
IOError: [Errno 2] No such file or directory: '/usr/src/models/neuralParscit/parameters.pkl'
[2019-03-29 10:45:49 +0000] [38] [INFO] Worker exiting (pid: 38)
[2019-03-29 10:45:49 +0000] [34] [INFO] Shutting down: Master
[2019-03-29 10:45:49 +0000] [34] [INFO] Reason: Worker failed to boot.
