
CRF-LSTM models for sequence labeling in Text

This repository contains implementations of various CRF-LSTM models. More details about these models can be found in the paper: Structured Prediction Models for RNN based Sequence Labeling in clinical text, EMNLP 2016.

git clone https://github.com/abhyudaynj/LSTM-CRF-models.git

The original code for the paper was written in Theano and NumPy. This is a slightly more optimized version of the original code, with all the main computations done entirely in Theano/Lasagne. It also adds support for incorporating handcrafted features and UMLS Semantic Types. UMLS Semantic Types can be extracted using the MetaMap software, which requires a UMLS license. To run without UMLS, set the -umls option to 0.

Requirements

This code is built on the excellent work done by the Theano and Lasagne folks. For the other necessary packages, see the requirements file.

It is preferable to train the models on GPUs, in the interest of a manageable training time.

Getting Started

After installing the dependencies and cloning the repository, please use the following steps.

Setting up the Input Directory

A toy sample dataset containing all possible types of input files is provided in sample-data/. Files like 001 are raw text files. Only raw text files are needed to run the trained models on your data. To train the tagger model, you also need annotation files like 001.json. Annotation files are JSON files containing a list of [start char offset, end char offset, annotated text, annotation type, annotation id] entries. Please take a look at the functions file_extractor and annotated_file_extractor in bionlp/preprocess/extract_data.py for more details. Files like 001.umls.json contain MetaMap annotations for each file; they are needed for training and deployment when the umls option is set to 1. The model provided has this feature off, so you only need raw text files in your directory.
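As a rough illustration, the following sketch writes an annotation file in the [start offset, end offset, annotated text, annotation type, annotation id] layout described above. The entity labels, offsets, and exact JSON structure here are assumptions made for illustration; check annotated_file_extractor in bionlp/preprocess/extract_data.py (and the files in sample-data/) for the format the code actually expects.

```python
# Hypothetical example of an annotation file such as 001.json.
# Labels and offsets are made up; consult sample-data/ for the real format.
import json

annotations = [
    [112, 121, "metformin", "Drugname", "T1"],       # hypothetical entity span
    [190, 204, "abdominal pain", "ADE", "T2"],       # hypothetical entity span
]

with open("sample-data/001.json", "w") as f:
    json.dump(annotations, f, indent=2)
```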

The tagger takes as input a file containing a list of all raw text files to be processed. To generate this file, use the following:

python scripts/get_file_list.py -i [input directory] -o [file-list-output] -e -1

For the full list of options, use

python scripts/get_file_list.py -h
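For orientation, the file list is just a text file enumerating the raw text files to process. A rough, hypothetical equivalent of what the script produces (assuming one absolute path per line, which may not match its exact output format) is:

```python
# Rough sketch of get_file_list.py's job: collect the raw text files in a
# directory and write their paths to a list file. The one-path-per-line
# format is an assumption; use the provided script for real runs.
import os
import sys

input_dir, output_file = sys.argv[1], sys.argv[2]

with open(output_file, "w") as out:
    for name in sorted(os.listdir(input_dir)):
        # Skip annotation and MetaMap .json files; keep only raw text files.
        if not name.endswith(".json"):
            out.write(os.path.abspath(os.path.join(input_dir, name)) + "\n")
```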

Deploying the tagger

The trained model file can be obtained with

wget http://www.bio-nlp.org/external_user_uploads/skip-crf-approx.pkl 

UPDATE: The model file will be updated as various training runs for the deployment model finish. The latest model files will be uploaded at the same URL.

Use the following command to run the tagger on all the files in the file-list-output. The tagger will populate the output directory with JSON files containing the predicted annotations.

python deploy.py -i [file-list-output] -d [output-directory] -model [model-file]
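As a quick sanity check, you can load one of the output files and print its contents. This sketch assumes the predicted annotations follow the same list-based layout as the training annotation files, which may not hold exactly; inspect one output file first.

```python
# Print the predicted annotations for one document produced by deploy.py.
# The exact schema of the output .json files is an assumption here.
import json
import sys

with open(sys.argv[1]) as f:   # e.g. a .json file from [output-directory]
    predictions = json.load(f)

for ann in predictions:
    print(ann)
```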

Training the tagger

For training the model, each "filename" in the input directory should have a corresponding "filename.json" in the same directory. Please take a look at the function annotated_file_extractor in bionlp/preprocess/extract_data.py; this is the function that extracts the raw text and annotations from your input files. You only need to provide annotations for the relevant labels; 'Outside' labels need not be provided. The model automatically assigns the 'None' label to any token without an annotation. If you have the umls option set to 1, you also need the *.umls.json files.

To run the training you also need a dependency.json file if you want to provide your own embeddings. A sample file is provided in dependency.json.sample. This file contains dependency paths. To initialize with your own embeddings, 'mdl' in dependency.json should point to the binary model of word vectors generated by the Word2Vec software.
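A minimal sketch of such a file, assuming only the 'mdl' key described above (copy any other entries from dependency.json.sample; the path below is hypothetical):

```python
# Write a minimal dependency.json pointing at your own Word2Vec binary model.
# Only the 'mdl' key is documented above; other keys should come from
# dependency.json.sample.
import json

dependencies = {"mdl": "/path/to/word2vec-vectors.bin"}  # hypothetical path

with open("dependency.json", "w") as f:
    json.dump(dependencies, f, indent=2)
```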

Once you have the input-file-list and annotation .json files in place and the dependencies are set, type the following to start the training:

python train_crf_rnn.py -model [output-model-file] -i [file-list-output]

There are multiple parameters to tune. To check the options, use

python train_crf_rnn.py -h 
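As noted in the Requirements section, training is best run on a GPU. With Theano this is usually selected through the THEANO_FLAGS environment variable (standard Theano configuration, not specific to this repository; newer Theano versions use device=cuda instead of device=gpu), for example:

THEANO_FLAGS=device=gpu,floatX=float32 python train_crf_rnn.py -model [output-model-file] -i [file-list-output]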

Directory Structure

.
|-bionlp            # Main directory containing the package
    |-data          # Contains the class definitions for the data format.
    |-evaluate      # Contains the Evaluation and the Postprocessing Script
    |-modifiers     # Contains functions to modify the data format, usually by adding feature vectors. 
    |-preprocess    # Contains preprocess functions that extract raw text and annotation files. 
    |-utils         # Misc utility functions. 
    |-taggers       
        |-rnn_feature   # Contains the main tagger code.
            |-networks  # Contains the code for CRF-LSTM models. See provided README.md file for details on each model
|-scripts           # Utility scripts.
|-sample-data       # Directory showing the input data format. .json files without the umls extension are
                      annotation files, needed only for training or for evaluation during deployment.
                      *.umls.json files are MetaMap annotation files.


ADE NER Experiment

Update

  • (26 Sep 2016) Added a script to extract text files from the CRIS database.
