
LSTM-NER

Introduction

An implementation of LSTM named entity recognition based on Keras. It uses two kinds of embeddings to represent characters: char-embeddings and char-position-embeddings.

Inspired by the work of Nanyun Peng and Mark Dredze, whose idea of using different kinds of embeddings for a NER task is brilliant.

I used the same embeddings provided by their open-source repo.

Reference

Named Entity Recognition for Chinese Social Media with Jointly Trained Embeddings

Nanyun Peng and Mark Dredze

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2015

If you use the code, please cite the following BibTeX:

@inproceedings{peng2015ner,
  title={Named Entity Recognition for Chinese Social Media with Jointly Trained Embeddings},
  author={Peng, Nanyun and Dredze, Mark},
  booktitle={EMNLP},
  pages={548--554},
  year={2015}
}

Dependencies

Because this project uses the TensorFlow API, it requires TensorFlow and a few other Python modules:

  • TensorFlow (>= r1.1)

They can all be installed easily with pip.

Data Format

The data format is basically consistent with the CRF++ toolkit. Training and test files consist of multiple tokens, and each token consists of multiple (but a fixed number of) columns. Each token must be written on one line, with its columns separated by whitespace (spaces or tabs). A sequence of tokens forms a sentence. (So far, this program only supports data with 3 columns.)

Sentence boundaries are marked by an empty line.

Here's an example of such a file: (data for Chinese NER)

...
感	O
动	O
了	O
李	B-PER.NAM
开	I-PER.NAM
复	I-PER.NAM
感	O
动	O

回	O
复	O
支	O
持	O
...
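The format above can be read with a few lines of Python. This is a minimal illustrative sketch, not the project's actual loader; the function name and file path are assumptions.

```python
def read_sentences(path):
    """Yield sentences as lists of token-column tuples from a CRF++-style file."""
    sentence = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line.strip():          # empty line marks a sentence boundary
                if sentence:
                    yield sentence
                    sentence = []
                continue
            sentence.append(tuple(line.split()))  # columns split on whitespace
    if sentence:                          # file may lack a trailing blank line
        yield sentence

# Example (hypothetical file name):
# sentences = list(read_sentences("train.txt"))
```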

Feature template

The file template specifies the feature template used for context-based feature extraction. Its field line gives the field name for each column of a token, and the templates describe how features are extracted.

For example, the basic template is:

# Fields(column), w,y&F are reserved names
w y
# templates.
w:-2
w:-1
w: 0
w: 1
w: 2

This means each token has only 2 columns of data, 'w' and 'y'. Field 'y' should always be the last column.

Note that the 'w', 'y' and 'F' fields are reserved; the program uses them to represent the word, the label, and the word's features.

Each token becomes a dict like {'w': '李', 'y': 'B-PER.NAM', 'F': ['w[-2]=动', 'w[-1]=了', ...]}.

The above template describes a classical context feature template:

  • C(n) n=-2,-1,0,1,2

'C(n)' is the value of token['w'] at relative position n.

If your tokens have more than 2 columns, you may need to change the fields and templates depending on how you want to extract features.

In this project, I disabled the feature suffix when extracting words within a context window.
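The C(n) template above can be sketched as follows. This is an illustrative reimplementation under assumptions, not the project's real code; the function name and the padding symbol are hypothetical.

```python
PAD = "</s>"  # hypothetical padding symbol for out-of-range positions

def extract_features(chars, window=(-2, -1, 0, 1, 2)):
    """Return one feature dict per character, following the C(n) template."""
    feats = []
    for i, ch in enumerate(chars):
        fs = []
        for n in window:
            j = i + n
            # value of token['w'] at relative position n, padded at boundaries
            val = chars[j] if 0 <= j < len(chars) else PAD
            fs.append("w[%d]=%s" % (n, val))
        feats.append({"w": ch, "F": fs})
    return feats
```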

Embeddings

This program supports word/char embeddings as input. When running it, you should provide an embedding file (word2vec standard output format) via the corresponding argument.

  1. char-embeddings

    Embeddings learned from each character in a large unlabeled text.

  2. char-position embeddings

    Character embeddings cannot distinguish between uses of the same character in different contexts, whereas word embeddings fail to make use of characters or character n-grams that are part of many words.

    'Char-position embeddings' are a compromise: character embeddings that are sensitive to the character's position within the word.

All of those embeddings are trained on a large corpus of Weibo messages.
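Word2vec text-format files (a header line with the vocabulary size and dimension, then one token per line followed by its vector) can be loaded with a short helper. A minimal sketch under those format assumptions; the function name is illustrative, not the project's API.

```python
import numpy as np

def load_embeddings(path):
    """Return (vocab dict token->row index, embedding matrix) from a
    word2vec text-format file."""
    vocab, vectors = {}, []
    with open(path, encoding="utf-8") as f:
        n, dim = map(int, f.readline().split())   # header: "<vocab_size> <dim>"
        for line in f:
            parts = line.rstrip().split(" ")
            vocab[parts[0]] = len(vectors)        # token -> row index
            vectors.append([float(x) for x in parts[1:dim + 1]])
    return vocab, np.array(vectors, dtype="float32")
```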

Usage

Environment settings

The env_settings.py file contains some environment settings, such as the output dir:

# These are some I/O dirs
# you need to change BASE_DIR for your own machine
BASE_DIR = r'project dir/'
MODEL_DIR = BASE_DIR + r'models/'
DATA_DIR = BASE_DIR + r'data/'
EMB_DIR = BASE_DIR + r'embeddings/'
OUTPUT_DIR = BASE_DIR + r'export/'
LOG_DIR = BASE_DIR + r'Summary/'

If you don't have those dirs in your project dir, just run python env_settings.py, and they will be created automatically.
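The directory-creation step amounts to the following sketch. BASE_DIR and the helper name here are illustrative placeholders, not the module's actual contents.

```python
import os

BASE_DIR = "project_dir/"  # placeholder; set this to your own project dir
DIRS = [BASE_DIR + d for d in ("models/", "data/", "embeddings/", "export/", "Summary/")]

def ensure_dirs(dirs):
    """Create each configured directory if it does not already exist."""
    for d in dirs:
        os.makedirs(d, exist_ok=True)  # no-op when the directory is present

if __name__ == "__main__":
    ensure_dirs(DIRS)
```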

Training

Just run ./main.py, or specify some arguments if you need to, like this:

python main.py --lr 0.005 --fine_tuning False --l2_reg 0.0002

Then the model will run with lr=0.005, no fine-tuning, l2_reg=0.0002, and all other arguments at their defaults. Using -h prints all help information. Some arguments are not usable yet, but I will fix that as soon as possible.

python main.py -h
usage: main.py [-h] [--train_data TRAIN_DATA] [--test_data TEST_DATA]
               [--valid_data VALID_DATA] [--log_dir LOG_DIR]
               [--model_dir MODEL_DIR] [--model MODEL]
               [--restore_model RESTORE_MODEL] [--emb_file EMB_FILE]
               [--emb_dim EMB_DIM] [--output_dir OUTPUT_DIR]
               [--only_test [ONLY_TEST]] [--noonly_test] [--lr LR]
               [--dropout DROPOUT] [--fine_tuning [FINE_TUNING]]
               [--nofine_tuning] [--eval_test [EVAL_TEST]] [--noeval_test]
               [--max_len MAX_LEN] [--nb_classes NB_CLASSES]
               [--hidden_dim HIDDEN_DIM] [--batch_size BATCH_SIZE]
               [--train_steps TRAIN_STEPS] [--display_step DISPLAY_STEP]
               [--l2_reg L2_REG] [--log [LOG]] [--nolog] [--template TEMPLATE]

optional arguments:
  -h, --help            show this help message and exit
  --train_data TRAIN_DATA
                        Training data file
  --test_data TEST_DATA
                        Test data file
  --valid_data VALID_DATA
                        Validation data file
  --log_dir LOG_DIR     The log dir
  --model_dir MODEL_DIR
                        Models dir
  --model MODEL         Model type: LSTM/BLSTM/CNNBLSTM
  --restore_model RESTORE_MODEL
                        Path of the model to restored
  --emb_file EMB_FILE   Embeddings file
  --emb_dim EMB_DIM     embedding size
  --output_dir OUTPUT_DIR
                        Output dir
  --only_test [ONLY_TEST]
                        Only do the test
  --noonly_test
  --lr LR               learning rate
  --dropout DROPOUT     Dropout rate of input layer
  --fine_tuning [FINE_TUNING]
                        Whether fine-tuning the embeddings
  --nofine_tuning
  --eval_test [EVAL_TEST]
                        Whether evaluate the test data.
  --noeval_test
  --max_len MAX_LEN     max num of tokens per query
  --nb_classes NB_CLASSES
                        Tagset size
  --hidden_dim HIDDEN_DIM
                        hidden unit number
  --batch_size BATCH_SIZE
                        num example per mini batch
  --train_steps TRAIN_STEPS
                        trainning steps
  --display_step DISPLAY_STEP
                        number of test display step
  --l2_reg L2_REG       L2 regularization weight
  --log [LOG]           Whether to record the TensorBoard log.
  --nolog
  --template TEMPLATE   Feature templates

Three types of models can be chosen via the '--model' argument:

  1. LSTM + CRF
  2. BiLSTM + CRF
  3. CNN + BiLSTM + CRF

Test

If you set 'only_test' to True or 'train_steps' to 0, the program will only run the test process.

In that case, you must give a specific path via 'restore_model'.

History

  • 2017-10-31 ver 0.2.2
    • Update Neural Text Process lib 0.2.0
    • Compatible modification in main file.
  • 2017-10-20 ver 0.2.1
    • Fix: Non-suffix for template in 'only test' process.
    • Fix: Now using correct dicts for embedding lookup table.
    • Fix: A bug of batch generator 'batch_index'.
  • 2017-09-12 ver 0.2.0
    • Update: process lib 0.1.2
    • Removed 'keras_src', completed the refactoring of the code hierarchy.
    • Added env_settings.py to make sure all default dirs exist.
    • Support restore model from file.
    • Support model selection.
  • 2017-07-06 ver 0.1.3
    • Added a new method 'accuracy', used to count correct labels.
    • Arguments 'emb_type' & 'emb_dir' now are deprecated.
    • New argument 'emb_file'
  • 2017-04-11 ver 0.1.2
    • Rewrite neural_tagger class method: loss.
    • Added a new tagger based on Bi-LSTM + CNNs, where the CNN extracts bigram features.
  • 2017-04-08 ver 0.1.1
    • Rewrote the lstm-ner & bi-lstm-ner classes.
  • 2017-03-03 ver 0.1.0
    • Used TensorFlow to implement the LSTM-NER model.
    • Basic functionality finished.
  • 2017-02-26 ver 0.0.3
    • lstm_ner basically completed.
    • Viterbi decoding algorithm and sequence labeling.
    • Pretreatment completed.
  • 2017-02-21 ver 0.0.2
    • Basic structure of the project.
    • Added 3 module files: features, pretreatment and constant.
    • Completed pretreatment's create-dictionary function.
  • 2017-02-20 ver 0.0.1
    • Initialization of this project.
    • README file
    • Some util functions and basic structure of the project.
