
GMB_corpus_ner

This is the repository for the Kaggle kernel https://www.kaggle.com/shoumikgoswami/ner-using-random-forest-and-crf. It uses an ensemble model that combines XGBoost, CRF, random forest, and BiLSTM+Attn+CRF, plus a model based on BERT. (Since my computer does not support training a model like BERT, my laptop being only a MacBook, I have to use the old computer in my house for CUDA, so I do not have the full BERT result; the code works fine, but training is very slow and it takes a long time to see a proper result.) In the BiLSTM, I use the GloVe vector and concatenate it with a 50-dim vector that describes the syntax label, so the word embedding in the BiLSTM is actually 150-dim.
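A minimal sketch of that concatenation in PyTorch (the layer name SyntaxAwareEmbedding and its constructor arguments are hypothetical; the actual implementation lives in models_func.py):

```python
import torch
import torch.nn as nn

class SyntaxAwareEmbedding(nn.Module):
    """Concatenate a 100-dim GloVe word vector with a trainable 50-dim
    embedding of the word's syntax label, giving 150 dims per token."""

    def __init__(self, glove_weights, n_syntax_labels, syntax_dim=50):
        super().__init__()
        # glove_weights: (vocab_size, 100) tensor of pretrained vectors
        self.word_emb = nn.Embedding.from_pretrained(glove_weights, freeze=True)
        self.syntax_emb = nn.Embedding(n_syntax_labels, syntax_dim)

    def forward(self, word_ids, syntax_ids):
        # (batch, seq, 100) ++ (batch, seq, 50) -> (batch, seq, 150)
        return torch.cat([self.word_emb(word_ids),
                          self.syntax_emb(syntax_ids)], dim=-1)
```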


Running Procedure:
1. Run make_data.py, in which you can adjust the sizes of the training, validation, and test datasets; in the experiment I use 2,349 training sentences, 150 validation sentences, and 350 test sentences.
2. Run each model's training script (ner_Xgboost.py, Rf.py, CRF.py, Copy_Attn.py); each script saves its model automatically.
3. Run vote_classifer.py to get the hard-voting result on the test dataset (see the voting sketch after this list).
4. To balance the dataset, I pass the 'balanced' class-weight parameter to the sklearn models, and I also give a small weight to the O-labeled word embeddings.
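As a rough illustration of steps 3 and 4, here is a hard-voting helper plus the 'balanced' class-weight parameter on a sklearn model; this is a simplified sketch with toy data, not the repository's full pipeline:

```python
import numpy as np
from collections import Counter
from sklearn.ensemble import RandomForestClassifier

# Step 4: class_weight='balanced' re-weights classes inversely to their
# frequency, which softens the dominance of the O label.
rf = RandomForestClassifier(n_estimators=100, class_weight='balanced')

def hard_vote(model_predictions):
    """Majority vote over per-token label predictions.

    model_predictions: list of equal-length label sequences, one per model."""
    stacked = np.stack(model_predictions, axis=0)   # (n_models, n_tokens)
    return [Counter(column).most_common(1)[0][0]    # most frequent label wins
            for column in stacked.T]

# Toy example: three models vote on four tokens.
votes = [['B-geo', 'O', 'O',     'B-per'],
         ['B-geo', 'O', 'B-org', 'B-per'],
         ['O',     'O', 'O',     'I-per']]
print(hard_vote(votes))   # ['B-geo', 'O', 'O', 'B-per']
```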
Result:
On the test set, the results are as follows:

Labels          precision   recall   f1-score   support
B-art               0.500    0.100      0.167        10
B-eve               0.750    0.273      0.400        11
B-geo               0.726    0.869      0.791       335
B-gpe               0.877    0.758      0.813       198
B-nat               0.500    0.500      0.500         2
B-org               0.733    0.664      0.697       211
B-per               0.749    0.753      0.751       182
B-tim               0.916    0.840      0.877       169
I-art               0.000    0.000      0.000         6
I-eve               1.000    0.111      0.200         9
I-geo               0.753    0.753      0.743        75
I-gpe               0.000    0.000      0.000         6
I-nat               1.000    1.000      1.000         1
I-org               0.742    0.685      0.712       168
I-per               0.813    0.880      0.845       217
I-tim               0.750    0.488      0.592        43
O                   0.990    0.994      0.992      9591
accuracy            0.958    0.957      0.960     11234
macro avg           0.694    0.567      0.593     11234
weighted avg        0.959    0.960      0.959     11234

We see that the classifier:
1. reaches a weighted accuracy of 0.959 and a non-weighted accuracy of 0.958;
2. reaches an F1 score of 0.96;
3. suffers on the severely imbalanced labels: classes like I-art or I-nat account for only about 20-40 tokens in a dataset of more than 60,000 words, so even though I tried to balance the dataset with upsampling and downsampling, the result did not change much; due to the lack of data, none of the models perform well on these extremely rare labels.

Since I still have to finish my graduate paper and at least one implementation, I did not go deeper into solving this problem, but I already have several ideas, such as porting computer vision's focal loss to this NER problem, since focal loss was designed for imbalanced classification (a sketch follows below). Additionally, with syntax tree analysis we could give a different score to each syntax label, which could help us better recognize named entities. Using a knowledge base such as YAGO as a resource would also improve the result, and pretrained models like BERT and XLNet could improve it further.
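A minimal sketch of that focal-loss idea for token classification (hypothetical code, not from this repository; gamma=2.0 is the value suggested in the original focal loss paper):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    """Focal loss for per-token NER classification.

    logits:  (n_tokens, n_labels) raw scores
    targets: (n_tokens,) gold label ids
    Down-weights easy, well-classified tokens (mostly the dominant O
    label) by the factor (1 - p_t) ** gamma."""
    ce = F.cross_entropy(logits, targets, reduction='none')
    p_t = torch.exp(-ce)             # model probability of the gold label
    return ((1.0 - p_t) ** gamma * ce).mean()
```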


Script and File Description:
1. models_func.py contains all the models and functions used in the NER process.
2. data_make.py splits the data according to the sizes you give.
3. CRF, ner_Xgboost, Copy_Attn, Rf, and Bert_BiLSTM_CRF are the different scripts you can run.
4. event_tensors is the GloVe vector package; to use it, you also need to download a GloVe vector file (I use the 100d GloVe vectors). You can also choose a random word vector; however, judging by the BiLSTM's accuracy, using GloVe vectors raises the accuracy by several points.
5. GMB_dataset.txt is the dataset used in this task.
6. non_O_BiLSTM_CRF_constrained.py is a BiLSTM trained without O-labeled words; however, perhaps due to the lack of context, its performance is bad, so I did not use it as a model.
7. crf1, crf2, xgb, rf, and attn_bilstm_crf are saved model files, which you can load with sklearn's joblib or torch's torch.load (see the loading sketch after this list).
8. syntax_embeds is the embedding trained to describe the syntax label, which is concatenated with the word vector.
9. train data, val data, and test data are the splits produced by make_data.py.
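A short sketch of reloading the saved models (the file names follow the listing above; I am assuming the PyTorch model was saved as a whole module with torch.save):

```python
import joblib
import torch

# sklearn-style models saved with joblib
crf = joblib.load('crf1')
rf = joblib.load('rf')
xgb = joblib.load('xgb')

# PyTorch BiLSTM+Attn+CRF model
bilstm = torch.load('attn_bilstm_crf', map_location='cpu')
bilstm.eval()   # disable dropout for inference
```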


Due to its size, I cannot upload the GloVe vector file to GitHub; you can download it at https://nlp.stanford.edu/projects/glove/ (a loading sketch follows below).
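If it helps, a common way to load the downloaded vectors into a dictionary (assuming the 100d file glove.6B.100d.txt; the repository's own loader is the event_tensors package):

```python
import numpy as np

def load_glove(path='glove.6B.100d.txt'):
    """Parse the GloVe text format: one word plus its vector per line."""
    vectors = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            word, *values = line.rstrip().split(' ')
            vectors[word] = np.asarray(values, dtype=np.float32)
    return vectors

glove = load_glove()
print(glove['the'].shape)   # (100,)
```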
I am still training a deeper BiLSTM model; since I did not tune any hyperparameters (only a 1-layer BiLSTM with 256 hidden dims) to find a good setting, the result still has room to improve.

