
GMB_corpus_ner

This is the repository for the Kaggle kernel https://www.kaggle.com/shoumikgoswami/ner-using-random-forest-and-crf. It uses an ensemble model that combines XGBoost, CRF, random forest, and BiLSTM+Attn+CRF, plus a model based on BERT. (Since my computer does not support training a model like BERT, my laptop being only a MacBook, I have to use the old computer in my house for CUDA, so I do not have the full BERT result; the code works fine, but training is very slow and it takes a long time to see a proper result.) In the BiLSTM, I use the GloVe vector and concatenate it with a 50-dim vector that describes the syntax label, so the word embedding in the BiLSTM is actually 150-dim.
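A minimal sketch of that concatenation in PyTorch (the layer name SyntaxAwareEmbedding and its constructor arguments are hypothetical; the actual implementation lives in models_func.py):

```python
import torch
import torch.nn as nn

class SyntaxAwareEmbedding(nn.Module):
    """Concatenate a 100-dim GloVe word vector with a trainable 50-dim
    embedding of the word's syntax label, giving 150 dims per token."""

    def __init__(self, glove_weights, n_syntax_labels, syntax_dim=50):
        super().__init__()
        # glove_weights: (vocab_size, 100) tensor of pretrained vectors
        self.word_emb = nn.Embedding.from_pretrained(glove_weights, freeze=True)
        self.syntax_emb = nn.Embedding(n_syntax_labels, syntax_dim)

    def forward(self, word_ids, syntax_ids):
        # (batch, seq, 100) ++ (batch, seq, 50) -> (batch, seq, 150)
        return torch.cat([self.word_emb(word_ids),
                          self.syntax_emb(syntax_ids)], dim=-1)
```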


Running Procedure:
1. Run make_data.py, in which you can adjust the sizes of the training, validation, and test datasets; in the experiment I use 2,349 training sentences, 150 validation sentences, and 350 test sentences.
2. Run each model's training script (ner_Xgboost.py, Rf.py, CRF.py, Copy_Attn.py); each script saves its model automatically.
3. Run vote_classifer.py to get the hard-voting result on the test dataset (see the voting sketch after this list).
4. To balance the dataset, I pass the 'balanced' class-weight parameter to the sklearn models, and I also give a small weight to the O-labeled word embeddings.
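As a rough illustration of steps 3 and 4, here is a hard-voting helper plus the 'balanced' class-weight parameter on a sklearn model; this is a simplified sketch with toy data, not the repository's full pipeline:

```python
import numpy as np
from collections import Counter
from sklearn.ensemble import RandomForestClassifier

# Step 4: class_weight='balanced' re-weights classes inversely to their
# frequency, which softens the dominance of the O label.
rf = RandomForestClassifier(n_estimators=100, class_weight='balanced')

def hard_vote(model_predictions):
    """Majority vote over per-token label predictions.

    model_predictions: list of equal-length label sequences, one per model."""
    stacked = np.stack(model_predictions, axis=0)   # (n_models, n_tokens)
    return [Counter(column).most_common(1)[0][0]    # most frequent label wins
            for column in stacked.T]

# Toy example: three models vote on four tokens.
votes = [['B-geo', 'O', 'O',     'B-per'],
         ['B-geo', 'O', 'B-org', 'B-per'],
         ['O',     'O', 'O',     'I-per']]
print(hard_vote(votes))   # ['B-geo', 'O', 'O', 'B-per']
```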
Result:
On the test set, the results are as follows:

Labels          precision   recall   f1-score   support
B-art               0.500    0.100      0.167        10
B-eve               0.750    0.273      0.400        11
B-geo               0.726    0.869      0.791       335
B-gpe               0.877    0.758      0.813       198
B-nat               0.500    0.500      0.500         2
B-org               0.733    0.664      0.697       211
B-per               0.749    0.753      0.751       182
B-tim               0.916    0.840      0.877       169
I-art               0.000    0.000      0.000         6
I-eve               1.000    0.111      0.200         9
I-geo               0.753    0.753      0.743        75
I-gpe               0.000    0.000      0.000         6
I-nat               1.000    1.000      1.000         1
I-org               0.742    0.685      0.712       168
I-per               0.813    0.880      0.845       217
I-tim               0.750    0.488      0.592        43
O                   0.990    0.994      0.992      9591
accuracy            0.958    0.957      0.960     11234
macro avg           0.694    0.567      0.593     11234
weighted avg        0.959    0.960      0.959     11234

We see that the classifier:
1. reaches a weighted accuracy of 0.959 and a non-weighted accuracy of 0.958;
2. reaches an F1 score of 0.96;
3. suffers on the severely imbalanced labels: classes like I-art or I-nat account for only about 20-40 tokens in a dataset of more than 60,000 words, so even though I tried to balance the dataset with upsampling and downsampling, the result did not change much; due to the lack of data, none of the models perform well on these extremely rare labels.

Since I still have to finish my graduate paper and at least one implementation, I did not go deeper into solving this problem, but I already have several ideas, such as porting computer vision's focal loss to this NER problem, since focal loss was designed for imbalanced classification (a sketch follows below). Additionally, with syntax tree analysis we could give a different score to each syntax label, which could help us better recognize named entities. Using a knowledge base such as YAGO as a resource would also improve the result, and pretrained models like BERT and XLNet could improve it further.
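A minimal sketch of that focal-loss idea for token classification (hypothetical code, not from this repository; gamma=2.0 is the value suggested in the original focal loss paper):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    """Focal loss for per-token NER classification.

    logits:  (n_tokens, n_labels) raw scores
    targets: (n_tokens,) gold label ids
    Down-weights easy, well-classified tokens (mostly the dominant O
    label) by the factor (1 - p_t) ** gamma."""
    ce = F.cross_entropy(logits, targets, reduction='none')
    p_t = torch.exp(-ce)             # model probability of the gold label
    return ((1.0 - p_t) ** gamma * ce).mean()
```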


Script and File Description:
1. models_func.py contains all the models and functions used in the NER process.
2. data_make.py splits the data according to the sizes you give.
3. CRF, ner_Xgboost, Copy_Attn, Rf, and Bert_BiLSTM_CRF are the different scripts you can run.
4. event_tensors is the GloVe vector package; to use it, you also need to download a GloVe vector file (I use the 100d GloVe vectors). You can also choose a random word vector; however, judging by the BiLSTM's accuracy, using GloVe vectors raises the accuracy by several points.
5. GMB_dataset.txt is the dataset used in this task.
6. non_O_BiLSTM_CRF_constrained.py is a BiLSTM trained without O-labeled words; however, perhaps due to the lack of context, its performance is bad, so I did not use it as a model.
7. crf1, crf2, xgb, rf, and attn_bilstm_crf are saved model files, which you can load with sklearn's joblib or torch's torch.load (see the loading sketch after this list).
8. syntax_embeds is the embedding trained to describe the syntax label, which is concatenated with the word vector.
9. train data, val data, and test data are the splits produced by make_data.py.
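A short sketch of reloading the saved models (the file names follow the listing above; I am assuming the PyTorch model was saved as a whole module with torch.save):

```python
import joblib
import torch

# sklearn-style models saved with joblib
crf = joblib.load('crf1')
rf = joblib.load('rf')
xgb = joblib.load('xgb')

# PyTorch BiLSTM+Attn+CRF model
bilstm = torch.load('attn_bilstm_crf', map_location='cpu')
bilstm.eval()   # disable dropout for inference
```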


Due to its size, I cannot upload the GloVe vector file to GitHub; you can download it at https://nlp.stanford.edu/projects/glove/ (a loading sketch follows below).
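If it helps, a common way to load the downloaded vectors into a dictionary (assuming the 100d file glove.6B.100d.txt; the repository's own loader is the event_tensors package):

```python
import numpy as np

def load_glove(path='glove.6B.100d.txt'):
    """Parse the GloVe text format: one word plus its vector per line."""
    vectors = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            word, *values = line.rstrip().split(' ')
            vectors[word] = np.asarray(values, dtype=np.float32)
    return vectors

glove = load_glove()
print(glove['the'].shape)   # (100,)
```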
I am still training a deeper BiLSTM model; since I did not tune any hyperparameters (only a 1-layer BiLSTM with 256 hidden dims) to find a good setting, the result still has room to improve.

