
thai2fit's Introduction

thai2fit (formerly thai2vec)

ULMFit Language Modeling, Text Feature Extraction and Text Classification in the Thai Language. Created as part of pyThaiNLP, with the ULMFit implementation from fast.ai.

Models and word embeddings can also be downloaded via Dropbox.

We pretrained a language model with 60,005 embeddings on the Thai Wikipedia Dump (perplexity of 28.71067) and fine-tuned it for text classification (micro-averaged F1 score of 0.60322 on a 5-label classification problem), compared to 0.5109 by fastText and 0.4976 by LinearSVC on the Wongnai Challenge: Review Rating Prediction. The language model can also be used to extract text features for other downstream tasks.

random word vectors

Dependencies

  • Python>=3.6
  • PyTorch>=1.0
  • fastai>=1.0.38

Version History

v0.1

  • Pretrained language model based on Thai Wikipedia with the perplexity of 46.61
  • Pretrained word embeddings (.vec) with 51,556 tokens and 300 dimensions
  • Classification benchmark of 94.4% accuracy compared to 65.2% by fastText for 4-label classification of BEST

v0.2

  • Refactored to use fastai.text instead of torchtext
  • Pretrained word embeddings (.vec and .bin) with 60,000 tokens and 300 dimensions (word2vec_examples.ipynb)
  • Classification benchmark of 0.60925 micro-averaged F1 score compared to 0.49366 by fastText and 0.58139 by competition winner for 5-label classification of Wongnai Challenge: Review Rating Prediction (ulmfit_wongnai.ipynb)
  • Text feature extraction for other downstream tasks such as clustering (ulmfit_ec.ipynb)

v0.3

  • Repo name changed to thai2fit to avoid confusion, since this is a ULMFit implementation, not word2vec
  • Migrated to PyTorch 1.0 and the fastai 1.0 API
  • Added QRNN-based models; inference time drops by 50% on average
  • Pretrained language model based on Thai Wikipedia with the perplexity of 46.04264 (20% validation) and 23.32722 (1% validation) (pretrain_wiki.ipynb)
  • Pretrained word embeddings (.vec and .bin) with 60,000 tokens and 400 dimensions (word2vec_examples.ipynb) based on QRNN
  • Classification benchmark of 0.60925 micro-averaged F1 score compared to 0.49366 by fastText and 0.58139 by competition winner for 5-label classification of Wongnai Challenge: Review Rating Prediction (ulmfit_wongnai.ipynb)
  • LSTM weights are copied from v0.2 according to the guideline provided in the fastai forum:

I remember someone doing a script but I can’t find it. For both, you just have to map the old names of the weights to the new ones. Note that:

  • in language models, there is a bias in the decoder in fastai v1 that you probably won’t have
  • in the classifier, the order you see for the layers is artificial (it’s the PyTorch representation that takes things in the order you put them in __init__ when not using Sequential), but the two models (old and new) apply batchnorm, dropout and linear in the same order
  • tokenizing is done differently in fastai v1, so you may have to fine-tune your models again (we add an xxmaj token for words beginning with a capital, for instance)
  • for weight dropout, you want the weights you have put in both '0.rnns.0.module.weight_hh_l0' and '0.rnns.0.weight_hh_l0_raw' (the second one is copied to the first with dropout applied anyway)
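The weight migration described above boils down to renaming state-dict keys and duplicating the weight-dropout tensor. A minimal sketch of that bookkeeping, assuming hypothetical key names (inspect both checkpoints to build the real mapping):

```python
# Sketch of remapping old (v0.2) weight names to the fastai v1 layout.
# The concrete key names here are illustrative, not the exact checkpoint contents.
def remap_state_dict(old_state, name_map):
    """Return a new state dict with keys renamed via name_map;
    keys not listed in name_map are copied unchanged."""
    return {name_map.get(k, k): v for k, v in old_state.items()}

def duplicate_for_weight_dropout(state, base='0.rnns.0'):
    """Put the same tensor under both the module key and the *_raw key:
    fastai copies raw -> module with dropout applied at runtime anyway."""
    state[base + '.weight_hh_l0_raw'] = state[base + '.module.weight_hh_l0']
    return state
```

A renamed checkpoint produced this way can then be loaded with `load_state_dict` as usual.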

v0.31

v0.32

  • Better text cleaning rules resulting in Thai Wikipedia Dump pretrained perplexity of 28.71067.

v0.4 (In Progress)

  • Replace AWD-LSTM/QRNN with transformer-based models
  • Named-entity recognition

Text Classification

We trained the ULMFit model implemented by thai2fit for text classification. We used the Wongnai Challenge: Review Rating Prediction as our benchmark, as it was the only sizeable and publicly available text classification dataset at the time of writing (June 21, 2018). It has 39,999 reviews for training and validation, and 6,203 reviews for testing.

We achieved validation perplexity at 35.75113 and validation micro F1 score at 0.598 for five-label classification. Micro F1 scores for public and private leaderboards are 0.59313 and 0.60322 respectively, which are state-of-the-art as of the time of writing (February 27, 2019). FastText benchmark based on their own pretrained embeddings has the performance of 0.50483 and 0.49366 for public and private leaderboards respectively. See ulmfit_wongnai.ipynb for more details.
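For reference, micro-averaged F1 pools true positives, false positives and false negatives across all classes before computing F1. A minimal sketch for single-label multiclass predictions (where each wrong prediction is one false positive for the predicted class and one false negative for the true class):

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP/FP/FN over all classes, then compute F1.
    For single-label multiclass output this reduces to accuracy."""
    tp = sum(t == p for t, p in zip(y_true, y_pred))
    fp = len(y_pred) - tp  # each wrong prediction is a FP for its class...
    fn = len(y_true) - tp  # ...and a FN for the true class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

For example, `micro_f1([1, 2, 3, 1], [1, 2, 1, 1])` gives 0.75.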

Text Feature Extraction

The pretrained language model of thai2fit can be used to convert Thai texts into vectors, which can then be used for various machine learning tasks such as classification, clustering, translation, question answering and so on. The idea is to train a language model that "understands" the texts, then extract certain vectors that the model "thinks" represent the texts we want. You can access this functionality easily via pythainlp:

from pythainlp.ulmfit import *
document_vector('วันนี้วันดีปีใหม่', learn, data)
>> array([ 0.066298,  0.307813,  0.246051,  0.008683, ..., -0.058363,  0.133258, -0.289954, -1.770246], dtype=float32)

Language Modeling

The goal of this notebook is to train a language model using the fast.ai version of the AWD-LSTM language model, with data from the Thai Wikipedia Dump last updated February 17, 2019. Using a 40M/200k/200k-token train-validation-test split, we achieved validation perplexity of 27.81627 with 60,004 embeddings at 400 dimensions, compared to the state of the art as of October 27, 2018 of 42.41 for English WikiText-2 by Yang et al. (2018). To the best of our knowledge, there is no comparable research in the Thai language at the time of writing (February 17, 2019). See thwiki_lm for more details.
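The perplexity figures above are the exponential of the mean per-token cross-entropy (negative log-likelihood) on held-out text; a minimal sketch of the computation:

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood), where
    token_log_probs are the natural-log probabilities the model
    assigned to each held-out token."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)
```

Sanity check: a model that assigns every token probability 1/4 has perplexity 4.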

Word Embeddings

We use the embeddings from v0.1, since they were trained specifically for word2vec, as opposed to later versions, which are geared toward classification. The thai2vec.bin file contains 51,556 word embeddings of 300 dimensions, in descending order by frequency (see thai2vec.vocab). The files are in word2vec format, readable by gensim. The most common applications include word vector visualization, word arithmetic, word grouping, cosine similarity and sentence or document vectors. For sample code, see thwiki_lm/word2vec_examples.ipynb.

Word Arithmetic

You can do simple "arithmetic" with words based on the word vectors such as:

  • ผู้หญิง (female) + ราชา (king) - ผู้ชาย (male) = ราชินี (queen)
  • หุ้น (stock) - พนัน (gambling) = กิจการ (business)
  • อเมริกัน (american) + ฟุตบอล (football) = เบสบอล (baseball)
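With gensim this is done via `most_similar(positive=[...], negative=[...])` on the loaded vectors. The underlying idea can be sketched with made-up 2-D toy vectors (not the actual thai2vec embeddings):

```python
import math

# Made-up 2-D vectors for illustration only.
vecs = {
    'king':   [0.9, 0.8], 'queen':  [0.1, 0.8],
    'male':   [0.9, 0.1], 'female': [0.1, 0.1],
    'prince': [0.9, 0.5],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def analogy(positive, negative):
    """female + king - male ~= queen: add the positive vectors, subtract
    the negative ones, and return the nearest remaining word by cosine."""
    target = [0.0, 0.0]
    for w in positive:
        target = [a + b for a, b in zip(target, vecs[w])]
    for w in negative:
        target = [a - b for a, b in zip(target, vecs[w])]
    candidates = [w for w in vecs if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(target, vecs[w]))
```

With these toy vectors, `analogy(['female', 'king'], ['male'])` returns `'queen'`.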

word arithmetic

Word Grouping

It can also be used to do word groupings. For instance:

  • อาหารเช้า อาหารสัตว์ อาหารเย็น อาหารกลางวัน (breakfast animal-food dinner lunch) - อาหารสัตว์ (animal-food) is a type of food, whereas the others are meals of the day
  • ลูกสาว ลูกสะใภ้ ลูกเขย ป้า (daughter daughter-in-law son-in-law aunt) - ลูกสาว (daughter) is immediate family, whereas the others are not
  • กด กัด กิน เคี้ยว (press bite eat chew) - กด (press) is not a verb for the eating process. Note that this could rely on a different "take" than you would expect. For example, you could have answered ลูกเขย in the second example because it is the one associated with the male gender.
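With gensim this corresponds to `doesnt_match` on the loaded vectors; the idea (pick the word least similar to the group mean) can be sketched with made-up toy vectors:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def odd_one_out(words, vecs):
    """Return the word least similar (by cosine) to the mean of the group."""
    dim = len(vecs[words[0]])
    mean = [sum(vecs[w][i] for w in words) / len(words) for i in range(dim)]
    return min(words, key=lambda w: cosine(vecs[w], mean))

# Made-up 2-D vectors: three meals cluster together, animal food does not.
toy = {'breakfast': [1.0, 0.0], 'lunch': [0.9, 0.1],
       'dinner': [1.0, 0.1], 'animal-food': [0.0, 1.0]}
```

Here `odd_one_out(list(toy), toy)` picks out `'animal-food'`.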

word grouping

Cosine Similarity

Calculate cosine similarity between two word vectors.

  • จีน (China) and ปักกิ่ง (Beijing): 0.31359560752667964
  • อิตาลี (Italy) and โรม (Rome): 0.42819627065839394
  • ปักกิ่ง (Beijing) and โรม (Rome): 0.27347283956785434
  • จีน (China) and โรม (Rome): 0.02666692964073511
  • อิตาลี (Italy) and ปักกิ่ง (Beijing): 0.17900795797557473
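With gensim these numbers come from `similarity(w1, w2)` on the loaded vectors; the quantity itself is just the normalized dot product:

```python
import math

def cosine_similarity(u, v):
    """cos(theta) = (u . v) / (|u| * |v|); 1.0 for parallel vectors,
    0.0 for orthogonal ones."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

For example, `cosine_similarity([1, 0], [0, 1])` is 0.0, and any vector compared with a positive multiple of itself gives 1.0.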

cosine similarity

Citation

@software{charin_polpanumas_2021_4429691,
  author       = {Charin Polpanumas and
                  Wannaphong Phatthiyaphaibun},
  title        = {thai2fit: Thai language Implementation of ULMFit},
  month        = jan,
  year         = 2021,
  publisher    = {Zenodo},
  version      = {v0.3},
  doi          = {10.5281/zenodo.4429691},
  url          = {https://doi.org/10.5281/zenodo.4429691}
}

NLP Workshop at Chiangmai University

thai2fit's People

Contributors

cstorm125, wannaphong


thai2fit's Issues

thai2fit at subword level

thai2fit subword version

token level

  • newmm
  • sefr cut
  • ssg
  • sentencepiece (ask louise)

datasets for LM

  • wikipedia (ask louise)
  • prachathai67k
  • thaisum

downstream tasks

text classification

  • wongnai_reviews
  • wisesight_sentiment
  • prachathai67k
  • generated_reviews_enth
  • thai_toxicity_tweet

Over Data Quota

Hello,

I tried to git lfs clone this repository to check thai2vec.vec, but I got an error about the data quota of this repository, as follows:

git lfs clone https://github.com/cstorm125/thai2vec.git
Cloning into 'thai2vec'...
remote: Counting objects: 68, done.
remote: Total 68 (delta 0), reused 0 (delta 0), pack-reused 68
Unpacking objects: 100% (68/68), done.
Git LFS: (0 of 8 files) 0 B / 706.32 MB
batch response: This repository is over its data quota. Purchase more data packs to restore access.

Can you please share the large files via other sources?

About the perplexity of the language model

I have just looked through your code, but I haven't seen code to evaluate the perplexity of the language model. Do you know how I can evaluate the perplexity of the language model on my own data? Thanks.

Possible improvements

Hi krub P' Charin,

I just have a few suggestions that might make the library easier to use:

  • Make it a library with a load_model function, so we can do from thai2vec import load_model, which loads the trained vectors directly. It would be even better if we could also have the word vectors in binary format, e.g. thai2vec.bin, which could be loaded using the fastText library.
  • It would be great if you could add to README.md how you tokenize and clean the text, i.e. which libraries you use to tokenize Thai text or how you replace digits. -- (I just saw that you use pythainlp, NVM!)
  • Multiple dimensions > 100d, 300d?
