thai2vec

Language Modeling, Word2Vec and Text Classification in Thai Language. Created as part of pyThaiNLP.

We provide state-of-the-art language modeling (perplexity of 46.61 on Thai Wikipedia) and text classification (94.4% accuracy on a four-label classification problem, compared with 65.2% for fastText on NECTEC's BEST dataset). Credits to fast.ai.

Word Embeddings

The thai2vec.vec file contains 51556 word embeddings of 300 dimensions, in descending order of word frequency (see thai2vec.vocab). The files are in word2vec format readable by gensim. The most common applications include word vector visualization, word arithmetic, word grouping, cosine similarity, and sentence or document vectors. For sample code, see examples.ipynb.
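As a rough illustration of the file layout, the sketch below parses the word2vec text format: a header line with the word count and dimension, then one word and its values per line. It assumes thai2vec.vec is the text variant rather than the binary one, which this README does not state. The helper name `read_word2vec_text` and the toy values are made up for the example. With gensim installed, `KeyedVectors.load_word2vec_format("thai2vec.vec", binary=False)` would be the usual route under the same assumption.

```python
import io


def read_word2vec_text(handle):
    """Parse word2vec text format: a 'count dim' header, then 'word v1 ... vN' rows."""
    count, dim = (int(x) for x in handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed rows
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return count, dim, vectors


# Toy stand-in for thai2vec.vec (3 words, 4 dimensions); the values are invented.
sample = io.StringIO(
    "3 4\n"
    "จีน 0.1 0.2 0.3 0.4\n"
    "โรม 0.0 -0.1 0.5 0.2\n"
    "ราชา 0.3 0.3 -0.2 0.1\n"
)
count, dim, vectors = read_word2vec_text(sample)
print(count, dim, list(vectors))
```

For the real file, the header should read 51556 and 300, matching the figures above.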

Word Arithmetic

You can do simple "arithmetic" with words based on the word vectors such as:

  • ผู้หญิง (female) + ราชา (king) - ผู้ชาย (male) = ราชินี (queen)
  • หุ้น (stock) - พนัน (gambling) = กิจการ (business)
  • อเมริกัน (American) + ฟุตบอล (football) = เบสบอล (baseball)
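The arithmetic above can be sketched as: add and subtract the vectors, then return the vocabulary word whose vector is closest to the result by cosine similarity. This is a simplified illustration on hypothetical two-dimensional toy vectors, not the code from examples.ipynb. The `analogy` helper is invented for the example, and unlike gensim's `most_similar` it does not normalize the input vectors before combining them.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def analogy(vectors, positive, negative):
    """Word closest (by cosine) to sum(positive) - sum(negative), excluding the inputs."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for w in positive:
        target = [t + x for t, x in zip(target, vectors[w])]
    for w in negative:
        target = [t - x for t, x in zip(target, vectors[w])]
    candidates = (w for w in vectors if w not in positive and w not in negative)
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


# Hypothetical toy vectors: axis 0 stands for "royalty", axis 1 for "gender".
toy = {
    "ราชา": [1.0, 1.0],     # king
    "ราชินี": [1.0, -1.0],   # queen
    "ผู้ชาย": [0.0, 1.0],    # male
    "ผู้หญิง": [0.0, -1.0],  # female
    "กิจการ": [-1.0, 0.1],   # business (distractor)
}
result = analogy(toy, positive=["ผู้หญิง", "ราชา"], negative=["ผู้ชาย"])
print(result)
```

With real 300-dimensional embeddings the nearest word is rarely an exact match, so results such as the third bullet are best read as "closest available word" rather than a strict equation.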

Word Grouping

It can also be used to do word groupings. For instance:

  • อาหารเช้า อาหารสัตว์ อาหารเย็น อาหารกลางวัน (breakfast animal-food dinner lunch) - อาหารสัตว์ (animal-food) is a type of food whereas the others are meals of the day
  • ลูกสาว ลูกสะใภ้ ลูกเขย ป้า (daughter daughter-in-law son-in-law aunt) - ลูกสาว (daughter) is immediate family whereas the others are not
  • กด กัด กิน เคี้ยว (press bite eat chew) - กด (press) is not a verb for the eating process

Note that the model could be relying on a different "take" than you would expect. For example, you could have answered ลูกเขย (son-in-law) in the second example because it is the only one associated with the male gender.
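One way to pick the odd word out of a group is to normalize each vector, average them, and return the word least aligned with that average. This is similar in spirit to gensim's `doesnt_match`, but the README does not say which method produced the examples above, so treat it as an assumption. The `odd_one_out` helper and the toy vectors are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def odd_one_out(vectors, words):
    """Word whose unit vector has the smallest dot product with the group's mean unit vector."""
    units = {w: normalize(vectors[w]) for w in words}
    dim = len(next(iter(units.values())))
    mean = [sum(u[i] for u in units.values()) / len(units) for i in range(dim)]
    return min(words, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))


# Hypothetical toy vectors: three meal words point one way, animal food another.
toy = {
    "อาหารเช้า": [0.9, 0.1],      # breakfast
    "อาหารกลางวัน": [0.8, 0.2],   # lunch
    "อาหารเย็น": [0.85, 0.15],    # dinner
    "อาหารสัตว์": [0.1, 0.9],     # animal food
}
outlier = odd_one_out(toy, list(toy))
print(outlier)
```

Because the choice depends only on vector geometry, the selected word reflects whatever feature dominates the embeddings, which is why the model's "take" can differ from a human's.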

Cosine Similarity

Calculate cosine similarity between two word vectors.

  • จีน (China) and ปักกิ่ง (Beijing): 0.31359560752667964
  • อิตาลี (Italy) and โรม (Rome): 0.42819627065839394
  • ปักกิ่ง (Beijing) and โรม (Rome): 0.27347283956785434
  • จีน (China) and โรม (Rome): 0.02666692964073511
  • อิตาลี (Italy) and ปักกิ่ง (Beijing): 0.17900795797557473
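Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it ranges from -1 (opposite directions) through 0 (orthogonal) to 1 (same direction). A minimal sketch on made-up vectors follows. The scores listed above come from the real 300-dimensional embeddings and are not reproduced here.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up 2-d vectors, chosen so the results are easy to check by hand.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
forty_five_degrees = cosine_similarity([1.0, 0.0], [1.0, 1.0])
print(same_direction, orthogonal, forty_five_degrees)
```

Read this way, the low score for จีน (China) and โรม (Rome) says the two vectors are nearly orthogonal, while each country scores noticeably higher with its own capital.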

Sentence/Document Vectors

One of the most immediate use cases for thai2vec is using it to estimate a sentence vector for text classification.
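The README does not say how the sentence vector is estimated. A common baseline, assumed here, is to average the vectors of the tokens found in the vocabulary. The sketch takes already-tokenized input (for Thai, word segmentation would happen upstream, for example with pyThaiNLP) and uses hypothetical toy vectors. `sentence_vector` is a name invented for the example.

```python
def sentence_vector(tokens, vectors, dim):
    """Average the vectors of known tokens; return a zero vector if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


# Hypothetical 2-d toy vectors.
toy = {
    "ฉัน": [1.0, 0.0],   # I
    "กิน": [0.0, 1.0],   # eat
    "ข้าว": [1.0, 1.0],  # rice
}
tokens = ["ฉัน", "กิน", "ข้าว", "อร่อย"]  # "อร่อย" is out of vocabulary here and is skipped
vec = sentence_vector(tokens, toy, dim=2)
print(vec)
```

The resulting fixed-length vector can then be fed to any standard classifier as the feature representation of the sentence.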

Language Modeling

The Thai word embeddings and language model are trained using the fast.ai version of the AWD LSTM Language Model (basically an LSTM with dropouts) on data from Wikipedia (pulled on January 16, 2018). We achieved a perplexity of 46.61 with 51556 embeddings (80/20 validation split; text tokenized with pyThaiNLP), compared to the state of the art for English as of November 17, 2017, at 40.68. To the best of our knowledge, there was no comparable research for Thai at the time of writing (January 25, 2018). Details can be found in the notebook language_modeling.ipynb.
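For readers unfamiliar with the metric: perplexity is conventionally the exponential of the average negative log-likelihood per token, so lower is better. The sketch below uses that standard definition with natural logarithms. Whether language_modeling.ipynb computes it exactly this way is an assumption, and the log-probabilities are invented.

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative natural-log probability assigned to each token."""
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# A model that spreads probability uniformly over 4 choices has perplexity close to 4.
uniform = [math.log(0.25)] * 10
uniform_ppl = perplexity(uniform)

# Per-token cross-entropy (in nats) implied by the reported perplexity of 46.61.
implied_cross_entropy = math.log(46.61)
print(uniform_ppl, implied_cross_entropy)
```

Intuitively, a perplexity of 46.61 means the model is on average about as uncertain as if it were choosing uniformly among roughly 47 words at each position.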

Text Classification

We follow the approach of Howard and Ruder (2018) for finetuning language models for text classification. The language model used is the one previously trained: the fast.ai version of AWD LSTM Language Model. The dataset is NECTEC's BEST, in which each text is labeled as article, encyclopedia, news or novel. We preprocessed the data to remove the segmentation tokens and used an 80/20 split for training and validation. This resulted in 119241 sentences in the training set and 29250 sentences in the validation set. We achieved 94.4% accuracy on four-label classification using the finetuned model, compared to 65.2% by fastText using their own pretrained embeddings.

To-do

  • Language modeling based on wikipedia dump
  • Extract embeddings and save as gensim format
  • Fine-tuning model for text classification on BEST
  • Benchmark text classification with FastText
