subword_study's Introduction

Subword-informed word representation training framework

We provide a general framework for training subword-informed word representations by varying the following components: subword segmentation methods, subword embeddings, position embeddings, and composition functions.

For the whole framework architecture and more details, please refer to the reference.

There are 4 segmentation methods, 3 possible ways of embedding subwords, 3 ways of enhancing with position embeddings, and 3 different composition functions.

Here is a full table of different options and their labels:

| Component | Option | Label |
|---|---|---|
| Segmentation methods | CHIPMUNK | sms |
| | Morfessor | morf |
| | BPE | bpe |
| | Character n-gram | charn |
| Subword embeddings | w/o word token | - |
| | w/ word token | ww |
| | w/ morphotactic tag (only for sms) | wp |
| Position embeddings | w/o position embedding | - |
| | addition | pp (not applicable to wp) |
| | elementwise multiplication | mp (not applicable to wp) |
| Composition functions | addition | add |
| | single self-attention | att |
| | multi-head self-attention | mtxatt |

For example, sms.wwppmtxatt means we use CHIPMUNK for segmentation, insert the word token into the subword sequence, enhance with additive position embeddings, and use multi-head self-attention as the composition function.
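To make the label scheme concrete, here is a minimal sketch of a parser for labels such as sms.wwppmtxatt. It is not part of the framework. The function name parse_config and the returned dictionary keys are hypothetical. It also assumes that options labelled "-" are simply omitted from the label string and that ww may be followed by wp.

```python
import re

# Assumed label layout: <segmentation>.<ww?><wp?><pp|mp?><composition>
# Options labelled "-" in the table are assumed to be left out of the string.
_LABEL = re.compile(
    r"^(?P<seg>sms|morf|bpe|charn)\."
    r"(?P<ww>ww)?(?P<wp>wp)?(?P<pos>pp|mp)?"
    r"(?P<comp>add|att|mtxatt)$"
)


def parse_config(label):
    """Split a config label into its four components (hypothetical helper)."""
    m = _LABEL.match(label)
    if m is None:
        raise ValueError(f"unrecognised config label: {label!r}")
    if m["wp"] and m["seg"] != "sms":
        raise ValueError("wp (morphotactic tags) is only available for sms")
    if m["wp"] and m["pos"]:
        raise ValueError("position embeddings are not applicable to wp")
    return {
        "segmentation": m["seg"],
        "word_token": m["ww"] is not None,
        "morph_tag": m["wp"] is not None,
        "position": m["pos"],  # None means no position embedding
        "composition": m["comp"],
    }


if __name__ == "__main__":
    print(parse_config("sms.wwppmtxatt"))
```

The two explicit checks mirror the restrictions in the table: wp is only defined for sms, and pp/mp cannot be combined with wp.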

Subword segmentation methods

Taking the word dishonestly as an example, the different segmentation methods segment it into the following subword sequences:

  • CHIPMUNK: (<dis, honest, ly>) + (PREFIX, ROOT, SUFFIX)
  • Morfessor: (<dishonest, ly>)
  • BPE (10k merge ops): (<dish, on, est, ly>)
  • Character n-gram (from 3 to 6): (<di, dis, ..., ly>, <dis, ..., tly>, <dish, ..., stly>, <disho, ..., estly>)

where < and > are word start and end markers.
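The character n-gram segmentation can be illustrated with a short sketch. The function name char_ngrams is hypothetical and this is not the framework's own implementation. It only reproduces the pattern shown in the example: add the < and > markers, then slide windows of length 3 to 6 over the marked word.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return all character n-grams of the marked word, shortest n first."""
    marked = f"<{word}>"  # add word start and end markers
    return [
        marked[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(marked) - n + 1)
    ]


if __name__ == "__main__":
    grams = char_ngrams("dishonestly")
    print(grams[:3], "...", grams[-1])
```

For dishonestly, the 3-grams run from <di to ly> and the 6-grams from <disho to estly>, matching the sequence listed above.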

After segmentation, we obtain a subword sequence S for each segmentation method, and an additional morphotactic tag sequence T for sms.

Subword embeddings and position embeddings

We can embed the subword sequence S directly into a subword embedding sequence by looking it up in the subword embedding matrix, or insert a word token (ww) into S before embedding; for sms, the sequence then becomes (<dis, honest, ly>, <dishonestly>).

Then we can enhance the subword embeddings with position embeddings, combined either by addition (pp) or by elementwise multiplication (mp).
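The two position-embedding modes can be sketched with plain Python lists. This is an illustration only, not the framework's code. The function names enhance and compose_add are hypothetical, the vectors are toy values, and the final step uses the simplest composition function (add) from the table.

```python
def enhance(subword_vecs, position_vecs, mode):
    """Combine each subword vector with the position vector at the same index."""
    if mode == "-":  # no position embedding
        return [list(vec) for vec in subword_vecs]
    if mode == "pp":  # addition
        return [[s + p for s, p in zip(sv, pv)]
                for sv, pv in zip(subword_vecs, position_vecs)]
    if mode == "mp":  # elementwise multiplication
        return [[s * p for s, p in zip(sv, pv)]
                for sv, pv in zip(subword_vecs, position_vecs)]
    raise ValueError(f"unknown position mode: {mode!r}")


def compose_add(vecs):
    """Composition by addition: sum the vectors dimension by dimension."""
    return [sum(dim) for dim in zip(*vecs)]


if __name__ == "__main__":
    subwords = [[1, 2], [3, 4]]        # toy subword embeddings
    positions = [[10, 20], [30, 40]]   # toy position embeddings
    print(compose_add(enhance(subwords, positions, "pp")))
    print(compose_add(enhance(subwords, positions, "mp")))
```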

For sms, we can also embed the concatenation of each subword and its morphotactic tag (wp): (<dis:PREFIX, honest:ROOT, ly>:SUFFIX); <dishonestly>:WORD will be inserted if we also choose ww. Note that position embeddings are not applicable to wp, as a kind of morphological position information has already been provided.
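The ww and wp options can be summarised as a small sequence-building sketch. The function name build_sequence and its parameters are hypothetical. The output format simply follows the dishonestly examples above and is not taken from the framework's code.

```python
def build_sequence(subwords, word, tags=None, ww=False, wp=False):
    """Build the token sequence to embed from a segmented word."""
    if not subwords:
        raise ValueError("subwords must not be empty")
    marked = list(subwords)
    marked[0] = "<" + marked[0]    # word start marker on the first subword
    marked[-1] = marked[-1] + ">"  # word end marker on the last subword
    if wp:
        # wp: attach the morphotactic tag to each subword (sms only)
        if tags is None or len(tags) != len(subwords):
            raise ValueError("wp needs one morphotactic tag per subword")
        marked = [f"{s}:{t}" for s, t in zip(marked, tags)]
    if ww:
        # ww: append the whole word as an extra token
        token = f"<{word}>"
        marked.append(token + ":WORD" if wp else token)
    return marked


if __name__ == "__main__":
    subwords = ["dis", "honest", "ly"]
    tags = ["PREFIX", "ROOT", "SUFFIX"]
    print(build_sequence(subwords, "dishonestly", ww=True))
    print(build_sequence(subwords, "dishonestly", tags, ww=True, wp=True))
```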

Prerequisites

Install Python packages and segmentation method packages

  • Make sure you are in the subword_study folder
  • Run ./prereq.sh
    • If the script does not execute, run chmod +x prereq.sh and then run the command above again (repeat this step for all the .sh files)
  • After this command, the following files for Marathi should be generated in the 'toy_data/ma' folder:
    • ma.sent.1m
    • ma.sent.1m.5.word
    • ma.sent.1m.5.dict
    • ma.sent.1m.5.morf
    • ma.sent.1m.5.bpe

Running training with different configs

  • cd code
  • ./run.sh ma encodingType config learningRate batch_size
  • The trained model for each run will be generated in the ./code/outfiles/ folder
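When launching several runs, the command above can be wrapped in a small Python helper. This is a sketch only: build_run_command and run_training are hypothetical names, and the helper just passes the five positional arguments to ./run.sh in the order shown above. The valid values for encodingType and config are not specified here, so the script takes them from the command line rather than assuming any.

```python
import subprocess
import sys


def build_run_command(lang, encoding_type, config, learning_rate, batch_size):
    """Assemble the ./run.sh call with its five positional arguments."""
    return ["./run.sh", lang, str(encoding_type), str(config),
            str(learning_rate), str(batch_size)]


def run_training(args, code_dir="code"):
    """Run ./run.sh from code_dir; raises CalledProcessError on failure."""
    return subprocess.run(build_run_command(*args), cwd=code_dir, check=True)


if __name__ == "__main__":
    # Usage (hypothetical file name):
    # python launch.py ma <encodingType> <config> <learningRate> <batch_size>
    if len(sys.argv) != 6:
        sys.exit("expected 5 arguments: lang encodingType config learningRate batch_size")
    run_training(sys.argv[1:6])
```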

Running inference and plotting the results

  • cd eval
  • Run ./eval.sh (it will take a while to execute)
  • The result plot image and HTML file should be generated in the present working directory

References

Contributors

babyitachi
