
token-rnn-tensorflow

Token-level RNN language model for any given code corpus.

Dependencies

  • Python 3.4.3 (may work with earlier versions)
  • pygments
    • sudo pip3 install Pygments
  • numpy
    • sudo pip3 install numpy
  • TensorFlow 1.0.0
    • sudo pip3 install tensorflow

Getting Started

Before training the language model on a code corpus, it is necessary to tokenize the code first. Assuming that the corpus is located in the directory corpus_dir and contains C code files, this can be done as follows

cd source
python3 utils/tokenize_corpus.py corpus_dir ".c" ../data/example/files
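
Pygments is listed as a dependency, so the tokenizer presumably lexes each source file with it. Below is a minimal sketch of that kind of token-level lexing; the exact behaviour and output format of utils/tokenize_corpus.py are assumptions here, not taken from its source.

from pygments.lexers import CLexer

code = 'int main(void) { return 0; }\n'
lexer = CLexer()

# Pygments yields (token_type, value) pairs; keep the raw token text and
# drop pure-whitespace tokens so each file becomes one token sequence.
tokens = [value for _, value in lexer.get_tokens(code) if value.strip()]

print(' '.join(tokens))   # e.g. int main ( void ) { return 0 ; }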

Running this command stores the tokenized files of the corpus in the directory ../data/example/files. Next, we need to convert this tokenized corpus into a single file that will be used as input to the language model. Following our example, this is done by

python3 utils/create_input_from_corpus.py ../data/example/files/ ".c" ../data/example/ .7 .15 .15 --vocab_size 100

Running this command will split the corpus into 70% training data, 15% validation data, and 15% testing data, and produce the RNN LM input file for each set. In addition, the corresponding token types and the files used in each split are logged. Note that you can list all of the arguments by passing -h to utils/create_input_from_corpus.py. In ../data/example you will find the following generated files.

files           test.txt         train.txt        valid.txt
rev             test_types.txt   train_types.txt  valid_types.txt
test_files.txt  train_files.txt  valid_files.txt

Since we specified a vocabulary size of 100, the top 100 most frequent tokens in the corpus will appear verbatim in train.txt, valid.txt, and test.txt, and all other tokens will be replaced by the <UNK> token. A vocab_size of -1 makes the vocabulary size equal to the number of unique tokens in the corpus.
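
For illustration, here is a minimal Python sketch of the kind of 70/15/15 split and <UNK> replacement described above. The function and its inputs are hypothetical and are not part of utils/create_input_from_corpus.py.

import collections
import random

def split_and_truncate(token_files, vocab_size=100, seed=0):
    # token_files: one list of tokens per tokenized source file (hypothetical input).
    random.Random(seed).shuffle(token_files)
    n_train = int(0.70 * len(token_files))
    n_valid = int(0.15 * len(token_files))
    train = token_files[:n_train]
    valid = token_files[n_train:n_train + n_valid]
    test = token_files[n_train + n_valid:]

    # Keep the vocab_size most frequent tokens in the corpus;
    # every other token is replaced by <UNK>.
    counts = collections.Counter(tok for f in token_files for tok in f)
    vocab = {tok for tok, _ in counts.most_common(vocab_size)}

    def replace_unknown(files):
        return [[t if t in vocab else '<UNK>' for t in f] for f in files]

    return replace_unknown(train), replace_unknown(valid), replace_unknown(test)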

Now we can train the model using the file train.txt as input. For brevity, many of the options for train.py are omitted here.

python3 train.py ../data/example/ ../save/example

If we wanted to train a reverse-reading language model, we would instead use

python3 train.py ../data/example/rev ../save/example/rev
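
The rev directory generated earlier presumably holds the same splits with each token sequence reversed. As a hedged illustration only, assuming one whitespace-separated token sequence per line (the actual files are produced by utils/create_input_from_corpus.py and may be formatted differently):

# Derive a reversed copy of train.txt, reversing the token order of each line.
with open('../data/example/train.txt') as fin, \
        open('../data/example/rev/train.txt', 'w') as fout:
    for line in fin:
        fout.write(' '.join(reversed(line.split())) + '\n')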

After training the model, we can generate code based on the language model by running

python3 sample.py ../save/example
