Giter VIP home page Giter VIP logo

spt-code's Introduction

SPT-Code

Requirements

Minimize requirements

The list of minimize requirements can be found in requirements.txt.

Additional requirements

If you need to reprocess the raw dataset, or use your own dataset, then you will also need to install the following packages.

tree_sitter==0.19.0
antlr4-python3-runtime==4.9.2

Besides, antlr4 need to be installed, installation guidance here.

If you encounter errors about my-languages.so when preprocessing the dataset, please run sources/data/asts/build_lib.py first.

Datasets and Tokenizers

We provide pre-processed datasets, saved as pickle binary files, which can be loaded directly as instances.

The pre-processed datasets can be downloaded here: (OneDrive, iCloud, GoogleDrive). Put the downloaded dataset pickle file into {dataset_root}/dataset_saved/ (default to.../dataset/dataset_saved), the program will automatically detect and use it.

It is also possible to use a custom dataset, simply by placing it in the specified location according to the relevant settings in the source code, or by modifying the corresponding dataset loading script in the source code. The dataset loading code is located in the sources/data/data.py and sources/data/data_utils.py files.

Pre-trained Tokenizers and Models

Custom tokenizers (we call "vocab") can be downloaded here: (OneDrive, iCloud, Google Drive). Extract it in a certain directory. Specific the argument trained_vocab of main.py where the tokenizers are located or put it in {dataset_root}/vocab_saved (default to.../dataset/vocab_saved).

You may pre-train SPT-Code by yourself. We also provide pre-trained models available here. Extract and put it in a directory, then specific the argument trained_model like tokenizers before.

Runs

Run main.py to start pre-train, fine-tune or test. All arguments are located in args.py, specific whatever you need.

Some example scripts are as following.

# pre-training
python main.py \
--do-pre-train \
--pre-train-tasks cap,mass,mng \
--batch-size 64 \
--eval-batch-size 64 \
--cuda-visible-devices 0,1,2,3 \
--fp16 \
--model-name pre_train

# summarization on pre-trained model and vocab
python main.py \
--do-fine-tune \
--task summarization \
--summarization-language java \
--model-name summarization_java \
--trained_vocab '../pre_trained/vocabs/' \
--trained_model '../pre_trained/models/all/'

# bug fixing without pre-training
python main.py \
--do-fine-tune \
--train-from-scratch \
--task bug_fix \
--bug_fix_scale medium

# only test on translation
python main.py \
--only-test \
--task translation \
--translation-source-language java \
--translation-target-language c_sharp \
--trained_vocab '../pre_trained/vocabs/' \
--trained_model '../outputs/translation_java_c_sharp_20210826_052653/models/'

spt-code's People

Contributors

nougatca avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.