
codeprep's Introduction

Codeprep

This is a tool for preprocessing source code corpora according to a specified vocabulary modeling choice.

Supported modeling choices are:

  • Splitting algorithm (no identifier splitting, camel-case splitting, snake-case splitting, BPE (byte-pair-encoding), number-splitting, ronin: http://joss.theoj.org/papers/10.21105/joss.00653);
  • Number of merges if using BPE;
  • Ignoring/preserving string literals;
  • Ignoring/preserving comments;
  • Preserving case/lowercasing;
  • Preserving/ignoring newlines and tabs;
  • Applying/not applying stemming after basic splitting.

Getting started

Make sure you have Python >= 3.6 installed on your system and that pip, setuptools, and wheel are up to date:

python --version
python -m pip install --upgrade pip setuptools wheel

Install the codeprep library:

pip install codeprep

In order to run the ronin algorithm, you will additionally have to install the Spiral module (https://github.com/casics/spiral/):

pip install git+https://github.com/casics/spiral.git

The tool can be used as a Python library as well as a standalone module runnable with a CLI. You can pass either the path to a dataset or the text itself to be preprocessed. When using the Python API, for the former option you need to import methods from the codeprep.api.corpus module, for the latter from codeprep.api.text. Below you can see the general patterns of usage.

Python API

>>> import codeprep.api.text as cp
>>> cp.<command>('Some code to be split')
>>> import codeprep.api.corpus as cp
>>> cp.<command>('/path/to/the/dataset')

CLI

codeprep <command> "Some code to be split"
codeprep <command> --path /path/to/the/dataset

Hereafter we demonstrate the usage as a Python library. The CLI is analogous to the Python API; you can find the documentation on how to use it here.

Usage examples

Basic splitting

Tokenization + CamelCase- and snake_case- splitting:

>>> import codeprep.api.text as cp
>>> input_code = '''void test_WordUeberraschungPrinter() {
...     if (eps >= 0.345e+4) { // FIXME
...         printWord("     ...     Überraschung");
...     }
... }'''
>>> cp.basic(input_code)
['void', '<w>', 'test', '_', 'Word', 'Ueberraschung', 'Printer', '</w>', '(', ')', '{', '\n', 
'\t', 'if', '(', 'eps', '>', '=', '0', '.', '<w>', '345', 'e', '</w>', '+', '4', ')', '{', '/', '/', 'FIXME', '\n', 
'\t', '\t', '<w>', 'print', 'Word', '</w>', '(', '"', '\t', '.', '.', '.', '\t', 'Überraschung', '"', ')', ';', '\n', 
'\t', '}', '\n', 
'}']

Tokenize but don't split identifiers

>>> import codeprep.api.text as cp
>>> input_code = '''void test_WordUeberraschungPrinter() {
...     if (eps >= 0.345e+4) { // FIXME
...         printWord("     ...     Überraschung");
...     }
... }'''
>>> cp.nosplit(input_code)
['void', 'test_WordUeberraschungPrinter', '(', ')', '{', '\n', 
'\t', 'if', '(', 'eps', '>', '=', '0', '.', '345e', '+', '4', ')', '{', '/', '/', 'FIXME', '\n', 
'\t', '\t', 'printWord', '(', '"', '\t', '.', '.', '.', '\t', 'Überraschung', '"', ')', ';', '\n', 
'\t', '}', '\n', 
'}']

BPE (Byte-Pair encoding)

The following code does camelCase- and snake_case-splitting and applies BPE with 10k merges on top:

>>> import codeprep.api.text as cp
>>> input_code = '''void test_WordUeberraschungPrinter() {
...     if (eps >= 0.345e+4) { // FIXME
...         printWord("     ...     Überraschung");
...     }
... }'''
>>> cp.bpe(input_code, bpe_codes_id='10k')
['v', 'oid</t>', 'test_', 'Word', 'U', 'eb', 'err', 'as', 'ch', 'un', 'g', 'Print', 'er</t>', '(</t>', ')</t>', '{</t>', '\n', 
'\t', 'i', 'f</t>', '(</t>', 'e', 'ps</t>', '></t>', '=</t>', '0</t>', '.</t>', '34', '5', 'e</t>', '+</t>', '4</t>', ')</t>', '{</t>', '/</t>', '/</t>', 'FIX', 'M', 'E</t>',  '\n', 
'\t', '\t', 'print', 'Word</t>', '(</t>', '"</t>', '\t', '.</t>', '.</t>', '.</t>', '\t', 'Ü', 'b', 'err', 'as', 'ch', 'un', 'g</t>', '"</t>', ')</t>', ';</t>', '\n', 
'\t', '}</t>', '\n', 
'}</t>']

codeprep by default does BPE using BPE codes learned on the GitHub Java Corpus. The argument bpe_codes_id='10k' tells codeprep to use 10,000 BPE merges. Other possible values are 1k and 5k (1,000 and 5,000 merges respectively). Please refer to the section Learning custom BPE codes to train custom BPE codes.

For other commands and options like chars, --split-numbers, --ronin, --stem, please refer to the docs.

Calculate vocabulary

Set the calc_vocab param to True when calling a preprocessing method to calculate the vocabulary of the preprocessed corpus, e.g.:

>>> import codeprep.api.corpus as cp
>>> cp.basic('/path/to/train/on', calc_vocab=True)
...
Vocab is available at /path/to/vocab

Learning custom BPE codes

If you don't want to use pre-trained BPE codes, it's possible to train custom ones. For example, to train 10,000 merges on the corpus located at /path/to/train/on, run the following command (CLI only):

codeprep learn-bpe 10000 -p /path/to/train/on --id custom-bpe-codes 

Now it is possible to do BPE splitting by running the bpe command with any number of merges from 0 to 10,000 (for example, with 3,500 merges):

codeprep bpe custom-bpe-codes-3500 -p /path/to/preprocess 

Before the BPE codes are trained, basic preprocessing is performed, which can also be tuned with the arguments described in the section Tweaking preprocessing; see the sketch below.
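
For example, to ignore string literals and comments while learning the codes (a sketch, assuming the --no-str and --no-com switches described under Tweaking preprocessing are also accepted by learn-bpe, as the paragraph above suggests):

codeprep learn-bpe 10000 -p /path/to/train/on --id custom-bpe-codes --no-str --no-com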

Additional options

Tweaking preprocessing

You can pass the following parameters with a True value (the default for all of them is False) to tweak the way the input is preprocessed:

  • no_str - replace strings with placeholders.
  • no_com - replace comments with placeholders.
  • no_spaces - remove newlines and tabs.
  • no_unicode - replace words containing non-ascii characters with placeholders.
  • no_case - lowercase words and encode information about case in tokens.

>>> import codeprep.api.text as cp
>>> input_code = '''void test_WordUeberraschungPrinter() {
...     if (eps >= 0.345e+4) { // FIXME
...         printWord("     ...     Überraschung");
...     }
... }'''
>>> cp.basic(input_code, no_spaces=True, no_unicode=True, no_case=True, no_com=True, no_str=True)
['void', '<w>', 'test', '_', '<Cap>', 'word', '<Cap>', 'ueberraschung', '<Cap>', 'printer', '</w>', '(', ')', '{', 
'if', '(', 'eps', '>', '=', '0', '.', '<w>', '345', 'e', '</w>', '+', '4', ')', '{', '/', '/', '<CAPS>', 'fixme', 
'<w>', 'print', '<Cap>', 'word', '</w>', '(', '"', '.', '.', '.', '<Cap>', '<non-en>', '"', ')', ';', 
'}', 
'}']

The same parameters can be specified as the switches --no-str, --no-com, --no-spaces, --no-unicode, and --no-case in CLI commands.
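
For example, the preprocessing from the snippet above, expressed via the CLI (a sketch that follows the general CLI pattern shown earlier, using the switches listed here):

codeprep basic "Some code to be split" --no-spaces --no-unicode --no-case --no-com --no-str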

Specifying the language

Unless explicitly specified, codeprep will assume the language is Java. To make sure the input is preprocessed as intended, it is highly recommended to always specify it:

>>> import codeprep.api.text as cp
>>> cp.bpe("volatile", '1k')
['volatile']
>>> cp.bpe("volatile", '1k', extension="py")
['v', 'ol', 'a', 'ti', 'le</t>']
# Since 'volatile' is a keyword in Java, it is represented as a single token, unlike in Python,
# where it is rarely used as an identifier and is therefore represented as multiple subtokens.

When preprocessing a corpus, codeprep identifies the language based on the file extension. If you want only files with (a) certain extension(s) to be preprocessed, you can specify the --ext param:

codeprep basic --path /path/to/be/preprocessed --ext "java"

# or if you want to pre-process multiple types of files: 
codeprep basic --path /path/to/be/preprocessed --ext "java|c|py|js"

Miscellaneous

You can specify the path to where the preprocessed corpus will be written:

codeprep basic --path /path/to/preprocess --output-path /path/to/output

To print logs with log level DEBUG and higher to stdout:

codeprep basic --path /path/to/preprocess --verbose

Getting Help

To get help on commands and options:

codeprep --help

Paper

This library was built to run experiments for our paper accepted at ICSE 2020: Big Code != Big Vocabulary: Open-Vocabulary Models for Source Code.

If you use the library or the results, please cite the paper:

@article{karampatsis2020big,
 title={Big Code != Big Vocabulary: Open-Vocabulary Models for Source Code},
 author={Karampatsis, Rafael-Michael and Babii, Hlib and Robbes, Romain and Sutton, Charles and Janes, Andrea},
 journal={arXiv preprint arXiv:2003.07914},
 year={2020}
}

Advanced

Caching

When preprocessing a dataset, codeprep first parses the source code and converts it into an internal representation, which is then converted to a preprocessed dataset according to the provided parameters. The intermediate representation is cached, so that when the same dataset is preprocessed again with different parameters, codeprep (provided no changes have been made to the dataset) uses the cache rather than parsing the source code again.

To store the cache, codeprep uses the directory $XDG_CACHE_HOME/codeprep/<codeprep_version> if the $XDG_CACHE_HOME environment variable is set, and $HOME/.cache/codeprep/<codeprep_version> otherwise.

Removing the cache will not change the final result; it will, however, make the preprocessing slower.
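
A minimal sketch of how the cache location is resolved (assuming the version is exposed as codeprep.__version__, as stated in the 1.0.0-alpha.7 release notes below; the helper name is illustrative, not part of the library's API):

import os
import codeprep

def cache_dir() -> str:
    # $XDG_CACHE_HOME takes precedence; otherwise fall back to $HOME/.cache
    base = os.environ.get('XDG_CACHE_HOME') or os.path.join(os.path.expanduser('~'), '.cache')
    return os.path.join(base, 'codeprep', codeprep.__version__)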

Releases

1.0.3

  • Add more flexibility with versions of dependencies

1.0.1

  • Fix training custom bpe codes (Thanks to @mir-am)
  • Fix corpus pre-processing on Windows

1.0.0

  • DOI assigned

1.0.0-alpha.12

  • Bugfixes and minor improvements

1.0.0-alpha.11 (NOT backward compatible with 1.0.0-alpha.10)

  • Include token types in the metadata
  • Expand on token type hierarchy
  • Make it possible to return the full token index in the iterator

1.0.0-alpha.10 (NOT backward compatible with 1.0.0-alpha.9)

  • Add boundaries of comments to pre-processing metadata
  • Add Windows and OSx support
  • Switch from unittest to pytest+doctest
  • Bugfixes related to literal presentation of tokens on the disk
  • Bugfixes related to adding </t> to mark the end of a full token

1.0.0-alpha.9 (NOT backward compatible with 1.0.0-alpha.7)

  • Add get_corpus_size() method to PreprocessedCorpus class
  • Do BPE splitting without splitting by convention first
  • Use </t> to mark the last sub-token of a token
  • Replacing non-ascii sequences with a special char
  • Follow symlinks when reading a dataset
  • Make it possible to preserve case when doing stemming
  • Bugfixes

1.0.0-alpha.7 (NOT backward compatible with 1.0.0-alpha.6)

  • Store version in codeprep.__version__
  • Implement --full-strings and --max-str-length options
  • Replace the ronin method/command with the --ronin option and apply the ronin algorithm on the word level instead of the full-identifier level
  • If the split_numbers option is set to True, split numbers not only in code but also in strings and comments
  • Change placeholder values to be more human-readable
  • Improve logging output
  • Bugfixes

1.0.0-alpha.6

Initial PyPI release

codeprep's People

Contributors

dependabot[bot], hlibbabii, mir-am

codeprep's Issues

OSX support

Hello!

Thank you for your work, it is amazing!
My name is Maksim Zubkov, and I am an intern at JetBrains Research. I tried to use your tool for BPE tokenization of C++ code and got the following error:

OSError: Calculation of vocabulary is not supported on OSX.

Could you please explain why it is not currently supported, and what I can do to use your lib on macOS?

Thank you!

Create PreppedTokenSequence class to encapsulate getting full tokens from subtokens

The tasks for the new PreppedTokenSequence class are to encapsulate getting full tokens from subtokens (which is currently done by the FullTokenIterator class) and at the same time to provide transparent access to the subtokens.

Motivation:

  • To get the full tokens, the user won't have to know about FullTokenIterator. This functionality can be provided by PreppedTokenSequence directly
  • ModelContext class is not really needed anymore

Provisional API:

>>> prepped_tokens = api.bpe("getName(", "5k")
>>> prepped_tokens
['get', 'Name', '</t>', '(']
>>> prepped_tokens.metadata.token_types
[SplitContainer, OpeningBracket]
>>> prepped_tokens.metadata.n_subtokens_per_token
[3, 1]
>>> prepped_tokens.full_tokens()
[['get', 'Name', '</t>'], ['(']]
>>> prepped_tokens.full_tokens(formatter=lambda s: ''.join(s))
['getName</t>', '(']

PreprocessingMetadata enhancement

  • Rename PreprocessingMetadata -> PreppedTokenMetadata
  • Represent the word_boundaries field as a list of the number of subtokens in each token, e.g.
    [1, 3, 1, 2] instead of [0, 1, 4, 5, 7] (see the sketch below)
  • Remove the non-processible tokens field. Return non-processible tokens as a separate object
  • Provide a method for returning the metadata for the last tokens:
>>> metadata.for_last_tokens(n: int)
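
A hypothetical helper illustrating the proposed representation change (the function is illustrative, not part of the codebase):

def boundaries_to_counts(word_boundaries):
    # old representation: cumulative subtoken boundaries, e.g. [0, 1, 4, 5, 7]
    # new representation: number of subtokens per token, e.g. [1, 3, 1, 2]
    return [b - a for a, b in zip(word_boundaries, word_boundaries[1:])]

assert boundaries_to_counts([0, 1, 4, 5, 7]) == [1, 3, 1, 2]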

Why use bytes instead of str for paths (Windows)?

def preprocess_and_write(params: Tuple[bytes, bytes, PrepConfig, str], bpe_data: Optional[BpeData] = None):

I am working with this repository on Windows.

I found that when I use Unicode characters such as Chinese in a path like "./文档/", to_repr.py encodes this string to bytes, which causes an exception.

A bytes path like b'\xe6\x96\x87\xe6\xa1\xa3.py' (which means "文档.py") is interpreted on Windows as a nested folder, and the Python built-in function os.path.basename does not recognize it. When writing the metadata to a file, this raises a file-or-directory-not-found exception.

I changed the path to str to avoid this exception, but I don't know if there are any other side effects.

Enhance `ParsedToken` hierarchy

  • Rename SplitContainer to Identifier
  • Make Identifier abstract and extend it with SingleWordIdentifier, TwoWordIdentifier, ThreeWordIdentifier, FourOrMoreWordIdentifier
  • Make other classes that have sub-classes abstract
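
A minimal sketch of the proposed hierarchy (only the class names come from this issue; the bodies are placeholders):

from abc import ABC

class ParsedToken(ABC):
    """Base class for all parsed tokens."""

class Identifier(ParsedToken, ABC):  # renamed from SplitContainer
    """Abstract base class for identifiers."""

class SingleWordIdentifier(Identifier): ...
class TwoWordIdentifier(Identifier): ...
class ThreeWordIdentifier(Identifier): ...
class FourOrMoreWordIdentifier(Identifier): ...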
