astminer

A library for mining path-based representations of code (and more), supported by the Machine Learning Methods for Software Engineering group at JetBrains Research.

Supported languages of the input:

  • Java
  • Python
  • C/C++
  • JavaScript (beta) (see issue)

Version history

0.5

  • Beta JavaScript support
  • Storage of ASTs in DOT format
  • Minor fixes

0.4

  • Support of code2vec output format
  • Extraction of ASTs and path-based representations of individual methods
  • Extraction of data for the task of method name prediction (code2vec paper)

0.3

0.2

  • Mining of ASTs

0.1

About

Astminer is an offspring of an internal utility from our ongoing research project.

Currently it supports extraction of:

  • Path-based representations of files
  • Path-based representations of methods
  • Raw ASTs

Supported languages are Java, Python, C/C++, and JavaScript (beta). The library is designed so that support for further languages is easy to add.

For the output format, see the section below.

Usage

Use as CLI

The CLI and its description live in a subfolder of the repository. It can be extended if needed.

Integrate in your mining pipeline

Import

Astminer is available from a Bintray repository. To add the dependency, put the following in your build.gradle file:

repositories {
    maven {
        url "https://dl.bintray.com/egor-bogomolov/astminer"
    }
}

dependencies {
    implementation 'io.github.vovak.astminer:astminer:0.5'
}

If you use build.gradle.kts:

repositories {
    maven(url = "https://dl.bintray.com/egor-bogomolov/astminer/")
}

dependencies {
    implementation("io.github.vovak.astminer:astminer:0.5")
}

Examples

If you want to use astminer as a library in your Java/Kotlin based data mining tool, check the following examples:

Please consider trying Kotlin for your data mining pipelines: in our experience, it is much better suited to building data collection and transformation tools.

Output format

For path-based representations, astminer supports two output formats. In both of them, it stores four .csv files:

  1. node_types.csv contains numeric ids and the corresponding node types with directions (up/down, as described in the paper);
  2. tokens.csv contains numeric ids and the corresponding tokens;
  3. paths.csv contains numeric ids and AST paths in the form of space-separated sequences of node type ids;
  4. path_contexts.csv contains labels and sequences of path contexts (triples of two tokens and a path between them).
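To illustrate how the files refer to each other, here is a minimal Java sketch that turns one path value from paths.csv (a space-separated sequence of node type ids) into readable node types. It assumes the contents of node_types.csv have already been loaded into a map; the class name PathDecoder, the method decode, and the sample ids and direction labels are hypothetical and not part of astminer.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of astminer.
class PathDecoder {

    // Resolves a space-separated sequence of node type ids (one path value
    // from paths.csv) against an id -> node type lookup.
    static List<String> decode(String path, Map<Long, String> nodeTypes) {
        List<String> names = new ArrayList<>();
        String trimmed = path.trim();
        if (trimmed.isEmpty()) {
            return names; // an empty path has no nodes
        }
        for (String idText : trimmed.split("\\s+")) {
            long id = Long.parseLong(idText);
            String name = nodeTypes.get(id);
            if (name == null) {
                throw new IllegalArgumentException("Unknown node type id: " + id);
            }
            names.add(name);
        }
        return names;
    }

    public static void main(String[] args) {
        // Sample data invented for illustration; the real ids and the textual
        // form of the direction come from node_types.csv.
        Map<Long, String> nodeTypes = new HashMap<>();
        nodeTypes.put(1L, "SimpleName UP");
        nodeTypes.put(2L, "MethodDeclaration UP");
        nodeTypes.put(3L, "Block DOWN");

        System.out.println(decode("1 2 3", nodeTypes));
    }
}
```

The same lookup pattern applies to token ids and tokens.csv.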

If the code2vec-compatible format is used, each line in path_contexts.csv starts with a label, followed by a sequence of space-separated triples. Each triple consists of a start token id, a path id, and an end token id, separated by commas.

If the csv format is used, each line in path_contexts.csv contains a label, then a comma, then a sequence of ;-separated triples. Each triple consists of a start token id, a path id, and an end token id, separated by spaces.
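The difference between the two layouts is easiest to see in code. The following Java sketch parses a single path_contexts.csv line in either format. The names PathContextLines, parseCode2vec, and parseCsv are hypothetical and not part of astminer. The sketch also assumes that labels contain no whitespace (code2vec layout) and no commas (csv layout), and that all ids fit in a long.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of astminer.
class PathContextLines {

    // One path context: start token id, path id, end token id.
    static final class PathContext {
        final long startTokenId;
        final long pathId;
        final long endTokenId;

        PathContext(long startTokenId, long pathId, long endTokenId) {
            this.startTokenId = startTokenId;
            this.pathId = pathId;
            this.endTokenId = endTokenId;
        }
    }

    // A label together with all path contexts found on its line.
    static final class LabeledContexts {
        final String label;
        final List<PathContext> contexts;

        LabeledContexts(String label, List<PathContext> contexts) {
            this.label = label;
            this.contexts = contexts;
        }
    }

    // code2vec-style layout: "label s,p,e s,p,e ..."
    static LabeledContexts parseCode2vec(String line) {
        String[] parts = line.trim().split("\\s+");
        List<PathContext> contexts = new ArrayList<>();
        for (int i = 1; i < parts.length; i++) {
            contexts.add(parseTriple(parts[i], ","));
        }
        return new LabeledContexts(parts[0], contexts);
    }

    // csv layout: "label,s p e;s p e;..."
    static LabeledContexts parseCsv(String line) {
        int comma = line.indexOf(',');
        String label = comma < 0 ? line.trim() : line.substring(0, comma).trim();
        List<PathContext> contexts = new ArrayList<>();
        if (comma >= 0) {
            for (String triple : line.substring(comma + 1).split(";")) {
                if (!triple.trim().isEmpty()) {
                    contexts.add(parseTriple(triple, "\\s+"));
                }
            }
        }
        return new LabeledContexts(label, contexts);
    }

    // Splits one triple on the given separator regex and parses its three ids.
    private static PathContext parseTriple(String text, String separatorRegex) {
        String[] fields = text.trim().split(separatorRegex);
        if (fields.length != 3) {
            throw new IllegalArgumentException("Expected 3 ids in triple: " + text);
        }
        return new PathContext(
                Long.parseLong(fields[0].trim()),
                Long.parseLong(fields[1].trim()),
                Long.parseLong(fields[2].trim()));
    }

    public static void main(String[] args) {
        // Sample lines invented for illustration.
        LabeledContexts a = parseCode2vec("getName 1,10,2 3,11,4");
        LabeledContexts b = parseCsv("getName,1 10 2;3 11 4");
        System.out.println(a.label + ": " + a.contexts.size() + " contexts");
        System.out.println(b.label + ": " + b.contexts.size() + " contexts");
    }
}
```

Both methods yield the same in-memory structure, so downstream code does not need to know which format a file was written in.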

Other languages

Support for a new programming language can be implemented in a few simple steps.

If there is an ANTLR grammar for the language:

  1. Add the corresponding ANTLR4 grammar file to the antlr directory;
  2. Run the generateGrammarSource Gradle task to generate the parser;
  3. Implement a small wrapper around the generated parser. See JavaParser or PythonParser for an example of a wrapper.

If the language has a parsing tool that is available as a Java library:

  1. Add the library as a dependency in build.gradle.kts;
  2. Implement a wrapper for the parsing tool. See FuzzyCppParser for an example of a wrapper.

Contribution

We believe that astminer could find use beyond our own mining tasks.

Please help make astminer easier to use by sharing your use cases. Pull requests are welcome as well. Support for other languages and documentation are the key areas of improvement.

Citing astminer

A paper dedicated to astminer (more precisely, to its older version PathMiner) was presented at MSR'19. If you use astminer in your academic work, please consider citing it.

@inproceedings{kovalenko2019pathminer,
  title={PathMiner: a library for mining of path-based representations of code},
  author={Kovalenko, Vladimir and Bogomolov, Egor and Bryksin, Timofey and Bacchelli, Alberto},
  booktitle={Proceedings of the 16th International Conference on Mining Software Repositories},
  pages={13--17},
  year={2019},
  organization={IEEE Press}
}
